A look at commuter data in Kansas City

Christiana Cromer October 26, 2020

Large Scale Machine Learning for Urban Planning


The Coiled team was recently joined by Brett Naul, founding engineer at Replica, where we discussed large-scale machine learning and travel simulations for urban planning. During this session, we learned more about:

  • Interactive products for urban planning,
  • Building synthetic populations from large data sets like the US census,
  • Data engineering workflow with Dask and Prefect, and
  • Pairing Google BigQuery and Dask.

You can watch the live stream below and read on for a full summary of this week’s edition of Science Thursday.

Replica uses data to empower urban planners with information about how people interact and move through cities. After learning about Brett’s background with Sidewalk Labs (Alphabet’s urban innovation organization) and Replica’s mission, we jumped into examples of just how Brett leverages tools like Dask, Prefect, and Google BigQuery for machine learning in urban planning. 

Interactive Tools for Urban Planning 

To start, we got an introduction to the tools Replica uses to model behavior in cities. Brett gave us the example of an urban planner who is considering a toll in the greater Kansas City area, and showed how Replica can model the different outcomes such a toll could lead to. Will the toll disproportionately impact lower-income areas? What does the flow of commuter traffic look like there? Answering questions like these helps the urban planner make the best and most equitable decision for their city.

A look at the data relevant to deciding whether to institute a toll on I-35 in Kansas City

Data Sources and Privacy

How Replica’s synthetic models address privacy concerns

Naturally, we wondered where Replica sources the data behind these insights, and what the privacy implications are for data sets of this scale. “All of our data comes out of models,” noted Brett. “We ingest a wide variety of data sources and we train machine learning models to predict what real people might do, and the predictions of those models feed the final outputs and various reports we generate.” No actual behaviors of real people appear in Replica’s output data, which Brett described as a selling point for privacy. The hope is that this additional layer of obfuscation doesn’t reduce the accuracy of the reports.

By aggregating information from sources like the US Census, Replica gets insight into the distributions of characteristics such as age, income, property ownership, and vehicle usage. At the same time, public-use Census records on individual households give Replica fine-grained information without allowing any specific person to be re-identified. Then, by training Bayesian networks, Replica combines all of this data into a representation of how these demographic variables relate to one another.
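To make the idea of sampling a synthetic population from a Bayesian network more concrete, here is a minimal sketch using only the Python standard library. It is not Replica’s code and does not use Pomegranate or Doppelganger: the three variables, the network structure (age, then income, then vehicle ownership), and every probability in the tables are invented for illustration.

```python
import random

# Hypothetical conditional probability tables. Every number below is
# invented for illustration and does not come from Census data.
P_AGE = {"18-34": 0.35, "35-64": 0.45, "65+": 0.20}
P_INCOME_GIVEN_AGE = {
    "18-34": {"low": 0.50, "mid": 0.40, "high": 0.10},
    "35-64": {"low": 0.20, "mid": 0.50, "high": 0.30},
    "65+": {"low": 0.40, "mid": 0.45, "high": 0.15},
}
P_VEHICLE_GIVEN_INCOME = {
    "low": {"no": 0.40, "yes": 0.60},
    "mid": {"no": 0.15, "yes": 0.85},
    "high": {"no": 0.05, "yes": 0.95},
}


def draw(table, rng):
    """Draw one value from a {value: probability} table."""
    values = list(table)
    weights = [table[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_person(rng):
    """Ancestral sampling: draw each variable given its parent's sampled value."""
    age = draw(P_AGE, rng)
    income = draw(P_INCOME_GIVEN_AGE[age], rng)
    owns_vehicle = draw(P_VEHICLE_GIVEN_INCOME[income], rng)
    return {"age": age, "income": income, "owns_vehicle": owns_vehicle}


def sample_population(n, seed=0):
    """Generate n synthetic people; none corresponds to a real individual."""
    rng = random.Random(seed)
    return [sample_person(rng) for _ in range(n)]


population = sample_population(1000)
```

Each synthetic person is drawn from the modeled distributions rather than copied from a record, which is the property that makes this style of data attractive from a privacy standpoint. In a real pipeline the tables would be learned from data rather than written by hand.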

Building synthetic populations using Pomegranate, Doppelganger, and Dask

Next, Brett demonstrated how Replica uses tools like Pomegranate, Doppelganger, and Dask to build a foundation of information that can then be used to answer the specific questions urban planners may have. “To get our final result from these models, we would submit one more function that generates a bunch of samples, basically like a CSV kind of record style representation of all of the people in a given neighborhood, and we would loop over that. We could combine them into a Dask DataFrame or some kind of collection. We can write them to a database or to cloud storage.”
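The submit-and-loop pattern Brett describes can be sketched as follows. To keep the example runnable without a cluster, it uses `concurrent.futures` from the standard library as a stand-in; Dask’s futures interface (`Client.submit`, `Future.result`) follows the same shape, so the assumption is that a Dask client would take the place of the executor in production. The neighborhood names, population sizes, and the `generate_samples` function are all hypothetical.

```python
import csv
import io
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical neighborhoods and population sizes, invented for illustration.
NEIGHBORHOODS = {"downtown": 3, "midtown": 2, "riverside": 4}


def generate_samples(neighborhood, n_people):
    """Stand-in for sampling a trained model: one record per synthetic person."""
    rng = random.Random(neighborhood)  # seeded by name so runs are repeatable
    return [
        {"neighborhood": neighborhood, "person_id": i, "age": rng.randint(18, 90)}
        for i in range(n_people)
    ]


# Submit one sampling job per neighborhood, then loop over the futures.
# With Dask, a distributed Client would play the role of the executor here.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(generate_samples, name, n_people)
        for name, n_people in NEIGHBORHOODS.items()
    ]
    records = [row for future in futures for row in future.result()]

# Combine everything into a single CSV-style collection of records.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["neighborhood", "person_id", "age"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Because each neighborhood is sampled independently, the jobs can run in parallel and be combined afterwards, whether the destination is an in-memory collection, a database, or cloud storage.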

Data Engineering Workflow with Dask and Prefect

Simulating real behavior for urban planning decisions in Lincoln, Nebraska 

Then, Brett dove into an example using the city of Lincoln, Nebraska to showcase how Replica makes modeling decisions about the way people commute to work and what infrastructure they use to get there. He showed how, using GraphHopper, an open source routing library, and Kepler.gl, a visualization library that came out of Uber, Replica can simulate the commuting behavior of people in this region with Dask. “This is really just data exploration, but in production when we’re going to train a model using this information, that’s the part that makes use of Prefect.” Once the commute summary statistics from this step are available, Prefect handles workflow management and orchestration on the same Dask infrastructure. Matt noted that he and the other Dask developers were relieved when Prefect was created, because they could stop worrying about the feature requests it addressed. For more context on Prefect, check out our past livestream with Chris White, CTO of Prefect.

To summarize: Brett has one Prefect flow that contains many tasks. Each of those tasks might create further Dask objects, and all of them run on a single Dask cluster.

Dask DataFrame load applying a function 

Google BigQuery and Dask 

Making behavior predictions by training models and utilizing large-scale distributed shuffles

The last example Brett showed us involved what he described as a novel interaction between Google BigQuery and Dask. “We’ve spent a lot of time investing in a kind of interface layer between BigQuery and Dask,” he said. Brett showed us a table of mobile location data for Lincoln, Nebraska, which Replica trains models on and then pairs with the synthetic population in order to make predictions about the overall population’s behavior. The Replica team found that BigQuery is really good at large-scale distributed shuffles, which group records by a key and bring all the data for that key together on a specific node for aggregation. “It’s really awesome to see tools like BigQuery take on standards like Arrow which allow for these sorts of unanticipated connections,” Matt noted.
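For readers who have not met the term, the group-by-key shuffle mentioned above can be illustrated with a small single-process sketch. It shows the general idea only, not how BigQuery or Dask implement it: the location pings are invented, the “nodes” are plain Python lists, and the choice of a CRC32 hash for routing is an assumption made so that the result is stable across runs.

```python
import zlib
from collections import defaultdict

# Hypothetical location pings as (device_id, minutes_observed); invented data.
PINGS = [
    ("dev-a", 5), ("dev-b", 3), ("dev-a", 7),
    ("dev-c", 2), ("dev-b", 4), ("dev-c", 1),
]


def node_for(key, n_nodes):
    """Route a key to a node with a stable hash, so equal keys always co-locate."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def shuffle(records, n_nodes):
    """Move every record to the node that owns its key."""
    nodes = [[] for _ in range(n_nodes)]
    for key, value in records:
        nodes[node_for(key, n_nodes)].append((key, value))
    return nodes


def aggregate(node_records):
    """Per-node aggregation: safe because each key lives on exactly one node."""
    totals = defaultdict(int)
    for key, value in node_records:
        totals[key] += value
    return dict(totals)


partitions = shuffle(PINGS, n_nodes=2)
per_node_totals = [aggregate(part) for part in partitions]

# No key appears on two nodes, so merging the per-node results is a plain union.
combined = {key: total for totals in per_node_totals for key, total in totals.items()}
```

The expensive part in a real system is the data movement inside `shuffle`, which is why handing that step to a service that does it well can pay off.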

Replica then takes all this mobile location data and transforms it into a more parsimonious and interesting representation of people’s mobility. 

The Kepler map of what this output looks like.

As we wrapped up this session, Matt put it well by saying “You used four different examples…and you’ve found a variety of ways of using Dask to solve a variety of problems, all of which were necessary to solve your end goal of understanding urban transit. It shows how many small problems you need to solve a big problem.” 

Turbocharge Your Data Science

A big thank you to Brett Naul for spending time with us! If you want to learn more about Replica and data science in urban planning, be sure to head over to their website.

We hope you enjoyed this recap!

You too can accelerate your data science today on a one-click hosted cluster with Coiled Cloud. Coiled also handles security, conda/docker environments, and team management, so you can get back to doing data science. Get started for free today on Coiled Cloud.