This blog explains how to perform a spatial join in Python. Knowing how to perform a spatial join is an important asset in your data-processing toolkit: it enables you to join two datasets based on spatial predicates. For example, you can join a point-based dataset with a polygon-based dataset based on whether the points fall within the polygon.
In this blog, we will use the New York City Taxi and Limousine Commission dataset to illustrate performing a spatial join in Python. We will join the taxi pickup location of each individual trip record to the NYC neighborhood it is located within. This will then allow us to bring in public health data at the neighborhood level and discover interesting patterns in the data that we could not have seen without performing a spatial join.
There are several different ways to perform a spatial join in Python. This blog will walk you through two popular open-source Python libraries you can use to perform spatial joins: GeoPandas and Dask-GeoPandas.
Dask-GeoPandas has recently undergone some impressive changes due to the hard work of maintainers Joris Van den Bossche, Martin Fleischmann and Julia Signell.
For each library, we will first demonstrate and explain the syntax, and then tell you when and why you should use it. You can follow along in this Jupyter notebook.
Let’s jump in.
GeoPandas extends pandas DataFrames to make it easier to work with geospatial data directly from Python, instead of using a dedicated spatial database such as PostGIS.
A GeoPandas GeoDataFrame looks and feels a lot like a regular pandas DataFrame. The main difference is that it has a geometry column that contains shapely geometries. This column is what enables us to perform a spatial join.
Sidenote: geometry is just the default name of this column; a geometry column can have any name. There can also be multiple geometry columns, but one is always considered the "primary" geometry column on which the geospatial predicates operate.
Let’s load in the NYC TLC data. We’ll start by working with the first 100K trips made in January 2009:
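A minimal sketch of loading the sample, assuming you’ve downloaded the January 2009 trip records to a local CSV (the filename below is a placeholder):

```python
import pandas as pd

# Placeholder filename: point this at your local copy of the January 2009 trip records
df = pd.read_csv("yellow_tripdata_2009-01.csv", nrows=100_000)
```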
The dataset requires some basic cleaning up which we perform in the accompanying notebook.
Next, let’s read in our geographic data. This is a shapefile sourced from the NYC Health Department and contains the shapely geometries of 35 New York neighborhoods. We’ll load this into a GeoPandas GeoDataFrame:
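Assuming the shapefile has been downloaded locally (the filename below is a placeholder), reading it is a one-liner:

```python
import geopandas as gpd

# Placeholder filename: point this at the neighborhood shapefile you downloaded
ngbhoods = gpd.read_file("nyc_neighborhoods.shp")
```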
Let’s inspect some characteristics of the ngbhoods object, including the object type, how many rows it contains, and the first 3 rows of data:
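For example:

```python
print(type(ngbhoods))  # geopandas.geodataframe.GeoDataFrame
print(len(ngbhoods))   # number of neighborhoods (rows)
ngbhoods.head(3)       # first 3 rows, including the geometry column
```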
ngbhoods is a GeoPandas GeoDataFrame containing the shapely polygons that define 35 New York neighborhoods, as defined by the United Hospital Fund, as well as their names and unique ID codes.
A neat trick to know about GeoPandas is that calling .plot() on a GeoDataFrame will automatically plot the geometries of the shapes contained:
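For example:

```python
# GeoPandas plots the active geometry column by default
ngbhoods.plot(figsize=(8, 8))
```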
When joining spatial datasets, it’s important to standardize the Coordinate Reference System (CRS). A CRS defines how coordinates (numbers like 42.01 or 180000) map to points on the surface of the Earth. Discussing why you’d pick a particular CRS is out of scope for this post, but you can visit these links to learn more. What is important is that the two datasets you want to join together have the same CRS. Joining datasets with mismatched CRSs will lead to incorrect results downstream.
The NYC TLC data uses latitude and longitude, with the WGS84 datum, which is the EPSG 4326 coordinate reference system. Let’s inspect the CRS of our ngbhoods GeoDataFrame:
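For example:

```python
# Show the coordinate reference system of the neighborhoods GeoDataFrame
ngbhoods.crs
```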
The CRSs don’t match, so we’ll have to transform one of them. It’s more efficient to transform a smaller rather than a larger dataset, so let’s project ngbhoods to EPSG 4326:
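A sketch of the reprojection:

```python
# Reproject the neighborhood polygons to WGS84 longitude/latitude
ngbhoods = ngbhoods.to_crs(epsg=4326)
```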
Sidenote: if you are working with multiple GeoDataFrames you can set the CRS of one gdf to that of another using gdf1.to_crs(gdf2.crs) without worrying about the specific CRS numbers.
Calling head() on ngbhoods now will reveal that the values for the geometry column have changed:
Plotting the GeoDataFrame confirms that our spatial information is still intact. But if you look at the axes, you’ll notice the coordinates are now in degrees latitude and longitude instead of meters northing and easting:
Next, let's take a look at the regular pandas DataFrame containing our taxi trip data. You'll notice that this DataFrame also contains geographic information in the pickup_longitude/latitude and dropoff_longitude/latitude columns.
Looking at the dtypes, however, reveals them to be regular numpy floats:
Calling plot() on the latitude and longitude columns does not give us a meaningful spatial representation on its own. Let’s see why raw latitude/longitude values are difficult to interpret without a geospatial library.
Start by plotting the latitude / longitude values:
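A sketch using the DataFrame’s built-in scatter plot:

```python
# Plot the raw coordinate columns as an ordinary (non-geospatial) scatter plot
df.plot(kind="scatter", x="pickup_longitude", y="pickup_latitude", s=1)
```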
This result is difficult to interpret, partly because there are clearly some data points with incorrect coordinate values in the dataset. This won’t be a problem for our particular use case, but may be worth paying attention to if you’re using this data for other analyses.
Let's turn df into a GeoPandas GeoDataFrame, so we can leverage the built-in geospatial functions that’ll make our lives a lot easier. Convert the separate pickup_longitude and pickup_latitude columns into a single geometry column containing shapely Point data:
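A sketch of the conversion, assuming the coordinate column names shown above:

```python
import geopandas as gpd

# Build shapely Points from the pickup coordinates and use them as the geometry column.
# The TLC coordinates are WGS84 longitude/latitude, i.e. EPSG:4326.
taxi_gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["pickup_longitude"], df["pickup_latitude"]),
    crs="EPSG:4326",
)
```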
Take a look at the first 3 rows:
We now have a GeoDataFrame with a dedicated geometry column.
Calling .explore() on a GeoDataFrame allows you to explore the dataset on a map. Run this notebook to interactively explore the map yourself:
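For example (requires the folium package):

```python
# Renders an interactive map of the pickup points in the notebook
taxi_gdf.explore()
```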
Now that both of your datasets have geometry columns, you can perform a spatial join.
Perform a left spatial join with taxi_gdf as the left_df and ngbhoods as the right_df. Setting the predicate keyword to 'within' will join points in the left_df to polygons from the right_df they are located within:
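A sketch of the join:

```python
# Left spatial join: keep every trip, attach the neighborhood its pickup point falls within
joined = gpd.sjoin(taxi_gdf, ngbhoods, how="left", predicate="within")
```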
Clean up and clarify some column names:
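For example, renaming the neighborhood-name column brought in from the right GeoDataFrame (the source column name below is hypothetical and depends on the shapefile’s schema):

```python
# "GEONAME" is a placeholder for the neighborhood-name column in your shapefile
joined = joined.rename(columns={"GEONAME": "pickup_ngbhood_name"})
```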
Now that you've performed the spatial join, you can analyze spatial patterns in the data.
For example, you can run a groupby that aggregates the mean trip_distance per pickup neighborhood:
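A sketch of the aggregation:

```python
res = (
    joined.groupby("pickup_ngbhood_name")["trip_distance"]
    .mean()
    .sort_values(ascending=False)
)
res
```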
```
pickup_ngbhood_name
Rockaway 18.640000
Southwest Queens 7.583590
Bayside - Meadows 5.461818
East New York 5.195714
Jamaica 4.933902
Washington Heights - Inwood 4.746627
Pelham - Throgs Neck 4.478235
Southeast Queens 4.322500
Kingsbridge - Riverdale 4.140000
Downtown - Heights - Slope 3.833111
Flushing - Clearview 3.761765
Sunset Park 3.612857
Ridgewood - Forest Hills 3.311461
Canarsie - Flatlands 3.290000
Greenpoint 3.153171
Long Island City - Astoria 3.055458
Coney Island - Sheepshead Bay 3.036667
West Queens 3.034350
Central Harlem - Morningside Heights 2.866855
Union Square, Lower Manhattan 2.721932
Bedford Stuyvesant - Crown Heights 2.643019
East Harlem 2.638558
Williamsburg - Bushwick 2.638261
South Bronx 2.622375
Northeast Bronx 2.350000
Chelsea - Village 2.317825
Borough Park 2.284468
Upper West Side 2.251054
Upper East Side - Gramercy 2.104211
East Flatbush - Flatbush 2.071429
Fordham - Bronx Park 2.032500
Bensonhurst - Bay Ridge 1.505625
Southern SI 0.500000
Northern SI 0.000000
Name: trip_distance, dtype: float64
```
Rockaway clearly jumps out as the pickup neighborhood with the longest trip distances in this dataset. For some context: Rockaway is a far-flung neighborhood that is home to a popular beach destination. Looks like lots of people are taking a cab from the beach 🏖
Let's visualize this data. Start with a simple plot coloring the pickups by neighborhood:
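A sketch of the plot:

```python
# Color each pickup point by the neighborhood it was joined to
joined.plot(column="pickup_ngbhood_name", markersize=0.5, figsize=(8, 8))
```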
Let’s also color the neighborhoods by their mean trip distance.
To do this, we'll join the res Series containing the mean trip_distance per neighborhood to the ngbhoods GeoDataFrame:
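A sketch of the merge (the "GEONAME" column name is a placeholder for the neighborhood-name column in your shapefile):

```python
# Regular (non-spatial) merge of the mean trip distances onto the neighborhood polygons
ngbhoods = ngbhoods.merge(
    res.rename("mean_trip_distance"),
    left_on="GEONAME",   # placeholder neighborhood-name column
    right_index=True,    # res is indexed by neighborhood name
)
```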
Note that this is a regular join and not a spatial join. We are simply merging the mean trip_distance values to the existing GeoDataFrame on non-spatial columns containing the neighborhood names.
Now create your plot:
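A sketch of the choropleth, using the column merged in the previous step:

```python
# Shade each neighborhood by its mean trip distance
ngbhoods.plot(column="mean_trip_distance", legend=True, figsize=(8, 8))
```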
GeoPandas is great. But because it simply extends the pandas architecture, it also inherits the same scalability issues that pandas has: pandas processes data in memory on a single core, which limits the amount of data you can work with. The general rule of thumb is that you need about 5x the size of your dataset in RAM.
This means that if you want to process the NYC TLC data for a 5-year period (~120GB on disk) you’d need about 600GB of RAM (!)
Most personal computers don't have the memory resources to process a dataset of that size locally and will throw a MemoryError:
Dask DataFrames allow you to scale your pandas workflows. Dask DataFrames overcome two key limitations of pandas:

- pandas DataFrames must fit entirely in memory
- pandas uses only a single CPU core

Dask DataFrames can run on datasets that are larger than memory and use all available cores by default for fast execution.
Dask-GeoPandas scales GeoPandas to larger-than-memory datasets using out-of-core computations. GeoPandas and Dask DataFrames are both extensions of the pandas DataFrame. Because Dask is flexible and written entirely in Python, the same approach Dask uses to scale pandas can also be applied to GeoPandas.
Let's see this in action. You can follow along in this notebook.
We're going to be working at scale now with the data for the five-year period from 2009-2013. This data totals ~120GB on disk. We’ll need some extra computing power to run this analysis. Let's scale out to a Dask cluster with Coiled.
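A sketch of the cluster creation; the exact keyword arguments may vary between Coiled versions:

```python
import coiled

# 50 workers with 16 GiB of RAM each (argument names may differ across Coiled versions)
cluster = coiled.Cluster(
    n_workers=50,
    worker_memory="16GiB",
)
```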
The code above spins up a cluster of 50 workers with 16GiB of RAM each, for a total of 800GB of RAM. That should be plenty to comfortably process the 120GB dataset we'll be working with. You’ll need to sign up for a Coiled account to run this code.
Now connect your local Dask session to the Coiled cluster so that all future Dask computations are run in the cloud instead of locally:
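For example:

```python
from dask.distributed import Client

# Route all subsequent Dask computations to the Coiled cluster
client = Client(cluster)
```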
Now load the full 120GB dataset into a Dask DataFrame:
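A sketch of the load, assuming the trip records are stored as Parquet in an S3 bucket (the path below is a placeholder):

```python
import dask.dataframe as dd

# Placeholder path: substitute the location of your 2009-2013 trip records
ddf = dd.read_parquet("s3://my-bucket/nyc-tlc/yellow/2009-2013/")
```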
Your data is now loaded into a Dask DataFrame. Dask follows the pandas API you're familiar with, so working with a Dask DataFrame will look and feel a lot like pandas:
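For example:

```python
# Familiar pandas syntax; compute() triggers execution on the cluster
ddf["trip_distance"].mean().compute()
```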
Note that Dask operates using lazy evaluation which means you need to explicitly trigger computations using a call to compute(). Read The Beginner's Guide to Distributed Computing to learn more about these key concepts.
We now have the two datasets we want to combine: a small GeoPandas GeoDataFrame containing the neighborhood-level health data, and the 120GB of taxi cab data loaded into a Dask DataFrame.
Dask parallelizes Python code. It generally does this by turning large operations into smaller chunks that can be processed in parallel. This parallelism gives you two important benefits: you can perform computations faster and on datasets that are larger than your local memory.
In a Dask DataFrame these smaller chunks are called partitions. Each Dask DataFrame partition is a pandas DataFrame.
In a Dask-GeoPandas GeoDataFrame, each partition is a GeoPandas GeoDataFrame.
The first step is to convert your Dask DataFrame into a Dask-GeoPandas GeoDataFrame:
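A sketch of the conversion using the dask_geopandas helpers, assuming the same pickup coordinate columns as before:

```python
import dask_geopandas

# Build a lazy geometry column from the pickup coordinates...
ddf["geometry"] = dask_geopandas.points_from_xy(
    ddf, "pickup_longitude", "pickup_latitude"
)

# ...then convert to a Dask-GeoPandas GeoDataFrame, where each partition
# is a GeoPandas GeoDataFrame
taxi_dgdf = dask_geopandas.from_dask_dataframe(ddf, geometry="geometry")
```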
Our Dask DataFrame now has a geometry column.
Next, we'll have to set the CRS to match that of the `health` GeoDataFrame:
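A sketch, assuming the health GeoDataFrame is in WGS84 (EPSG:4326), which matches the longitude/latitude pickup coordinates:

```python
# Declare (not reproject) the CRS of the pickup points
taxi_dgdf = taxi_dgdf.set_crs("EPSG:4326")
```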
Before we perform the spatial join on our Coiled cluster, we need to make sure that all of the workers in our cluster have a copy of the health data. This is currently not the case because we loaded it into a regular GeoPandas DataFrame; it's sitting in local memory and our cloud cluster does not have access to it.
Then perform the spatial join:
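One possible approach (a sketch, not necessarily the notebook's exact code) is to wrap the small health GeoDataFrame in a single-partition Dask-GeoPandas GeoDataFrame so it travels to the workers along with the task graph, then join:

```python
import dask_geopandas

# Single-partition copy of the neighborhood-level health data
health_dgdf = dask_geopandas.from_geopandas(health, npartitions=1)

# Lazy spatial join: each taxi partition is joined against the neighborhoods
joined = dask_geopandas.sjoin(taxi_dgdf, health_dgdf, predicate="within")
```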
Dask holds off on actually performing the join until we explicitly call compute() on a result downstream.
Let's perform a groupby on the joined data:
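A sketch, with hypothetical column names (they depend on the schema of the joined data):

```python
# Mean trip distance per pickup neighborhood over the full five years;
# compute() triggers execution on the cluster
result = joined.groupby("pickup_ngbhood_name")["trip_distance"].mean().compute()
```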
This groupby query on over 120GB of data runs in less than two minutes. Combining the parallel power of Dask-GeoPandas and Coiled can lead to massive speed-ups in processing your geospatial data. This is particularly useful when performing spatial joins with large datasets.
- Beginner's Guide to Distributed Computing