Perform a Spatial Join in Python

July 11, 2022

This blog explains how to perform a spatial join in Python. Knowing how to perform a spatial join is an important asset in your data-processing toolkit: it enables you to join two datasets based on spatial predicates. For example, you can join a point-based dataset with a polygon-based dataset based on whether the points fall within the polygon.
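As a minimal sketch of the idea (the toy coordinates and column names below are made up purely for illustration), a points-within-polygons join with GeoPandas looks like this:

import geopandas as gpd
from shapely.geometry import Point, Polygon

# toy point data (e.g. pickups) and polygon data (e.g. neighborhoods)
points = gpd.GeoDataFrame(
    {"trip_id": [1, 2]},
    geometry=[Point(0.5, 0.5), Point(2.5, 2.5)],
    crs="EPSG:4326",
)
zones = gpd.GeoDataFrame(
    {"zone_name": ["zone_a"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs="EPSG:4326",
)

# keep each point together with the polygon it falls within
gpd.sjoin(points, zones, how="left", predicate="within")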

                              GeoPandas    Dask-GeoPandas
Perform Spatial Join             ✅              ✅
Support Geographic Data          ✅              ✅
Parallel Processing              ❌              ✅
Larger-than-memory Datasets      ❌              ✅

In this blog, we will use the New York City Taxi and Limousine Commission dataset to illustrate performing a spatial join in Python. We will join the taxi pickup location of each individual trip record to the NYC neighborhood it is located within. This will then allow us to bring in public health data at the neighborhood level and discover interesting patterns in the data that we could not have seen without performing a spatial join.

Spatial Join: Different Python Libraries

There are several different ways to perform a spatial join in Python. This blog will walk you through two popular open-source Python libraries you can use to perform spatial joins:

  1. GeoPandas
  2. Dask-GeoPandas

Dask-GeoPandas has recently undergone some impressive changes due to the hard work of maintainers Joris Van den Bossche, Martin Fleischmann and Julia Signell.

For each library, we will first demonstrate and explain the syntax, and then tell you when and why you should use it. You can follow along in this Jupyter notebook.

Let’s jump in.

Spatial Join: GeoPandas

GeoPandas extends pandas DataFrames to make it easier to work with geospatial data directly from Python, instead of using a dedicated spatial database such as PostGIS.

A GeoPandas GeoDataFrame looks and feels a lot like a regular pandas DataFrame. The main difference is that it has a geometry column that contains shapely geometries. This column is what enables us to perform a spatial join.

Sidenote: geometry is just the default name of this column; a geometry column can have any name. There can also be multiple geometry columns, but one is always considered the "primary" geometry column, which is the one the geospatial predicates operate on.
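For example, here is a hypothetical GeoDataFrame with two geometry columns (the column names pickup_geom and dropoff_geom are made up for illustration); only the column registered as the primary geometry is used by spatial operations:

import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {
        "pickup_geom": gpd.GeoSeries([Point(0, 0)]),
        "dropoff_geom": gpd.GeoSeries([Point(1, 1)]),
    },
    geometry="pickup_geom",  # register the primary geometry column
)

gdf.geometry.name                       # 'pickup_geom'
gdf = gdf.set_geometry("dropoff_geom")  # switch the primary geometry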

Let’s load in the NYC TLC data. We’ll start by working with the first 100K trips made in January 2009:

import pandas as pd
df = pd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2009-01.csv",
    nrows=100_000
)

The dataset requires some basic cleaning up which we perform in the accompanying notebook.
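The exact steps live in the notebook; as a rough sketch of the kind of clean-up involved (the filters below are illustrative assumptions, not the notebook's exact code), you might drop rows with missing or clearly invalid pickup coordinates:

# illustrative clean-up: drop rows with missing or out-of-range pickup coordinates
df = df.dropna(subset=["pickup_longitude", "pickup_latitude"])
df = df[
    df["pickup_longitude"].between(-180, 180)
    & df["pickup_latitude"].between(-90, 90)
]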

Next, let’s read in our geographic data. This is a shapefile sourced from the NYC Health Department and contains the shapely geometries of 35 New York neighborhoods. We’ll load this into a GeoPandas GeoDataFrame:

import geopandas as gpd
ngbhoods = gpd.read_file(
    "../data/CHS_2009_DOHMH_2010B.shp"
)[["geometry", "FIRST_UHF_", "UHF_CODE"]]

Let’s inspect some characteristics of the ngbhoods object, including the object type, how many rows it contains, and the first 3 rows of data:

type(ngbhoods)
geopandas.geodataframe.GeoDataFrame

len(ngbhoods)
35

ngbhoods.head(3)

ngbhoods is a GeoPandas GeoDataFrame containing the shapely polygons that define the 35 New York neighborhoods defined by the United Hospital Fund (UHF), as well as their names and unique ID codes.

A neat trick to know about GeoPandas is that calling .plot() on a GeoDataFrame will automatically plot the geometries it contains:

ngbhoods.plot()

When joining spatial datasets, it’s important to standardize the Coordinate Reference System (CRS). A CRS defines how coordinates (numbers like 42.01 or 180000) map to points on the surface of the Earth. Discussing why you’d pick a particular CRS is out of scope for this post, but you can visit these links to learn more. What is important is that the two datasets you want to join have the same CRS; joining datasets with mismatched CRSs will lead to incorrect results downstream.
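A quick defensive check like the sketch below (left_gdf and right_gdf are placeholders for your two GeoDataFrames) can save you from silently wrong joins:

# make sure both sides of the join share a CRS; reproject one side if not
if left_gdf.crs != right_gdf.crs:
    right_gdf = right_gdf.to_crs(left_gdf.crs)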

The NYC TLC data uses latitude and longitude, with the WGS84 datum, which is the EPSG 4326 coordinate reference system. Let’s inspect the CRS of our ngbhoods GeoDataFrame:

ngbhoods.crs
<Derived Projected CRS: EPSG:2263>

The CRSs don’t match, so we’ll have to transform one of them. It’s more efficient to transform a smaller rather than a larger dataset, so let’s project ngbhoods to EPSG 4326:

ngbhoods = ngbhoods.to_crs(epsg=4326)

Sidenote: if you are working with multiple GeoDataFrames you can set the CRS of one gdf to that of another using gdf1.to_crs(gdf2.crs) without worrying about the specific CRS numbers.

Calling head() on ngbhoods now reveals that the values in the geometry column have changed, and plotting the GeoDataFrame confirms that our spatial information is still intact. If you look at the axes, you’ll notice the coordinates are now in degrees of latitude and longitude instead of meters of northing and easting.
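The same calls we used earlier show both:

ngbhoods.head(3)
ngbhoods.plot()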

Next, let's take a look at the regular pandas DataFrame containing our taxi trip data. You'll notice that this DataFrame also contains geographic information in the pickup_longitude/latitude and dropoff_longitude/latitude columns.

df.head(3)

Looking at the dtypes, however, reveals them to be regular numpy floats:

df.dtypes

vendor_id              object
pickup_datetime        object
dropoff_datetime       object
passenger_count         int64
trip_distance         float64
pickup_longitude      float64
pickup_latitude       float64
rate_code               int64
store_and_fwd_flag     object
dropoff_longitude     float64
dropoff_latitude      float64
payment_type           object
fare_amount           float64
surcharge             float64
mta_tax               float64
tip_amount            float64
tolls_amount          float64
total_amount          float64
dtype: object

Calling plot() on the raw latitude and longitude columns does not give us a meaningful spatial representation; latitude / longitude values are difficult to interpret without a geospatial library. Let’s see for ourselves by plotting the pickup coordinates directly:

df.plot(
    kind='scatter', x="pickup_longitude", y="pickup_latitude"
)

This result is difficult to interpret, partly because there are clearly some data points with incorrect coordinate values in the dataset. This won’t be a problem for our particular use case, but may be worth paying attention to if you’re using this data for other analyses.

Let's turn df into a GeoPandas GeoDataFrame, so we can leverage the built-in geospatial functions that’ll make our lives a lot easier. Convert the separate pickup_longitude and pickup_latitude columns into a single geometry column containing shapely Point data:

from geopandas import points_from_xy
# turn df into a GeoDataFrame with a Point geometry column
taxi_gdf = gpd.GeoDataFrame(
    df, crs="EPSG:4326",
    geometry=points_from_xy(
        df["pickup_longitude"], df["pickup_latitude"]
    ),
)

Take a look at the first 3 rows:

taxi_gdf.head(3)

We now have a GeoDataFrame with a dedicated geometry column.
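A quick check confirms that the new column is registered as the active geometry and that the CRS we passed was picked up:

taxi_gdf.geometry.name   # 'geometry'
taxi_gdf.crs             # EPSG:4326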

Calling .explore() on a GeoDataFrame allows you to explore the dataset on a map. Run this notebook to interactively explore the map yourself:

taxi_gdf.explore(
    tiles="CartoDB positron", # use "CartoDB positron" tiles
    cmap="Set1", # use "Set1" matplotlib colormap
    style_kwds=dict(color="black") # use black outline
)

Now that both of your datasets have geometry columns, you can perform a spatial join.

Perform a left spatial join with taxi_gdf as the left_df and ngbhoods as the right_df. Setting the predicate keyword to 'within' will join points in the left_df to polygons from the right_df they are located within:

%%timeit
joined = gpd.sjoin(
    taxi_gdf, 
    ngbhoods, 
    how='left',
    predicate="within",
)
1.13 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Clean up and clarify some column names:

# drop the index_right column added by sjoin
joined = joined.drop(columns=['index_right'])

# rename columns
joined.rename(
    columns={"FIRST_UHF_": "pickup_ngbhood_name", "UHF_CODE": "pickup_ngbhood_id"},
    inplace=True,
)

Now that you've performed the spatial join, you can analyze spatial patterns in the data.

For example, you can run a groupby that aggregates the mean trip_distance per pickup neighborhood:

res = joined.groupby('pickup_ngbhood_name').trip_distance.mean().sort_values()
res

pickup_ngbhood_name
Rockaway                                18.640000
Southwest Queens                         7.583590
Bayside - Meadows                        5.461818
East New York                            5.195714
Jamaica                                  4.933902
Washington Heights - Inwood              4.746627
Pelham - Throgs Neck                     4.478235
Southeast Queens                         4.322500
Kingsbridge - Riverdale                  4.140000
Downtown  - Heights - Slope              3.833111
Flushing - Clearview                     3.761765
Sunset Park                              3.612857
Ridgewood - Forest Hills                 3.311461
Canarsie - Flatlands                     3.290000
Greenpoint                               3.153171
Long Island City - Astoria               3.055458
Coney Island - Sheepshead Bay            3.036667
West Queens                              3.034350
Central Harlem - Morningside Heights     2.866855
Union Square, Lower Manhattan            2.721932
Bedford Stuyvesant - Crown Heights       2.643019
East Harlem                              2.638558
Williamsburg - Bushwick                  2.638261
South Bronx                              2.622375
Northeast Bronx                          2.350000
Chelsea - Village                        2.317825
Borough Park                             2.284468
Upper West Side                          2.251054
Upper East Side - Gramercy               2.104211
East Flatbush - Flatbush                 2.071429
Fordham - Bronx Park                     2.032500
Bensonhurst - Bay Ridge                  1.505625
Southern SI                              0.500000
Northern SI                              0.000000
Name: trip_distance, dtype: float64

Rockaway clearly jumps out as the pickup neighborhood with the longest average trip distance in this dataset. For some context: Rockaway is a far-flung neighborhood known for its beaches. It looks like lots of people are taking a long cab ride home from the beach 🏖

Let's visualize this data. Start with a simple plot coloring the pickups by neighborhood:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 8))
joined.plot(ax=ax, alpha=0.5, markersize=2, column="pickup_ngbhood_name", legend=False)
plt.title("Pickups Colored by Neighborhood", fontsize=20);

Let’s also color the neighborhoods by their mean trip distance.

To do this, we'll join the res Series containing the mean trip_distance per neighborhood to the ngbhoods GeoDataFrame:

gdf_res = ngbhoods.merge(res, left_on="FIRST_UHF_", right_index=True)

Note that this is a regular join and not a spatial join. We are simply merging the mean trip_distance values to the existing GeoDataFrame on non-spatial columns containing the neighborhood names.

Now create your plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 8))
gdf_res.plot(
    ax=ax,
    alpha=0.5,
    markersize=2,
    column="trip_distance",
    legend=True,
    cmap="plasma",
)

plt.title("Average Trip Distance by Departure Neighborhood", fontsize=20);

Spatial Join: Scaling Limitations with pandas

GeoPandas is great. But because it simply extends the pandas architecture, it also inherits the same scalability issues that pandas has: pandas processes data on a single core and has to hold it in memory, which limits the amount of data you can work with. The general rule of thumb is that you need about 5x the size of your dataset in RAM.

This means that if you want to process the NYC TLC data for a 5-year period (~120GB on disk), you’d need about 600GB of RAM (!)
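As a back-of-the-envelope check using the rule of thumb above:

# rough estimate of the RAM needed to process the full dataset with pandas
dataset_size_on_disk_gb = 120   # ~120GB of NYC TLC data for 2009-2013
rule_of_thumb_factor = 5        # pandas typically needs ~5x the data size in RAM
print(dataset_size_on_disk_gb * rule_of_thumb_factor)  # 600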

Most personal computers don't have the memory resources to process a dataset of that size locally and will throw a MemoryError:

df = pd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/*",
)

Dask DataFrames allow you to scale your pandas workflows. Dask DataFrames overcome two key limitations of pandas:

  • pandas cannot run on datasets larger than memory
  • pandas only uses one core when running analyses, which can be slow

Dask DataFrames are designed to overcome these pandas limitations. They can be run on datasets that are larger than memory and use all cores by default for fast execution.
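As a small illustrative sketch (reusing the same January 2009 CSV from earlier; the blocksize value is just an example), reading it with Dask splits the file into partitions that can be processed in parallel and never have to fit in memory all at once:

import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2009-01.csv",
    blocksize="64MB",  # each 64MB block becomes one partition
)
ddf.npartitions  # number of chunks Dask can process in parallel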

Spatial Join: Scaling up with Dask-GeoPandas

Dask-GeoPandas scales GeoPandas to larger-than-memory datasets using out-of-core computations. GeoPandas and Dask DataFrames are both extensions of pandas DataFrame. Because Dask is flexible and fully written in Python, the same way Dask scales pandas can also be applied to GeoPandas.

Let's see this in action. You can follow along in this notebook.

We're going to be working at scale now with the data for the five-year period from 2009-2013. This data totals ~120GB on disk. We’ll need some extra computing power to run this analysis. Let's scale out to a Dask cluster with Coiled.

import coiled

cluster = coiled.Cluster(
    name="spatial-join",
    software="coiled-examples/spatial-join",
    n_workers=50,
    worker_memory="16GiB",
)

The code above spins up a cluster of 50 workers with 16GiB of RAM each, for a total of 800GiB of RAM. That should be plenty to comfortably process the 120GB dataset we'll be working with. You’ll need to sign up for a Coiled account to run this code.

Now connect your local Dask session to the Coiled cluster so that all future Dask computations are run in the cloud instead of locally:

from distributed import Client
client = Client(cluster)

Now load the full 120GB dataset into a Dask DataFrame:

import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/*",
    engine="pyarrow",
    storage_options={"anon": True},
)

Your data is now loaded into a Dask DataFrame. Dask follows the pandas API you're familiar with, so working with a Dask DataFrame will look and feel a lot like pandas:

ddf.head()

ddf.groupby("passenger_count").tip_amount.mean().compute()
passenger_count
0      0.810627
1      0.988850
2      0.892365
3      0.811833
4      0.699543
         ...   
66     1.500000
177    1.000000
247    2.300000
249    0.000000
254    0.000000
Name: tip_amount, Length: 62, dtype: float64

Note that Dask operates using lazy evaluation, which means you need to explicitly trigger computations with a call to compute(). Read The Beginner's Guide to Distributed Computing to learn more about these key concepts.
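For example, the groupby above builds up a task graph without touching any data; only the final compute() call ships work to the cluster:

# lazy: Dask only records the operations to perform
mean_tip = ddf.groupby("passenger_count").tip_amount.mean()

# eager: the cluster reads the data and runs the aggregation
mean_tip.compute()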

We now have the two datasets we want to combine: the small GeoPandas GeoDataFrame containing the neighborhood-level data and the 120GB of taxi cab data loaded into a Dask DataFrame.

Dask parallelizes Python code. It generally does this by turning large operations into smaller chunks that can be processed in parallel. This parallelism gives you two important benefits: you can perform computations faster and on datasets that are larger than your local memory. 

In a Dask DataFrame these smaller chunks are called partitions. Each Dask DataFrame partition is a pandas DataFrame.

In a Dask-GeoPandas GeoDataFrame, each partition is a GeoPandas GeoDataFrame.
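You can spot-check this yourself; before the conversion below each partition comes back as a pandas DataFrame, and after it the same check returns a GeoPandas GeoDataFrame (note that get_partition pulls one full partition back to the client):

ddf.npartitions                        # number of partitions in the Dask DataFrame
type(ddf.get_partition(0).compute())   # pandas.core.frame.DataFrame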

The first step is to convert your Dask DataFrame into a Dask-GeoPandas GeoDataFrame:

import dask_geopandas

ddf = dask_geopandas.from_dask_dataframe(
    ddf,
    geometry=dask_geopandas.points_from_xy(ddf, "pickup_longitude", "pickup_latitude"),
)

ddf.head(3)

Our data is now a Dask-GeoPandas GeoDataFrame with a dedicated geometry column.

Next, we'll have to set the CRS to match that of the ngbhoods GeoDataFrame:

ddf = ddf.set_crs(4326)

Before we perform the spatial join on our Coiled cluster, every worker needs access to the neighborhood data. ngbhoods is a regular GeoPandas GeoDataFrame sitting in local memory, but Dask takes care of distributing it for us: when we pass it to sjoin, it is included in the task graph and a copy is shipped to the workers as needed.

Now perform the spatial join:

%%time
joined = ddf.sjoin(ngbhoods, predicate="within")
CPU times: user 23.4 ms, sys: 2.26 ms, total: 25.7 ms
Wall time: 24.2 ms

Dask holds off on actually performing the join until we explicitly call compute() on a result downstream.

Let's perform a groupby on the joined data:

%%time
joined.groupby("nhbd_name").tip_amount.mean().compute()
CPU times: user 666 ms, sys: 56.2 ms, total: 722 ms
Wall time: 1min 28s

nhbd_name
Bayside - Meadows                       1.523970
Bedford Stuyvesant - Crown Heights      1.170296
Bensonhurst - Bay Ridge                 1.043649
Borough Park                            0.970536
Canarsie - Flatlands                    1.027048
Central Harlem - Morningside Heights    1.014322
Chelsea - Village                       0.875355
Coney Island - Sheepshead Bay           0.949707
Downtown  - Heights - Slope             1.246141
East Flatbush - Flatbush                0.928637
East Harlem                             0.798915
East New York                           1.082279
Flushing - Clearview                    0.914412
Fordham - Bronx Park                    0.898696
Greenpoint                              1.172317
Jamaica                                 2.158574
Kingsbridge - Riverdale                 1.054347
Long Island City - Astoria              0.909555
Northeast Bronx                         0.971505
Northern SI                             1.056143
Pelham - Throgs Neck                    0.915107
Ridgewood - Forest Hills                0.807874
Rockaway                                1.371856
South Bronx                             0.751648
Southeast Queens                        1.222044
Southern SI                             1.339840
Southwest Queens                        2.373951
Sunset Park                             1.137424
Union Square, Lower Manhattan           1.025017
Upper East Side - Gramercy              0.813856
Upper West Side                         0.809046
Washington Heights - Inwood             1.309627
West Queens                             0.993301
Williamsburg - Bushwick                 1.118900
Name: tip_amount, dtype: float64

This groupby query over more than 120GB of data runs in less than two minutes. Combining the parallel power of Dask-GeoPandas and Coiled can lead to massive speed-ups when processing your geospatial data, which is particularly useful when performing spatial joins on large datasets.

Spatial Join: the Final Verdict

  • Spatial joins are powerful because they allow you to join datasets based on geographic data.
  • GeoPandas is great for performing a spatial join on datasets that fit into memory.
  • Dask-GeoPandas scales GeoPandas for larger-than-memory datasets using the parallel processing power of Dask.
  • You can speed up a spatial join by running your computations in the cloud using a Coiled cluster. Get started with Coiled by creating an account here.
                              GeoPandas    Dask-GeoPandas    on Coiled
Perform Spatial Join             ✅              ✅              ✅
Support Geographic Data          ✅              ✅              ✅
Parallel Processing              ❌              ✅              ✅
Larger-than-memory Datasets      ❌              ✅              ✅
Scale to the cloud               ❌              ❌              ✅

Read Next

- Beginner's Guide to Distributed Computing
