Scaling up Geospatial Data Science with Distributed Computing

Pavithra Eswaramoorthy March 24, 2021



We were recently joined by Brendan Collins, founder and principal at makepath, for a livestream on Scaling up Geospatial Data Science with Distributed Computing. The makepath team specializes in spatial data science challenges and is a passionate supporter of open source GIS software. Brendan talked to Hugo Bowne-Anderson, Head of Marketing and Evangelism at Coiled, about how geospatial tools that run on top of Dask can help us scale to larger problems.

Check out the livestream recap:

In this post, we will:

  • Introduce Xarray-Spatial, a Python library for geospatial analysis,
  • Learn about some Xarray-Spatial tools, and
  • Discuss areas where Xarray-Spatial tools can be used.

Imagine you are analyzing the distribution of agriculture and you want to summarize the crops in a county, or you are allocating energy resources for a city, or you are working on election polls; Xarray-Spatial provides a set of high-level tools that help you to work on these types of challenges. Read on to learn more!

Fundamental Types of Geospatial Data

Geospatial data has a geographic or location element to it, like maps and addresses. In the geospatial community, two fundamental data types are used to represent different phenomena in the world: vector and raster.

Vectors are used to describe phenomena that are discrete, like points, lines, and polygons. For example, in a typical map, rivers and roads can be represented using discrete lines, and cities can be represented using points. Vectors are defined using basic coordinates (x, y, z) and lists of coordinates.

Rasters are used to describe continuous phenomena, like rainfall, elevation, soil types, and more. These cannot be comfortably described with vectors. Rasters are fundamentally grids with a resolution and an origin, which lets us apply linear algebra and matrix operations to the data.

To draw some analogies, Adobe Illustrator is a vector tool, whereas Adobe Photoshop is a raster tool. Similarly, an SVG file is a vector format, whereas a PNG (or JPG) file is a raster format. Just as a PNG image is a grid of pixels, we can think of a raster as a grid whose cells hold data values to analyze instead of pixel colors.
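To make the distinction concrete, here is a minimal sketch in plain Python. The `Raster` class, its field names, and the coordinate values are hypothetical and chosen only for illustration; real libraries such as Xarray store this information differently.

```python
from dataclasses import dataclass

# Vector data: discrete features defined by coordinates (illustrative values).
city = (25.0, 5.0)                                 # a point as (x, y)
road = [(0.0, 0.0), (10.0, 5.0), (25.0, 5.0)]      # a line as a list of points


@dataclass
class Raster:
    """A grid of values plus the georeferencing needed to locate each cell."""
    values: list        # rows of cell values; row 0 is the top row
    origin: tuple       # (x, y) of the top-left corner of the grid
    resolution: float   # cell size in map units

    def cell_center(self, row, col):
        """Return the (x, y) map coordinate at the center of a cell."""
        x0, y0 = self.origin
        x = x0 + (col + 0.5) * self.resolution
        y = y0 - (row + 0.5) * self.resolution  # y decreases as rows go down
        return (x, y)


elevation = Raster(
    values=[[120.0, 125.0, 130.0],
            [118.0, 122.0, 128.0]],
    origin=(0.0, 20.0),
    resolution=10.0,
)
center = elevation.cell_center(1, 2)
```

The vector features carry their own coordinates, while the raster only needs an origin and a resolution to place every cell.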

Xarray-Spatial for Geospatial Analysis in Python

Xarray-Spatial is an open-source library for raster-based spatial analysis in Python. It scales all its analysis functions, both horizontally and vertically, using Dask and Numba. The tools provided by Xarray-Spatial include 1-D classification tools and focal analytics tools. In focal analytics, you pass a filter (or a kernel) over an entire image to summarize and create a new image based on some criteria. Smoothing filters and mean filters are good examples of focal analytics. You can find the complete list of tools at makepath/xarray-spatial.
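As a rough illustration of focal analytics, the sketch below applies a mean filter to a small grid in plain Python. It is not Xarray-Spatial's implementation: the function name `focal_mean` and its border handling (averaging only the neighbors that fall inside the grid) are assumptions made for this example.

```python
def focal_mean(raster, radius=1):
    """Mean filter: each output cell is the average of its neighborhood.

    Cells near the border average only the neighbors inside the grid.
    """
    n_rows, n_cols = len(raster), len(raster[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            # Collect the square window centered on (r, c), clipped to the grid.
            window = [
                raster[i][j]
                for i in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for j in range(max(0, c - radius), min(n_cols, c + radius + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


grid = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
smoothed = focal_mean(grid)
```

The single spike in the center is spread across its neighbors, which is the smoothing effect a mean filter is meant to produce.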

Xarray-Spatial is a flat and functional library. The team is building on existing tools like Datashader, Xarray, Dask, Numba, CuPy, and many more to give geospatial users intuitive names while using tools from the broader community. This multidisciplinary aspect of Xarray-Spatial makes it extremely valuable to the geospatial community.

Direct dependencies: Numba, Datashader, Xarray, CuPy
Xarray-Spatial dependency graph. Source: makepath/xarray-spatial

Xarray-Spatial provides a wide variety of helpful tools to geospatial users. Brendan demonstrates these tools in detail in the livestream. You can follow along with the examples in the Xarray-Spatial user guide!

Using Xarray-Spatial: Surface Tools

Surface tools allow us to visualize terrains and landforms. We start by creating a sample raster using the generate_terrain function and use the shade function (from Datashader) to render and colormap this raster.

Then, we create an elevation colormap and illuminate the terrain with hillshade from a specific altitude and direction. We also calculate the slope of the terrain, which produces another Xarray DataArray that we can continue analyzing with Xarray or NumPy.
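To show what a slope calculation does to a grid, here is a minimal plain-Python sketch using central differences. The function name `slope_degrees` and the `cell_size` parameter are hypothetical, and the library's own kernel and edge handling may differ.

```python
import math


def slope_degrees(elevation, cell_size=1.0):
    """Slope in degrees for interior cells; border cells are left as NaN."""
    n_rows, n_cols = len(elevation), len(elevation[0])
    out = [[math.nan] * n_cols for _ in range(n_rows)]
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            # Rate of change of elevation in the x and y directions.
            dz_dx = (elevation[r][c + 1] - elevation[r][c - 1]) / (2 * cell_size)
            dz_dy = (elevation[r + 1][c] - elevation[r - 1][c]) / (2 * cell_size)
            # Steepness is the angle of the combined gradient.
            out[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return out


# A ramp that rises one unit per cell from left to right.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
slopes = slope_degrees(ramp)
```

On this ramp, every interior cell rises one unit per unit of distance, so the slope there is 45 degrees.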

Another interesting surface tool is viewshed, which can be used to highlight the areas that an observer can see from any particular point on the terrain, as shown below:

Example of calculating viewshed using the observer’s location. Source: user_guide/1_Surface.ipynb
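The core idea behind a viewshed is a line-of-sight test. The sketch below is a simplified, one-dimensional version along a single terrain profile; the function name and the `observer_height` parameter are assumptions for illustration, and a full viewshed repeats this kind of test across a 2-D raster.

```python
import math


def visible_along_profile(heights, observer_height=0.0):
    """Return, for each cell of a profile, whether the observer at cell 0 sees it.

    A cell is visible when the sight line to it is at least as steep as the
    sight line to every cell in between.
    """
    eye = heights[0] + observer_height
    visible = [True]              # the observer's own cell
    max_slope = -math.inf         # steepest sight line seen so far
    for i in range(1, len(heights)):
        slope = (heights[i] - eye) / i
        visible.append(slope >= max_slope)
        max_slope = max(max_slope, slope)
    return visible


profile = [10.0, 5.0, 12.0, 8.0, 20.0]
seen = visible_along_profile(profile, observer_height=2.0)
```

In this profile, the cell behind the ridge at index 2 is hidden, while the taller cell at the end rises back into view.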

Using Xarray-Spatial: Proximity Tools

Proximity tools help us work with distances. We start with a pandas DataFrame and visualize the points using Datashader. We then use proximity to create an Xarray DataArray in which each pixel holds the distance to its nearest point. The distance metric can be specified in the function call.
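A proximity grid can be sketched with a brute-force search in plain Python. The names `proximity_grid`, `euclidean`, and `manhattan` are hypothetical helpers written for this example, not the library's API, and distances here are measured in cell units.

```python
import math


def euclidean(dx, dy):
    return math.hypot(dx, dy)


def manhattan(dx, dy):
    return abs(dx) + abs(dy)


def proximity_grid(n_rows, n_cols, targets, metric=euclidean):
    """For each cell, compute the distance to the nearest target cell.

    `targets` is a list of (row, col) positions.
    """
    return [
        [min(metric(c - tc, r - tr) for tr, tc in targets) for c in range(n_cols)]
        for r in range(n_rows)
    ]


targets = [(0, 0), (2, 3)]
dist = proximity_grid(3, 4, targets)
dist_manhattan = proximity_grid(3, 4, targets, metric=manhattan)
```

Swapping the `metric` argument changes how distance is measured without touching the rest of the computation, which mirrors the idea of choosing the distance metric in the function call.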

Make sure to also check out proximity allocation and proximity direction!

Example of creating a proximity grid. Source: user_guide/2_Proximity.ipynb

Using Xarray-Spatial: Zonal Tools

Zonal tools allow you to summarize a set of value rasters by a set of zones. Continuing with the sample terrain, suppose you're planning a hike and have marked a hiking path (segment) for each day. We can calculate summary statistics for each segment using zonal tools. We can also write custom zonal functions!
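The idea behind zonal statistics, including a custom function, can be sketched in plain Python as follows. The function `zonal_stats`, the zone ids, and the `relief` statistic are hypothetical and only illustrate the grouping step; they are not the library's interface.

```python
from collections import defaultdict
from statistics import mean


def zonal_stats(zones, values, stat_funcs):
    """Group value cells by zone id and apply each statistic to every group.

    Cells whose zone id is None belong to no zone and are skipped.
    """
    grouped = defaultdict(list)
    for zone_row, value_row in zip(zones, values):
        for zone_id, value in zip(zone_row, value_row):
            if zone_id is not None:
                grouped[zone_id].append(value)
    return {
        zone_id: {name: func(cells) for name, func in stat_funcs.items()}
        for zone_id, cells in grouped.items()
    }


# Zone raster: 1 = day-one segment, 2 = day-two segment, None = off the trail.
zones = [[1, 1, 2],
         [1, 2, 2],
         [None, 2, 2]]
elevation = [[100, 110, 200],
             [120, 210, 220],
             [50, 230, 240]]

stats = zonal_stats(
    zones,
    elevation,
    {
        "mean": mean,
        "max": max,
        "relief": lambda cells: max(cells) - min(cells),  # custom statistic
    },
)
```

Because statistics are passed in as ordinary functions, adding a custom one is just a matter of adding another entry to the dictionary.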

Example of Zonal Statistics. Source: user_guide/3_Zonal.ipynb

Using Xarray-Spatial: Classification Tools and Pathfinding

Classification tools allow us to summarize a raster by grouping its values into bins. In the brief example, we apply a quantile binning strategy to elevation. Pathfinding tools can be used to find optimal paths between nodes in a graph. Brendan says pathfinding tools are slightly academic at the moment, but there is active work to create optimal pathfinding for Dask.
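Quantile binning can be sketched in plain Python as below. This is an assumption-laden illustration: `quantile_classify` is a hypothetical name, and the break points are taken directly from the sorted values, whereas a library may interpolate between them.

```python
import math
from bisect import bisect_left


def quantile_classify(raster, k=4):
    """Assign each cell a class 0..k-1 so classes hold roughly equal counts."""
    flat = sorted(v for row in raster for v in row)
    n = len(flat)
    # Upper bound of each bin, taken from the sorted values.
    breaks = [flat[math.ceil(i * n / k) - 1] for i in range(1, k + 1)]
    # A cell's class is the first bin whose upper bound is >= its value.
    classes = [[bisect_left(breaks, v) for v in row] for row in raster]
    return classes, breaks


elevation = [[10, 80, 30, 60],
             [50, 20, 70, 40]]
classes, breaks = quantile_classify(elevation, k=4)
```

With eight cells and four bins, each class ends up with two cells, regardless of how the elevation values are spaced.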

Example of Pathfinding. Source: user_guide/9_Pathfinding.ipynb
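For readers new to pathfinding on rasters, here is a minimal sketch that treats a grid as a graph and runs Dijkstra's algorithm over it. This technique is swapped in for illustration and may not be the algorithm the library uses; the function name, the 4-connected neighborhood, and the convention that cells equal to 1 are barriers are all assumptions.

```python
import heapq


def shortest_path(grid, start, goal):
    """Dijkstra on a 4-connected grid; cells equal to 1 are impassable.

    Returns the list of (row, col) cells from start to goal, or None.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    best = {start: 0}       # lowest known cost to reach each cell
    came_from = {}          # previous cell on the best known path
    queue = [(0, start)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if cost > best[cell]:
            continue        # stale queue entry
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < n_rows and 0 <= nc < n_cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    if goal not in best:
        return None
    # Walk backwards from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(came_from[path[-1]])
    return path[::-1]


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = shortest_path(grid, (0, 0), (2, 0))
```

The barrier in the middle row forces the path to go around through the right-hand column rather than straight down.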

What excites Brendan about the Future of Xarray-Spatial?

Brendan is excited to see Xarray-Spatial tools being applied to real-world problems, where Xarray-Spatial is just a small piece in the background. Specifically, Brendan hopes to see Xarray-Spatial being adopted in the Pangeo community – a project that promotes open, reproducible, and scalable geoscience. Thank you for joining us, Brendan!

At Coiled, we love to see Dask being used in scientific applications. Xarray-Spatial is a wonderful example of how Dask has become infrastructural and is used behind the scenes in many important tools. These applications inspire us to continue developing Dask and supporting the Dask community. We're currently working on a product to make Dask cloud deployments easy. Check it out below!