Democratizing Satellite Imagery Analysis with Dask
May 11, 2021
Pictures from space are interesting, but they can be difficult to access and analyze. Gabe Joseph, an open source engineer at Coiled and the developer of stackstac, joined us for a webinar to discuss how Dask and stackstac, a library that makes cloud-native geoprocessing easier, can make global-scale satellite imagery more accessible to the general public.
In this blog post, we highlight:
- The challenges of accessing and analyzing satellite images;
- The new open standards: COGs and STACs;
- What is stackstac, and how to use it to view satellite images live on an interactive map!
Satellite Imagery – Introduction and Challenges
Satellite images provide a fascinating perspective of Earth. These images are valuable for numerous scientific purposes, including remote sensing, but they can be difficult to use and analyze. In the mid-to-late 1900s, satellite pictures were dropped from space in film buckets and caught with an airplane. Today, getting satellite images is easier than that, but it’s still tricky in a different way.
There is a lot of remote sensing and earth observation data available right now. For example, the Google Earth Engine Catalog has over 30 petabytes of data and is growing at a rate of almost a petabyte per month! This is more data than we can practically work with, or even reasonably use. The current paradigm of “download the files you need” doesn’t scale in such cases.
A fundamental challenge while accessing satellite data is that different data providers (NASA, European Space Agency, etc.) do things differently. They share data at different places, they tile things differently, at different resolutions, and with different metadata. This causes many inconsistencies, and users need to figure out every small detail from scratch each time. Adding to this, using GDAL for analysis is also not straightforward. GDAL is a very powerful library, but we need to know all its tiny quirks to use it effectively.
“There are so many interesting things you can learn from earth observation data, that we just don’t do because it’s such a pain to do it! No one wants to deal with that, no one wants to do that.”
This has been changing with new open “cloud-native” standards: Cloud Optimized GeoTIFFs (COGs) and SpatioTemporal Asset Catalogs (STACs). These standards aim to solve some of the challenges mentioned above, and they are making it easier to work with geospatial data.
Cloud Optimized GeoTIFFs
Cloud Optimized GeoTIFF (COG) is a way to format GeoTIFFs such that accessing the necessary data becomes more efficient. Let’s break this term down:
- GeoTIFF is a standard format for representing geospatial raster data (rasters are pictures with geolocation). It is a familiar format that geospatial tools can consume and produce.
- Cloud Optimized refers to an internal organization of the file that makes it easy to download only the parts of the data that you require, instead of the entire dataset.
Gabe shares an analogy of how a COG is like a grocery list, where the list has items for each section like dairy products or vegetables. On the contrary, the classically formatted GeoTIFF is like a scavenger hunt – imagine finding a product, and the next item to find is the third ingredient used in this product, and so on. COGs make it easier for software tools to work with geospatial data over the internet.
COGs also split data into tiles, unlike the traditional format where we have to download an entire row even if we require only a small part of it. Gabe shares another analogy:
“If you go to Costco and buy a 5-gallon bucket of mayonnaise for a recipe where you need only a teaspoon, that’s wasteful, instead, you can just buy as much as you need!”
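To put rough numbers on the difference between rows and tiles, here is a back-of-the-envelope sketch in plain Python. The scene size, the window, the one-row strips, and the 512-pixel tile size are all illustrative assumptions rather than properties of any particular dataset, and the arithmetic ignores compression and partial tiles at the image edge.

```python
def strip_pixels_fetched(window, image_width, rows_per_strip=1):
    """Pixels downloaded when the file is stored as full-width strips."""
    row_off, _col_off, height, _width = window
    first = row_off // rows_per_strip
    last = (row_off + height - 1) // rows_per_strip
    # Every strip spans the full image width, wanted or not.
    return (last - first + 1) * rows_per_strip * image_width


def tile_pixels_fetched(window, tile_size=512):
    """Pixels downloaded when the file is stored as square tiles."""
    row_off, col_off, height, width = window
    tile_rows = (row_off + height - 1) // tile_size - row_off // tile_size + 1
    tile_cols = (col_off + width - 1) // tile_size - col_off // tile_size + 1
    return tile_rows * tile_cols * tile_size * tile_size


# Hypothetical 10,000-pixel-wide scene; we want a 1,000 x 1,000 window
# given as (row offset, column offset, height, width).
window = (3000, 4000, 1000, 1000)
wanted = window[2] * window[3]

strip_pixels = strip_pixels_fetched(window, image_width=10_000)
tile_pixels = tile_pixels_fetched(window)

print(f"wanted: {wanted:,} pixels")
print(f"strips: {strip_pixels:,} pixels ({strip_pixels / wanted:.1f}x)")
print(f"tiles:  {tile_pixels:,} pixels ({tile_pixels / wanted:.1f}x)")
```

Under these assumptions the strip layout pulls in ten times the pixels we asked for, while the tiled layout pulls in roughly 2.4 times: still some extra mayonnaise, but a jar rather than a bucket.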
SpatioTemporal Asset Catalogs
SpatioTemporal Asset Catalogs (STACs) are a standard format for organizing and describing the metadata for geospatial data. STACs describe parameters like the date and time of image capture, location of the image on the earth’s surface, the URL for accessing data, the type of data you will receive, etc. in a JSON format. They allow us to find data, as well as filter the exact data we want. STACs empower users to get what they want efficiently, without any overhead for the data provider.
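As a rough illustration of what this metadata looks like and why it makes filtering cheap, the sketch below builds a few trimmed-down STAC-style items as Python dictionaries and filters them by location and date. The item IDs, bounding boxes, and `example.com` URLs are made up for the example, and a real STAC item carries more fields (such as `geometry` and `links`) than shown here.

```python
from datetime import datetime, timezone

COG_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"


def make_item(item_id, bbox, when):
    """Build a trimmed-down, hypothetical STAC-style item."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": bbox,  # [west, south, east, north] in degrees
        "properties": {"datetime": when},
        "assets": {
            "red": {"href": f"https://example.com/{item_id}/red.tif", "type": COG_TYPE},
        },
    }


items = [
    make_item("example-scene-a", [-70.7, 41.5, -69.9, 42.1], "2020-06-01T15:30:00Z"),
    make_item("example-scene-b", [-70.7, 41.5, -69.9, 42.1], "2019-01-12T15:30:00Z"),
    make_item("example-scene-c", [2.0, 48.5, 2.8, 49.1], "2020-06-03T10:45:00Z"),
]


def bboxes_intersect(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse_time(text):
    # Swap the trailing "Z" for an explicit offset so older Pythons can parse it.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def search(items, bbox, start, end):
    """Keep items that overlap the bbox and fall inside the time range."""
    return [
        item
        for item in items
        if bboxes_intersect(item["bbox"], bbox)
        and start <= parse_time(item["properties"]["datetime"]) <= end
    ]


matches = search(
    items,
    bbox=[-70.5, 41.6, -70.0, 42.0],
    start=datetime(2020, 1, 1, tzinfo=timezone.utc),
    end=datetime(2020, 12, 31, tzinfo=timezone.utc),
)
print([item["id"] for item in matches])
```

Only the metadata is touched here; no imagery is downloaded until we follow an asset's `href`. In practice a STAC API performs this kind of search on the server, so treat the local filter as a picture of the idea rather than how a catalog is normally queried.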
A STAC can be thought of as describing a multidimensional array with multidimensional metadata. In the following photo, we can see the STAC on the left looks similar to the xarray logo!
Xarray is a popular library in the science community that helps keep track of metadata and work with these multidimensional datasets. A 4-dimensional xarray DataArray is a natural way to represent a STAC: two spatial dimensions, a time dimension, and an asset dimension, since every item in the STAC can have multiple assets.
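To make the four dimensions concrete, here is a small sketch that builds a toy DataArray by hand from random numbers. The dimension names `time`, `band`, `y`, and `x`, the dates, and the coordinates are assumptions chosen for illustration; no real imagery is involved.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinates: three acquisition dates, three assets (bands),
# and a tiny 4 x 5 pixel grid.
times = np.array(["2020-06-01", "2020-06-17", "2020-07-03"], dtype="datetime64[ns]")
bands = ["red", "green", "blue"]
y = np.linspace(42.0, 41.6, 4)
x = np.linspace(-70.5, -70.0, 5)

rng = np.random.default_rng(0)
data = rng.random((len(times), len(bands), len(y), len(x)))

stack = xr.DataArray(
    data,
    dims=("time", "band", "y", "x"),
    coords={"time": times, "band": bands, "y": y, "x": x},
    name="toy_stack",
)

# Select by label instead of by integer position: the red band in June 2020.
red_june = stack.sel(band="red", time=slice("2020-06-01", "2020-06-30"))

# Collapse the time dimension into a single composite image.
composite = red_june.median(dim="time")

print(stack.dims, stack.shape)
print(red_june.shape, composite.shape)
```

The labels are the point: asking for "the red band in June" reads like the question we actually have, and xarray keeps track of which axis is which.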
Dask is the second logical library to include here. In fact, the STAC in the above GIF also looks like a Dask Array! Dask lets us represent this large data, and comes with another advantage – parallel processing and cloud computing. It’s important to note that more computing cores are not always the solution, and it’s worth optimizing your analysis for a single machine first.
Dask is useful in the case of Earth observation data because we can have up to 30 PB of data, as mentioned earlier. It’s not feasible to move this amount of data around. Dask allows us to ship our computation to where the data already lives (cloud data centers).
“Even if you’re working with only 2GB of data, it’s nice to not have to bring the data to you, and computing on the cloud can be much faster.”
stackstac – Bringing the Power of xarray and Dask to Geospatial Analysis
As we saw earlier, it would be very useful if STAC looked like xarray backed by Dask. That’s exactly why Gabe developed stackstac. stackstac makes a stack out of our STAC—more concretely, it helps turn a STAC into an xarray DataArray as shown below:
In the chunked Dask Array shown above, each cube in the diagram is one tile of one GeoTIFF. Despite the simplicity here, it is surprisingly challenging to achieve:
- Data is usually not elegant and evenly gridded.
- NumPy arrays have to be rectangles, but not all GeoTIFF files have the same dimensions or even represent the same area on the earth’s surface.
We need to pick a bounding box, a pixel size, and a consistent coordinate system across all files, then combine them into a NumPy stack. stackstac helps pick these common spatial parameters for you.
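The idea of picking common spatial parameters can be sketched in a few lines of plain Python. This is a simplified illustration, not stackstac's actual implementation: the scene footprints are made up, they are assumed to already share one projected coordinate system measured in meters, and the 30-meter pixel size is an arbitrary choice.

```python
import math

# Hypothetical footprints as (min_x, min_y, max_x, max_y) in meters.
# The two scenes overlap only partially.
footprints = {
    "scene_a": (300_000, 4_600_000, 410_000, 4_710_000),
    "scene_b": (350_000, 4_650_000, 460_000, 4_760_000),
}


def union_bounds(boxes):
    """Smallest box that contains every input box."""
    boxes = list(boxes)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def grid_shape(bounds, resolution):
    """(height, width) in pixels of a grid covering bounds at this pixel size."""
    min_x, min_y, max_x, max_y = bounds
    width = math.ceil((max_x - min_x) / resolution)
    height = math.ceil((max_y - min_y) / resolution)
    return height, width


bounds = union_bounds(footprints.values())
height, width = grid_shape(bounds, resolution=30)

# One layer per scene, all resampled onto the same rectangle.
stack_shape = (len(footprints), height, width)
print(bounds)
print(stack_shape)
```

Each scene fills only part of the shared rectangle, so the rest of its layer has to be padded with missing values. Reprojecting scenes that arrive in different coordinate systems is the harder part, and it is left out of this sketch.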
stackstac also helps us deal with GDAL multithreading (which can be very difficult to do!) and optimizes Dask graphs. Just by having data in xarray format, a whole new world of geospatial analysis opens up, which also extends to everything Python! Gabe talks more about these in detail and shares what’s missing in stackstac in the webinar recording.
Live Visualization on Interactive Map with stackstac
Gabe creates a GIF visualizing all the Landsat data around the Cape Cod region in Massachusetts, USA. He also visualizes satellite imagery live on an interactive ipyleaflet map!
Check it out in the webinar recording and the following notebooks!
Join the Community!
The best way to support stackstac is by using it. Gabe welcomes you to open new issues and create PRs in the stackstac GitHub repository. We would love to build a community around stackstac!
At Coiled, we contribute to improving Dask and making it more accessible. We’re building Coiled Cloud, which lets you create Dask clusters on the cloud in just one step. Try it out below: