• April 29, 2021
Distributed Data Science and Oceanography with Dask
Dr. Chelle Gentemann is an oceanographer specializing in data taken from NASA satellites and robotic sailing vehicles. She is the lead scientist on an exciting new proposal for a new NASA satellite called Butterfly, and a fierce advocate for open science and inclusivity. Chelle is not only a remarkable scientist, she is also a dear friend of Coiled who has supported us from the very beginning.
We loved having Chelle for a Science Thursday livestream this month to discuss the future of oceanography and the role Dask has to play in it. Check out the replay below:
In this post, we will cover:
- How do oceanographers image ocean temperature?
- What does their data analysis pipeline look like today, and where is it headed with Dask?
- Where to access cloud satellite data and how to work with it? See how, just by accessing data on the cloud, you can save nearly a week typically spent reading data!
- How Dask is empowering scientists to ask different questions and consider new solutions.
Imaging Ocean Temperature
The ocean impacts our weather and climate, so measuring ocean temperature is important and is one of the primary measurements used in weather models. There are many satellites imaging Earth’s ocean temperatures at different frequencies and at different orbits. The following slide shows two such satellites:
- the Aqua satellite that carries a microwave radiometer and an instrument called MODIS that measures temperature in infrared frequencies; and
- the GOES-R satellite that images ocean temperatures, along with clouds and various other parameters.
On the left, we see the gulf stream that flows in the west coast of the United States. The stream forms eddies as it starts to separate, which affects our daily weather and climate.
The different satellites collect a lot of data, which includes missing values that are bad because of rain contamination or clouds. For example, a polar-orbiting satellite can have vertical strips of missing data. Oceanographers need to combine numerous such datasets to create a cohesive map, like the one shown below, that can be used for science and weather predictions.
Data Science Pipelines for Oceanography
The traditional pipeline for analysis starts with the time-consuming dataset downloads. This is followed by preprocessing and breaking the data into smaller pieces that fit in memory. We can then run interpolation and analysis over that data.
Chelle remarks how that’s a lot of coding! Especially for scientists who weren’t trained for software development. This not only makes the system more prone to failure, but also requires a lot of maintenance.
The community is steadily moving towards a future pipeline that involves writing a lot less code. We don’t need to download all the data, and can work directly in the cloud. Data filtering and preprocessing can be done with lazy loading in a delayed fashion. We can then build our workflow and move directly to Prefect-managed interpolation and analysis. Dask lets scientists have access to everything, but computes the scheme at the very end.
“To me, Dask manages all of the memory, it manages all of the workers, it does all of the things that I used to have to do and write in my really kludgy fortran and tracking files, it’s all very graceful in Dask.”
Cloud Satellite Data
There are 4 types of satellite data:
- L1 – Unprocessed satellite observations with latitude, longitude, and time.
- L2 – Geophysical retrievals (like ocean temperature, wind, etc.) derived from the previous data type.
- L3 – Geophysical retrievals mapped to a uniform space-time grid.
- L4 – Model analysis of lower-level data, which may include multiple sensors.
Oceanographers develop algorithms to transform one data type into another, like L1 to L2. These algorithms weren’t previously shared because they were designed to run within specific computational environments. The Python open source ecosystem consisting of conda, Jupyter, binder, and many others, have made it much easier to share these environments, and the Earth science community is now moving towards an open system. The two major benefits here are:
- Traceability of code, and
- Reduction of redundant effort.
“When you provide access to the environment, you’re opening the door to any scientist, anywhere in the world to do science. Having different voices being able to participate in science is really going to change for the better,and what solutions we find.”
Accessing Data in the Cloud
There exists a Registry of Open Data where we can access the data for free. Chelle demonstrates using AWS, but many cloud providers like Google Cloud Platform and Microsoft also have similar programs.
We use the 1 kilometer global temperature dataset. The first step is importing the required libraries:
import datetime as dt import xarray as xr import fsspec import s3fs from matplotlib import pyplot as plt import numpy as np import pandas as pd
We can now access the ~4 TB dataset from Amazon S3, map it to a location, and convert it into an xarray dataset
file_location = 's3://mur-sst/zarr' ikey = fsspec.get_mapper(file_location, anon=True) ds_sst = xr.open_zarr(ikey,consolidated=True)
This operation takes Chelle just about 12 seconds, where all the metadata is loaded, but not the actual data. This allows scientists to get an understanding of the dataset quickly and make decisions about the best way to chunk the data depending on the computations they are doing.
We then read the entire 10 years of data for a single point and visualize it:
sst_timeseries = ds_sst['analysed_sst'].sel(time = slice('2010-01-01','2020-01-01'), lat = 47, lon = -145).load() sst_timeseries.plot()
Just by accessing data on the cloud, we saved nearly a week typically spent reading data! It’s also worth mentioning how many open source tools have worked together to render this plot — we’re working in a Jupyter notebook, using xarray and Dask, and plotting with matplotlib.
Chelle talks further about the interesting anomalies here, demonstrates another analysis of the Chukchi Sea Surface Temperature data, and discusses why grid resolution doesn’t always equal spatial resolution. Check it out in the livestream replay!
Dask is Empowering Scientists
Dask isn’t used directly in the analysis, but it powers libraries like xarray and Pangeo that are integral to oceanography. It allows scientists to condense their computations to one line of code, and takes care of everything else for them.
“I learned how to do science where the first step was thinking about the data. I think that my mind is in some ways limited to what problems I think I can answer, and what questions I think I can ask, because in the back of my mind, I always know too much about the data constraints.
I feel we’re going to enter a period fairly quickly where there’s going to be a generation of scientists who don’t worry about that, because they don’t have to, and that’s going to change what possible questions they’re asking, and I’ll try to change and adapt with them.”
Here at Coiled, that’s exactly our mission. We aim to make Dask and Cloud Computing accessible to data scientists by taking care of the software and DevOps elements, so that they can get back to their science.