Scalable Computing in Oceanography with Dask and xarray
• October 13, 2020
Deepak Cherian is a physical oceanographer and project scientist at the National Center for Atmospheric Research. He recently joined us to discuss scalable computing in oceanography and how he leverages Dask, xarray (he’s a lead maintainer!), and terabyte-scale datasets to study the physics of oceans.
In this post, we’ll summarize the key takeaways from the stream. We cover:
- What is physical oceanography?
- Introducing xarray and Dask in Python
- Visualizing these large scale simulations
- xarray objects are Dask collections!
- Reading entire sets of files as xarray datasets
You can find the code for the session in this GitHub repo.
What is physical oceanography?
Deepak studies the fluid dynamics and fluid physics of the oceans. What does oceanography look like today?
The long answer involves being on ships for 30 days or more around the world, taking measurements with autonomous robots, deploying instruments on moorings that record data for a year at a time, and more.
The gist: modelers solves finite difference equations, basically Newton’s second law for fluid physics (a.k.a. the Navier-Stokes equations). You specialize those equations for the ocean you’re dealing with (specifying for atmospheric conditions) and solve them numerically. TLDR; Deepak performs large-scale ocean simulations using differential equations.
Xarray, Dask, and coral reefs
For the rest of this post, Deepak (as quoted below) leads us through his work, which sits at the intersection of oceanography, xarray, and Dask.
Reminder: xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!
He started by describing a new project he’s working on in the Eastern Tropical Pacific.
“These are locations of marine protected areas that have coral reefs, and we are interested in preserving these corals. The idea behind this project is that we have a bunch of these Western Pacific corals that are known to be living in somewhat extreme conditions defined by chemistry and so we think that they’re probably resilient to these future [adverse] climate conditions [caused by global climate change].”
“These reefs are pretty interconnected with each other and that’s illustrated with these little arrows down here in a published scientific paper.”
If Deepak & colleagues can identify the portions of the coastline that are resilient to adverse climate conditions and are effective at reaching other places, these corals are good candidates for conservation.
How does Deepak identify these reefs?
Diving into xarray and Dask in Python
The key thing about xarray is that it wraps other array packages like Dask array, and in wrapping them it adds a layer of metadata handling.
One way is to run an ocean model. The output of that is something called a flow field (with velocities in all three dimensions). We can then insert little particles in these flow fields that represent coral larvae. You see where these particles go relative to where they are deployed. If the particle tracks indicate large connectivity between the place they were deployed and the place they land at, then those deployment locations may be a good place to spend limited conservation resources.
Here’s the process with technologies inserted, “So the first thing we do is we run a model with FORTRAN. Then, using xarray and Dask, we save and analyze a lot of output of that model.” Deepak refers to xarray and Dask here as “scalable post-processing analysis software.”
“The key thing about xarray is that it wraps other array packages like Dask array, and in wrapping them it adds a layer of metadata handling. That lets xarray users use the metadata for analysis code.”
Deepak then showed us how he set up his cluster. He’s using NCAR’s Cheyenne cluster, which means he’s using high-performance computing (a.k.a. HPC). He’s also leveraging Dask-Jobqueue, which he imported at the top of the cell below.
“So now I have 12 Dask workers all waiting for me to give them something to do.”
Deepak’s data is a large collection of netCDF files. “We have one file per day which represents the daily average of all our fields. And currently we look at one year of data so that’s 365 files. With 2 GB per file, we’re looking at 720 GB on disk.”
Deepak read in one file with xarray.
“This array represents seawater potential temperature with units degrees celsius. Now you’re starting to see the metadata that xarray is layering on top of Dask array. This Dask array has four axes, so that’s a four-dimensional array. Axis zero is time, axis one is zl which is depth, axis two is yh which is latitude, and axis three is xh which is longitude.”
“So this is the first level of metadata that xarray adds: names for your axes.”
The second thing xarray does is it brings coordinate labels. “If I go up here, xarray tells me that my yh has values from -14.975 up to 29.975 (in degrees north, since these are latitudes).”
Deepak summarized xarray’s value proposition: “This metadata is all bundled up together in a little xarray object. This information encapsulated together is super convenient to use!”
Deepak then inspected the temperature data further. “So [we type] ds.thetao to pull out the first element on the zl dimension. So this would be the most surface layer. If you look at our metadata, you’ll see that zl is 1.25 — so this is 1.25 meters below the surface of the water.”
“And you see that underneath, what’s happened is xarray has taken this isel statement and made it a Dask array indexing operation, so I’ve lost one dimension.”
Now it’s time to do some oceanographic analysis.
Visualizing these large scale simulations
This is pretty readable code. I can go through this two months from now and I’ll know that I picked the first 100 meters, I took a mean, and then I plotted it.
Visualizing the ocean model output data is the high-impact next step. Figures like the one he creates below help identify ideal targets for conservation and are sent to decision makers. xarray and Dask’s and compatibility with matplotlib makes this easy.
“I just run .plot(), you see a little task showed up for reading it in the Dask dashboard, and we get a pretty image of sea surface temperature for this model.”
“So this is a matplotlib plot. xarray told Dask array to give it the actual value, so you see a little compute task there and it’s plotted up on screen. xarray also used all the metadata it knows to make a little nice labeled figure.”
Hugo popped in, “This is awesome. I’m a total xarray noob, but I understand all the code you’ve showed, which means that there’s a lot of great work that’s being done into API design. You’ve written minimal code and that makes me feel good.”
Back to Deepak, emphasizing xarray’s convenience. “Nobody just has data. You always have data plus some metadata. And the nice thing about xarray and pandas is that you can use that knowledge. You don’t have to remember what axis zero was.”
Deepak showed us another matplotlib visualization using the same process. “This is showing you that the equatorial pacific tends to be kind of cold. And you see some cool vortex eddy features up there.”
“And like you were saying, this is pretty readable code. I can go through this two months from now and I’ll know that I picked the first 100 meters, I took a mean, and then I plotted it.”
Hugo hopped in, “I hesitate to say this, but this looks like what I want my NumPy code to look like sometimes. As you say, using the metadata that works with my mental model of my data.”
Deepak quipped back, “Yes, you definitely need xarray then!”
xarray objects are Dask collections!
xarray objects are Dask collections. If you’re a Dask user, it’s not much of a jump to using xarray, which is great! Deepak continued, “So in this case I’ve called dask.persist on a couple of variables to save time. You’ll see that it executed and it loaded things into my distributed cluster memory.”
“And now I can do my usual xarray things. All of my computations will be scattered across the cluster. Here’s a Dask graph using dask.visualize. In this case, it’s showing that there’s a file, there’s an open_dataset task, and that’s going to give me my array.”
Deepak finished this section by noting that xarray can wrap other “N-D array-like things” like pydata/sparse arrays, CuPy GPU arrays, and pint, which does unit-aware computations.
Reading entire sets of files as xarray datasets
Here’s where we start to really leverage Dask to read all of those in parallel.
Deepak just showed us his xarray workflow with a single file. But we have 365 files. Time for Dask to shine.
“Here’s where we start to really leverage Dask to read all of those in parallel. We use a common pattern with dask.delayed. This code goes and touches all 365 files, reads just the metadata (not the actual data), and creates a representation of that dataset on disk.”
“Let’s look at the temperature Dask array again. This array is approximately 200 GB.”
So Deepak used dask.delayed to create a bunch of Dask arrays, which get wrapped by xarray. Now, “all the cool stuff that I showed earlier just works! What are the monthly mean sea surface temperatures? Here’s how you would express that in xarray.”
This code should look familiar if you’ve used pandas, which is intentional on xarray’s part as it tries to mimic the pandas API. “And we get back a little Dask thing again. There are 12 points on the time dimension (one per month).”
Deepak pointed out the benefits of Dask again. “And it’s all lazy because we’re wrapping Dask arrays! Everything’s lazy until you actually ask for a value. So I’m going to do that here again with .plot(). I run it, you see Dask is computing, and now matplotlib is making me a figure.”
“So this is one panel per month, where each panel is the monthly mean sea surface temperature. If you focus at the equator, it’s kind of warm up top, and down here it’s kind of cool. That happens to do with the winds getting more intense during the summer which ends up cooling the sea surface.”
Deepak described his workflow at a high-level. “I tend to load a representation of a big thing, then reduce it down to a small thing, and then look at that for details. And then once I see something interesting, I might go back and then compute something on the whole thing.”
Matt followed up, “That’s the beauty of xarray. It gives you the expressibility to select what you want and the beauty of laziness which allows you to think about the whole thing in memory but only actually work with the bits that you need.”
Dask and xarray are an excellent pair, and Deepak Cherian’s one of the best to talk about them. We’re grateful for his time.
As we learned, Deepak’s workflow is made easy with Dask, and here at Coiled we manage Dask to make it easy to scale on the cloud. Feel free to learn more about what we’ve been up to by checking out our free Coiled Cloud Beta below.