Dask in Action with Massive Satellite Datasets
• August 10, 2020
action (noun): the most vigorous, productive, or exciting activity in a particular field, area, or group. // wants to be where the action is.
We recently spoke with oceanographer, remote sensing expert, and open science advocate Chelle Gentemann about her experiences working with massive satellite datasets and how Python and Dask make the scientific process more efficient.
We had so much fun we followed up with Chelle about the specific scientific questions Dask solves for, given the massive satellite datasets she works with. Chelle also shares why she feels the need to do an evil laugh whenever she uses Dask.
In this post we cover the following topics:
- The problems Dask solves,
- Chelle’s favorite Dask features,
- Chelle’s Dask pain points,
- Pangeo, a community of scientists that are focused on building community, reproducible science, and open-source software,
- Chelle’s computational stack.
This interview was lightly edited for clarity. Many thanks to David Venturi for his editorial help on this post.
The Problems Dask Solves
What Dask has allowed me to do is I just do everything on the fly now, as it allows me to code up my question while not having to worry about the compute.
HBA: Hey Chelle! What type of scientific questions does Dask help you to answer?
CG: When I use Dask, some of the questions I’m answering are
- “How do things change in our environment?”
- “How is the distribution of seabirds related to the distribution of ocean temperatures?”
- “How do you bring really, really big satellite datasets together and do different types of analyses?”
- “How do the lower atmosphere and the upper ocean interact with each other to produce weather?”, as I often analyze air-sea interaction.
I also do a lot of biophysical research where you look at the physical oceanography, like currents and eddies, and how the ocean environment may affect ecosystems. And all of this requires a lot of big satellite datasets. What I’m usually doing is looking for changes in one parameter that correlate with changes in another.
It used to be that I would have to go through all the data, calculate these different climatologies, and then save them, all of which are huge amounts of data. And say if you wanted to ask a new question, you’d have to go through that same lengthy process again because it’s a different question. What Dask has allowed me to do is I just do everything on the fly now. Plus, I’ve stopped saving all these different intermediate datasets.
Now, all I have is the description of my question and an understanding of what I need to do. And I just do that on the fly now, as Dask allows me to code up my question while not having to worry about the compute. Again, it allows me to analyze these large datasets that I’d normally be analyzing piece-by-piece.
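To make the "on the fly" workflow concrete, here is a minimal sketch using dask.array. It is not Chelle's actual code: the synthetic sea surface temperature array, the grid size, and names such as `sst` and `climatology` are assumptions made for illustration. The point is that the climatology and the anomalies are only described, with no intermediate dataset saved, and nothing runs until the final `compute()`.

```python
import dask.array as da
import numpy as np

# Hypothetical stand-in for a satellite record: 10 years of monthly SST maps.
n_years, n_lat, n_lon = 10, 18, 36
n_time = n_years * 12

# A seasonal cycle plus a slow warming trend, defined lazily (nothing runs yet).
t = da.arange(n_time, chunks=12)
signal = 15.0 + 5.0 * da.sin(2 * np.pi * t / 12) + 0.01 * t
sst = signal[:, None, None] + da.zeros(
    (n_time, n_lat, n_lon), chunks=(12, n_lat, n_lon)
)

# Monthly climatology: the mean of every January, every February, and so on.
climatology = da.stack([sst[m::12].mean(axis=0) for m in range(12)])

# Anomalies: each year's 12 maps minus the climatology, still a lazy task graph.
anomalies = da.concatenate(
    [sst[y * 12:(y + 1) * 12] - climatology for y in range(n_years)], axis=0
)

# Only now does Dask do the work, and only the final number is kept in memory.
mean_anomaly = float(anomalies.mean().compute())
print(climatology.shape, anomalies.shape, mean_anomaly)
```

Asking a different question (say, a seasonal rather than a monthly climatology) would mean changing a line or two of the description and calling `compute()` again, rather than regenerating and saving a new intermediate file.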
HBA: The workflow you described reminds me of in machine learning when doing things like hyperparameter tuning. A lot of the time there’s a zen to it, where we try something and write it down, or have a spreadsheet with all of these numbers. There’s some magic we’re doing there.
CG: Exactly, I have a notebook for each project because I’m constantly experimenting, exploring, and visualizing. And so much of scientific innovation depends on that chance finding, when you are looking for one thing and find something else interesting. Seeing features of data requires exploration, and xarray and Dask are so powerful for science. I can explore the entire dataset easily and quickly as opposed to fiddling and struggling to read in data. I have access to everything right away.
HBA: I love that statement: I have access to everything. Whereas with previous technologies, you didn’t. You had access to small chunks at different points in time, as opposed to having a slightly more bird’s eye view, so to speak.
CG: Yeah, it changes how you look at science. Like, it’s just crazy. I feel like I should do an evil laugh every time I say that. It’s like I have everything at my fingertips.
HBA: I love it.
CG: It makes it so fast to explore hypotheses. Instead of having to load, find the file name, and load that data, it’s just there. It was funny! Like, the whole time I was learning Python, I was just at my computer just laughing. And I was just like, “This is just silly. This is totally silly.” And people are asking me, “What are you doing? Why are you laughing?” I’m like, “I’m coding!” My kids complain that I don’t work: they go to school, and it is hard, while I just sit up at my computer and laugh.
The Dask dashboard really helps you scale your science on the cloud and understand where you’re at so you’re not wasting time and resources.
HBA: Do you have any favorite Dask features?
CG: When you go to the Dask dashboard and you open up your different workers and your processes and your memory. Being able to visualize that. It’s almost mesmerizing, right?
I think that’s a feature of Dask that’s going to really help people get excited about learning cloud computing because you can show them what’s happening visually. You can see how all of the different processes are getting completed and what Dask is doing.
HBA: Dynamically, right! You see evolution.
CG: Yeah, and it’s really nice! You know when a process is going wrong. And you know when a process is going right, because you’re like, “okay, it’s just going to be another two minutes” or “I’m going to open up another bunch of workers because this is going too slowly.” The Dask dashboard really helps you scale your science on the cloud and understand where you’re at so you’re not wasting time and resources.
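For readers who have not seen the dashboard, the following is a rough local sketch of the pattern described above: find the dashboard address, then add workers when a job is going too slowly. It assumes the `distributed` package is installed (rendering the dashboard pages also needs `bokeh`), and it uses a small in-process `LocalCluster` as a stand-in for a cloud deployment, which would be configured differently. The worker counts and array sizes are arbitrary.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Start a small in-process cluster; dashboard_address=":0" picks any free port.
cluster = LocalCluster(
    n_workers=2, threads_per_worker=1, processes=False, dashboard_address=":0"
)
client = Client(cluster)

# Open this URL in a browser to watch tasks, workers, and memory.
dashboard_url = client.dashboard_link
print("Dashboard:", dashboard_url)

# "Open up another bunch of workers" when the job is going too slowly.
cluster.scale(4)
client.wait_for_workers(4)
n_workers = len(client.scheduler_info()["workers"])

# Some work to watch on the dashboard: the mean of a chunked random array.
x = da.random.random((2000, 2000), chunks=(500, 500))
result = float(x.mean().compute())
print(n_workers, result)

# Shut everything down so no resources are left running.
client.close()
cluster.close()
```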
Dask Pain Points
HBA: Is there anything frustrating about Dask that you’ve experienced?
CG: I mean, sometimes I get these errors. For example, in one of my notebooks I get an error on AWS and I can’t tell if it’s a Dask issue or if it’s an AWS deployment issue. It’s unclear where these errors come from. That’s often true of coding in general, though.
HBA: A lack of clarity around that is tough. But the other thing that you’ve mentioned previously is the fact that it’s inconsistent across different cloud providers, which is annoying, right? Like if someone’s using AWS and someone’s using GCP, you’d like to have the same experience in Dask.
CG: Yeah, and I get it. I mean, there’s a lot that goes into making a deployment work on AWS and a deployment work on GCP. So my suspicion is that it’s something in that deployment. Luckily, I’m able to rely on the Pangeo community to create this playground for me.
It is easier to collaborate and build on results when everyone has equal access to data and compute.
HBA: Yeah. So that raises something we haven’t really talked about…Pangeo! Maybe you could give me the rundown.
CG: So, Pangeo is a community of scientists that are focused on building community, reproducible science, and open-source software. The founders, Ryan Abernathey and Joe Hamman, had this idea to create a platform to demonstrate how powerful cloud computing could be for Earth science. So they created this platform, originally on Google Cloud using Kubernetes, where you could spin up all of these workers to do big data analysis. And it evolved into this amazing community of developers and scientists working side-by-side to advance science.
Now Pangeo is really trying to bring in even more scientists to the community. Originally, I think many of us thought that if you’re not doing “big” data analysis, then you don’t need cloud computing. And maybe that is partly true, but what I have found is that once you start doing cloud computing and see the power of this workflow, you buy in and want to bring others in too. Bringing your analysis to the cloud makes your science faster, more reproducible, and more shareable. It is easier to collaborate and build on results when everyone has equal access to data and compute. Also, to look at relationships and understand connections you often need access to large datasets; however, downloading these data is time- and resource-intensive. Cloud computing skips this step and allows instantaneous access to the entire dataset.
So Pangeo is this big online community with a chat on Gitter and a Discourse forum. Pangeo is trying to help scientists do big data analysis on the cloud. This is a really useful resource because when you get stuck, there are developers that can help you. When I have a question, there is the community who can help me, and other scientists like me.
Chelle’s Computational Stack
HBA: So the final question: what else is in your stack? We’ve talked about xarray. We’ve talked about Dask. We’ve touched on Jupyter via Binder. What’s the rest of your computational stack look like?
CG: I do some stuff with GeoPandas. Ecologists deal with shape files a lot so I use GeoPandas to get shape files into xarray. I do a lot of NumPy. I also use SatPy, PyResample, SciPy and scikit-learn because I do a lot of regressions and detrending of data and power spectrum analysis. All of those sorts of things.
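As a small illustration of the regression, detrending, and power spectrum steps mentioned here, the sketch below uses NumPy and SciPy on a made-up monthly time series. The series (a linear trend, an annual cycle, and some noise) and all of its parameters are assumptions for the example, not data from Chelle's work.

```python
import numpy as np
from scipy import signal

# Hypothetical 20-year monthly series: trend + annual cycle + noise.
rng = np.random.default_rng(0)
n_months = 240
t = np.arange(n_months)
series = 0.02 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0.0, 0.3, n_months)

# Regression: least-squares straight-line fit to estimate the trend per month.
slope, intercept = np.polyfit(t, series, 1)

# Detrending: remove the best-fit line so only the variability remains.
detrended = signal.detrend(series, type="linear")

# Power spectrum: periodogram with one sample per month (fs=1.0),
# so frequencies are in cycles per month.
freqs, power = signal.periodogram(detrended, fs=1.0)
peak_period = 1.0 / freqs[np.argmax(power)]

print(f"trend: {slope:.4f} per month, dominant period: {peak_period:.1f} months")
```

The same calls can be applied along the time axis of a gridded dataset; the one-dimensional series here just keeps the sketch short.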
HBA: Thanks, Chelle!
CG: Thank you!