Accelerating Science with Dask
July 26, 2020
Many scientists can “do” science well. Few can do it well at scale.
In this post, we share Part 1 of our interview with oceanographer, remote sensing expert, and open science advocate Chelle Gentemann. Chelle shares her experiences working with massive satellite datasets and how Python and Dask make the scientific process more efficient.
In this post we cover the following topics:
- Working with NASA satellite data,
- Learning Python or the power of the open-source,
- Discovering Dask and accelerating science,
- Dask as democracy.
This interview was lightly edited for clarity. Many thanks to David Venturi for his editorial help on this post.
Hugo Bowne-Anderson: Chelle, perhaps you could give me a bit of context around your background and what you do.
Chelle Gentemann: I’ve been a scientist for a little over 20 years, specializing in satellite remote sensing and oceanography. Almost all of my funding has been through NASA because I work on satellites.
After grad school, I was working as a physical oceanographer at a company that specialized in developing algorithms for passive microwave sensors. Our job was to develop algorithms. In other words, we took what the satellite measured and used physics to determine ocean temperature, wind speed, etc.
I started working on blending different satellites together to produce daily maps of ocean temperature at really high resolution. In order to do this, I would have to download like 280 files for each satellite data set, per day. I was bringing in six or seven different satellites. At that time, I was programming just in MATLAB and Fortran, writing code to parallelize the processes on my own. I was creating different scripts to run different parts of the analysis and a lead program that would check for when things were finished. Everything was limited by how much RAM my computer had. So I would have this one fancy computer that I could run things on because that was the only place I could open up these massive files. I could open bits of the files, but if I opened the whole thing, I had to be really careful about managing my memory and how I did my analysis.
HBA: And when was this?
CG: This was in the early 2000s. When I left that company, I lost access to all the software I had developed over my career because it was a commercial company. This was a bit frightening for someone mid-career. But in the end I was like, “Okay, I’m going to solve this problem and improve. I’m going to try Python for my science.”
LEARNING PYTHON OR THE POWER OF THE OPEN-SOURCE
I was just completely blown away by how it accelerated the speed at which I could do science.
CG: So I learned Python and as I started to learn, I was just completely blown away by how it accelerated the speed at which I could do science. I had been so used to having to code everything myself. For example, to open large datasets, I don’t have to worry so much about my local RAM because of lazy loading. Having all these Python tools at my fingertips was transformative.
HBA: Like Python’s batteries included philosophy, right?
CG: Yes, and using these tools not only makes my code more understandable, but the tools have been ‘tested’ by many users and have a lower chance of errors. So not having to debug as much code makes my process faster. Because I’d been at a commercial company, I wasn’t aware of the power of open-source software. I was trapped at that company without any control over my intellectual property, and open-source software solves that. So I embraced open-source software, and I embraced Python, and I really started learning very quickly.
I started creating courses and giving tutorials to bring other oceanographers on board because using data in the cloud requires a shift in skill set for scientists. Our governmental agencies are moving data onto the cloud. Having easy access to the massive volume and variety of data that satellites collect is going to open up new types of science, which is really exciting. Being an early adopter has opened up many doors and given me new opportunities.
DISCOVERING DASK & ACCELERATING SCIENCE
The cool thing about Python and Dask and xarray is that it takes the fortress that data used to be and destroys it.
HBA: Early on in your Python days, did you encounter a limitation of working with pandas? Is that how you came across Dask?
CG: Yes. I started with Python, taking NumPy and pandas tutorials. Then I got the NetCDF4 library installed. When I first started, I was basically writing Fortran code in Python. Everything was a loop! As I started to get comfortable and explore more in the Python world, I found out about xarray. Previously, if I wanted to create a climatology, I would have to go through all the data once, create the climatology, save the climatology, then go through the data again to calculate the anomalies. Now, with xarray, which integrates with Dask, I was able to create the climatology on the fly because xarray doesn’t compute it until you need it. This is really powerful because it makes my science more reproducible by taking out the creation of an intermediate dataset.
The program never touches the data until you actually need it! And that affects what types of questions you can ask. It affects how you think about science. I was raised in this era where the questions that I could ask and the type of analysis I could do were constrained. And so I naturally evolved to bake these constraints into all of my hypotheses when analyzing satellite data. And the cool thing about Python and Dask and xarray is that it takes the fortress that data used to be and destroys it. You used to have to learn how to navigate the fortress in order to get to the top and find your solution. That’s destroyed now. You can just ask questions and not worry about the compute because Dask takes care of it for you.
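The on-the-fly climatology Chelle describes might look roughly like the sketch below. It is an illustration rather than her actual workflow: the data are synthetic random values standing in for a satellite sea surface temperature record, the variable names are made up, and it assumes both xarray and Dask are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a satellite sea surface temperature record:
# two years of daily values on a tiny 4 x 5 grid (hypothetical data).
time = pd.date_range("2018-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(0)
sst = xr.DataArray(
    15 + rng.standard_normal((time.size, 4, 5)),
    coords={"time": time, "lat": np.arange(4), "lon": np.arange(5)},
    dims=("time", "lat", "lon"),
    name="sst",
).chunk({"time": 365})  # back the array with Dask chunks

# Monthly climatology and the anomalies relative to it.
# No intermediate climatology file is written to disk.
climatology = sst.groupby("time.month").mean("time")
anomalies = sst.groupby("time.month") - climatology

# The work is deferred until a result is explicitly requested.
result = anomalies.compute()
```

With real data, the synthetic array would typically be replaced by a dataset opened lazily from files or cloud storage, and the two analysis lines would stay the same.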
HBA: Amazing. I love the idea of a fortress that previously was impenetrable. Dask, xarray, and the PyData ecosystem level the landscape and allow you to scale.
The other thing that I really love that you’re speaking to is that technology facilitates building new mental models of how to think about scientific process and methodology: how to ask questions, all of these types of things.
CG: Yes. How is science organized? How do we do science? And how do we think about improving those? Cloud computing, to me, has this incredible opportunity to level the playing field.
My kids have these $36 Raspberry Pi mini-computers, which are just a motherboard and a wireless card, and the memory is a micro SD card. You can hook it up to an old keyboard and a screen, and install the Kano OS open source operating system. Kano has all of these Python games that my kids play.
So I go on it one day, open an open-source browser, and go to some of my public tutorials and I’m able to run Python tutorials using cloud data. I have like 250 GB of RAM open. I’ve got almost 100 workers going on both AWS and GCP at the same time doing these massive analyses. And I’m doing it for free because I’m using Binder! This type of massive computational power used to only be accessible to people at big institutions who had access to big computers and people to help them store data and organize the data and clean the data. When you start doing things with cloud optimized data using xarray and Dask, it just sort of breaks down everything that we’ve built science on. It opens up this whole new world where almost anybody can participate in science.
DASK AS DEMOCRACY
Dask accelerates my science by allowing me to focus on the science rather than the methodology for my computation.
HBA: Phenomenal. I don’t like buzz terms but what we’re talking about here is democratizing tools, techniques, and processing. Making them accessible is critical.
CG: Dask accelerates my science by allowing me to focus on the science rather than the methodology for my computation. I remember coding all that parallel processing up and it’s really hard! You have to become part computer scientist. You have to have a level of familiarity with your data and you have to be a pretty good programmer. Whereas when Dask takes care of all of that in the background for you, it allows you to focus on your domain expertise.
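As a rough illustration of the bookkeeping Dask takes over, the sketch below uses `dask.delayed` to process many files in parallel without any hand-written scheduling. The functions `load_granule`, `granule_mean`, and `average` are hypothetical stand-ins, not real satellite readers, and the count of 280 simply echoes the per-day file count mentioned earlier in the interview.

```python
from dask import delayed

def load_granule(index):
    """Hypothetical stand-in for reading one satellite file."""
    return [index * 0.1 + k for k in range(5)]

def granule_mean(values):
    """Average the values from a single granule."""
    return sum(values) / len(values)

def average(values):
    """Average the per-granule means into one daily value."""
    return sum(values) / len(values)

# Build a lazy task graph: one load and one mean per granule.
# Nothing runs while the graph is being assembled.
means = [delayed(granule_mean)(delayed(load_granule)(i)) for i in range(280)]
daily_mean = delayed(average)(means)

# Dask decides how to schedule the independent tasks when asked to compute.
result = daily_mean.compute()
```

The same graph can run on a laptop's threads or on a cluster of workers; only the scheduler changes, not the analysis code.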
HBA: I love it. Peter Wang gave this wonderful talk at JupyterCon several years ago, where he spoke about the open-source ecosystem and PyData in particular as being a substrate for innovation. You don’t necessarily need to look under the hood or understand how the combustion engine works anymore. Tools like Dask are part of a foundational layer that allows you to ask questions and ultimately “do science” more efficiently.
CG: Yes. Another big innovation here is when I used to program in Fortran and MATLAB, I couldn’t share my code easily. And people weren’t building these open-source libraries so everyone was starting from scratch. When I started using Python, there were all of these advanced libraries like xarray and Dask, and I felt like I was starting from the top of the mountain. I’m already doing interdisciplinary science. I’m building on the innovation of others and research that I can take and immediately start doing new things.
HBA: Standing on the shoulders of giants of open-source development.
Subscribe to our newsletter below and we’ll let you know when Part 2 of our interview with Chelle, Dask in Action, is live.