Interactive Computing at Scale with Dask

TL;DR

Simulation isn’t reality. But with Dask, it feels like it is.

In this post, we share Part 1 of our interview with Lindsey Heagy, Postdoctoral Researcher in the Department of Statistics at UC Berkeley. Lindsey shares how open-source tools like Dask helped her decrease the time it takes to run her geophysical simulations by a factor of 5,000, and how this changed the way she can think and reason about her science.

In this post we cover the following topics:

  1. Lindsey’s research and SimPEG,
  2. Interactive computing in scientific research,
  3. Her first contact with Dask and its utility,
  4. The impact of Dask on the scientific process.

This interview was lightly edited for clarity. Many thanks to David Venturi for his editorial help on this post.

INTRODUCING LINDSEY

We started a project called SimPEG, which is an open-source framework for simulation and gradient-based parameter estimation in geophysical applications.

Hugo Bowne-Anderson: Lindsey, can you tell me at a high level a bit about your research and what you do?

Lindsey Heagy: I did my PhD at UBC in Vancouver in geophysics, mostly simulating and inverting partial differential equations (PDEs). For example, we’d use electromagnetic data to get an electrical conductivity model of the subsurface, and then that’s something that we can then use to interpret geology. So it might be groundwater applications or mining applications, but somewhere where we want to be characterizing the subsurface. And that’s really how I got into open-source software.

We started a project called SimPEG, which is an open-source framework for simulation and gradient-based parameter estimation in geophysical applications. We did a fair bit of evangelism early on to get folks in the mining industry and groundwater industry on board. And a lot of that was actually through educational initiatives. We built a whole series of notebooks that connected widgets to simulation code, so you can move around your sources. If we’re doing, for example, a DC resistivity experiment, we’re going to hook up two electrodes to the ground and we’re going to pump current into the ground and then measure potentials. So it’s basically a big V equals IR experiment, and we’re trying to estimate the distributed resistivity of the subsurface. Being able to build up a model like that and then show where the current’s going is a pretty powerful visual for trying to understand what your data actually are.
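As a rough, editorial illustration of the “big V equals IR experiment” Lindsey describes, here is a back-of-the-envelope apparent-resistivity calculation. The Wenner-array geometry and all of the numbers are our own assumptions, not taken from SimPEG:

```python
# Back-of-the-envelope apparent resistivity for a DC resistivity sounding.
# The Wenner-array geometry and all numbers here are illustrative assumptions.
import numpy as np

I = 1.0   # injected current (A)
V = 0.05  # measured potential difference (V)
a = 10.0  # electrode spacing for a Wenner array (m)

# Geometric factor for a Wenner array over a homogeneous half-space
k = 2 * np.pi * a

# Ohm's law (V = I * R), scaled by the survey geometry
rho_apparent = k * V / I
print(f"apparent resistivity ~ {rho_apparent:.1f} ohm-m")
```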

[Image: Franklin Carmichael's painting "A Northern Silver Mine"]

INTERACTIVE COMPUTING IN SCIENTIFIC RESEARCH

We create notebooks so folks who have never seen code before can pick this up.

HBA: I was just going to ask if you use Jupyter Notebooks and ipywidgets or something like that.

LH: Yeah. We create notebooks so folks who have never seen code before can pick this up. All they need to do is click “Run All” and then start playing with the widgets. That was what I think drove a lot of people to be interested in SimPEG, but it also got me much more deeply connected with folks in the broader open-source software world. I got connected with Fernando Perez at a conference where he was giving some presentations on Jupyter. And we were talking about the use of widgets for creating interactive visualizations that are actually running simulations. So for each widget update, we’re actually solving a PDE. At the time, this was likely one of the more advanced uses of widgets for scientific computing.
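SimPEG’s actual apps solve a PDE on every widget update. As a minimal sketch of the pattern, here is a toy ipywidgets example where each slider move re-runs a stand-in simulation; the function and parameter names are hypothetical:

```python
# Minimal sketch of the widget pattern: each slider move re-runs a simulation.
# The "simulation" here is a toy stand-in, not SimPEG's actual PDE solve.
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

def plot_sounding(layer_depth=10.0, rho_top=100.0, rho_bottom=10.0):
    spacings = np.logspace(0, 3, 50)  # electrode spacings (m)
    # Toy smooth transition between the two layer resistivities
    w = 1 / (1 + np.exp(-(np.log10(spacings) - np.log10(layer_depth))))
    rho_a = rho_top * (1 - w) + rho_bottom * w
    plt.loglog(spacings, rho_a)
    plt.xlabel("electrode spacing (m)")
    plt.ylabel("apparent resistivity (ohm-m)")
    plt.show()

# "Run All", then drag the sliders
interact(plot_sounding,
         layer_depth=(1.0, 100.0),
         rho_top=(1.0, 1000.0),
         rho_bottom=(1.0, 1000.0))
```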

Fernando had just joined the stats department at UC Berkeley, and I wanted to do a postdoc, so I said, “Can I come do a postdoc with you and do some geoscience and Jupyter and stats, and it’ll be fun?” And he said yes! That’s what really started getting me connected with the Pangeo community. And that’s when I first got connected with Matt [Rocklin, founder of Coiled] and I started learning about Dask. I also started getting a sense of the more advanced packages that have really accelerated the possibilities for data science, especially in the geosciences, which I’m quite excited about.

So that’s the arc until now. And next up, I’ll actually be going back to UBC but now as an assistant professor, which I’m very excited about.

[Image: A screenshot of Lindsey Heagy's DC-2-layer-foundation-app Jupyter Notebook]

HBA: Oh congratulations! It’s great to hear good news in the midst of COVID-19. I hope that you’re able to enjoy it as much as possible.

LH: Absolutely. One positive is that I got to share the news with my parents, since I’m staying with them right now.

FIRST CONTACT WITH DASK AND ITS UTILITY

We want to run a lot of simulations. We’re focusing on the compute and CPU and memory advantages of Dask.

HBA: So you mentioned you first heard about Dask when you were at Berkeley. When was that?

LH: I think the very first time that I heard about it was at SciPy in 2016. But when I moved to Berkeley in 2018, that’s when I actually started looking at it much more seriously.

HBA: Did you have problems that other software wouldn’t solve?

LH: Yeah, so a lot of the problems that we solve in geophysics are parallelizable. But what we saw when looking at multiprocessing and other similar approaches is that they’re pretty invasive in the code, which takes away from code readability. Part of our goal was to make our code very readable so the code itself could be a teaching tool. So that was actually a lot of the motivation for looking more seriously into Dask. Then also being able to go from parallelization on a laptop to parallelization on an HPC cluster, because a lot of the serious problems that we want to solve are HPC sized. Seeing that both were possible with Dask was a big draw. (See the sketch below.)
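To make the laptop-to-HPC point concrete, here is a hedged sketch of how the same task code can run against either a local cluster or a batch system; only the cluster object changes. The solver function is a hypothetical stand-in:

```python
# Sketch of laptop-to-HPC portability: the task code is identical,
# only the cluster object changes. solve_pde_for_source is a stand-in.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)  # on a laptop
# On an HPC system you would instead do, for example:
# from dask_jobqueue import SLURMCluster
# cluster = SLURMCluster(cores=24, memory="64GB", walltime="02:00:00")
# cluster.scale(jobs=10)
client = Client(cluster)

def solve_pde_for_source(source_id):
    # Stand-in for the (unchanged, readable) simulation code
    return source_id ** 2

futures = client.map(solve_pde_for_source, range(100))
results = client.gather(futures)
```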

HBA: A lot of people I speak to use Dask because they have large datasets. But it seems that you use it because you want to run simulations, not necessarily because you have massive volumes of data.

LH: We want to run a lot of simulations. Take, for example, using airborne electromagnetics to try and characterize groundwater systems: we have to solve a PDE for basically every helicopter location where we collect data, and you can parallelize that. So that’s where we’re at. We’re focusing much more on the compute and CPU and memory advantages of Dask, rather than just the data volume side of things.
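A minimal sketch of that “one PDE solve per helicopter location” pattern with dask.delayed; the solver and the flight-line numbers are illustrative assumptions:

```python
# Each sounding location is an independent task: embarrassingly parallel.
# solve_em_pde is a toy stand-in for the electromagnetic forward solve.
import dask
import numpy as np

# 1,000 hypothetical sounding locations along a 50 km flight line (m)
locations = np.linspace(0, 50_000, 1_000)

@dask.delayed
def solve_em_pde(x):
    return np.exp(-x / 25_000.0)  # stand-in for a full PDE solve

# Build 1,000 independent tasks and run them in parallel
predicted = dask.compute(*[solve_em_pde(x) for x in locations])
```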

HBA: This makes me think about the use of Dask for machine learning, which you can consider as a sort of computational two-by-two matrix: on the x-axis you have data size and on the y-axis you have model size, and one is RAM-bound while the other is CPU-bound. This isn’t machine learning, but it sounds like your work is model- or simulation-bound in that sense.

[Image: Tom Augspurger presenting at PyData NYC 2019]

LH: Exactly.

HBA: Cool. Can you give me an idea of timescale and how long it takes you to solve something with Dask versus without it?

LH: It’s a function of the number of sources. If we can parallelize across 1,000 sources, that’s a factor-of-1,000 reduction. Another cool thing is we can combine Dask and Zarr for a fivefold increase in speed, which is actually complementary to parallelizing over sources as well. So we can get speedups at both levels, which is pretty exciting.
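The interview doesn’t spell out how Dask and Zarr are combined here, but a common pattern is to persist large intermediate arrays (for example, sensitivity matrices) as chunked Zarr stores and operate on them chunk by chunk. A minimal sketch under that assumption:

```python
# Assumed pattern: persist a large chunked array to Zarr, then reduce it
# chunk by chunk in parallel. The "sensitivity matrix" here is random data.
import dask.array as da

J = da.random.random((20_000, 5_000), chunks=(1_000, 5_000))
da.to_zarr(J, "sensitivity.zarr", overwrite=True)

J_lazy = da.from_zarr("sensitivity.zarr")       # lazy, chunked read
JtJ_diag = (J_lazy ** 2).sum(axis=0).compute()  # parallel per-chunk reduction
```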

HBA: So then that’s a speedup of 5,000 times or something like that, right?

LH: Yeah, it can be.

THE IMPACT OF DASK ON THE SCIENTIFIC PROCESS

Dask is shaving massive computations down from weeks to days, and from days to hours.

HBA: Does that change the way you do science or think about science or the iteration process?

LH: I think there are a few different ways Dask changes our process. Some of that depends on basically how expensive the problem was in the first place. For some of these computations that we’re running, folks expect it to take a day to a week. The chunks of time Dask saves us are huge because that means you can run a few more examples now. So we’ll branch out and do more because of that.

But I also think that one of the more interesting things, especially when actually hooking stuff up to ipywidgets, is that users expect real-time feedback. Like if this simulation isn’t instantaneous, I don’t have the patience for that. It’s interesting to expect that in the education setting, but also now to expect that in the research setting. I should actually just be able to change a parameter and have that feedback loop not be so disruptive that I lose my train of thought. If I can reduce something that would have taken 15 minutes down to one minute, that’s actually huge, because in 15 minutes I’m going to brain switch into doing something else. In one minute, that feedback cycle is short enough that I can still keep my train of thought.

HBA: That’s awesome. I am hearing Dask almost helps you get into a flow state with your science. Cool. That’s really nice. So we’ve talked about one application of Dask. Could you share a few more?

LH: Sure. I’ll walk you through maybe two different styles of examples, because two different things come to mind. There are some problems in electromagnetic geophysics where we can think about discretizing them in the time domain or the frequency domain. It’s more intuitive for us to think about things in the time domain. Some experiments are better set up for thinking about the data in the time domain, but there are others that are better set up in the frequency domain. And one example in the frequency domain is an experiment called magnetotellurics.

HBA: What’s that?

LH: Magnetotellurics uses natural fields. The solar wind and lightning strikes around the world generate ambient electromagnetic energy. Then we put electrodes and a magnetic field receiver on the ground and we watch those fields over time. And then we can do some signal processing and get data at each frequency. We can then use that frequency-domain data as an input to an inversion algorithm that can be used to estimate the conductivity of the subsurface. Depending on the setting, these results might be used to characterize a geothermal reservoir, a mineral deposit, or to image deep crustal structures.

What’s nice here is that when we’re solving the inversion problem computationally, it is separable over frequencies and we can solve each of those frequencies completely independently. Parallelizing with Dask here can save us hours (or days!), which is a nice thing.
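Because each frequency’s solve is independent, dispatching them as Dask tasks is straightforward. A minimal sketch, with a toy stand-in for the per-frequency solve and an illustrative frequency band:

```python
# Separable over frequencies: each solve is independent, so each becomes
# its own Dask task. solve_frequency is a toy stand-in for the real solve.
import dask

frequencies = [1, 3, 10, 30, 100, 300]  # Hz, illustrative MT band

@dask.delayed
def solve_frequency(freq):
    return {"freq": freq, "impedance": 1.0 / freq}  # placeholder result

# All frequencies run in parallel; results come back together
results = dask.compute(*[solve_frequency(f) for f in frequencies])
```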

[Image: A path in a field leading to a lightning strike]

I’ve also used Dask for setting up a batch of experiments. I know what my experimental setup needs to be, but now I want to do a parameter sweep and get some data back for all of these different experiments. I think the Dask job queue setup is really cool because I can interactively set up my simulation, and then use the job queue to just say, “Okay, now do this lots of times, write my data to disk, and then I’ll analyze it afterwards.” The user-facing side of things is what’s pretty cool for me: being able to work interactively, and then, when a chunk of the work is expensive, being able to say, “Now go do it on the HPC nodes.” That’s a pretty cool thing to be able to do.
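A hedged sketch of that interactive-to-batch workflow with dask-jobqueue, assuming a SLURM system; the cluster settings and run_experiment are hypothetical:

```python
# Sketch of the interactive-to-batch workflow: set up a sweep in a notebook,
# ship it to HPC nodes via dask-jobqueue, write results to disk for later.
# Cluster settings and run_experiment are hypothetical.
import numpy as np
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=24, memory="64GB", walltime="04:00:00")
cluster.scale(jobs=8)  # request eight batch jobs' worth of workers
client = Client(cluster)

def run_experiment(param):
    result = np.array([param, param ** 2])      # stand-in for a full simulation
    np.save(f"result_{param:.3f}.npy", result)  # write to disk for later analysis
    return param

params = np.linspace(0.1, 10, 100)
futures = client.map(run_experiment, params)
client.gather(futures)  # block until the sweep finishes
```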

HBA: That’s awesome. What I’ve been hearing you say is that Dask is super useful, but Dask in conjunction with other interactive parts of your stack—such as Jupyter Lab, ipywidgets, and the interactive nature of Python in general—is even more useful.

LH: Yeah. But it’s definitely also pretty great for the non-interactive batch workflow, where Dask is shaving massive computations down from weeks to days, and from days to hours. That’s probably where most of the industry is benefiting right now.


Subscribe to our newsletter below and we’ll let you know when we’ve put Part 2 of our interview with Lindsey live, along with all the other content that we’re working on.
