We recently spoke with Lindsey Heagy, Postdoctoral Researcher in the Department of Statistics at UC Berkeley, about her experiences with Dask. Lindsey shared how Dask significantly decreased the time it took to run her geophysical simulations and how that changed how she thinks about her science.
We enjoyed our chat so much that we followed up and asked her to share her favorite Dask features and her thoughts on distributed computing more generally. Oh, and how she uses Dask with machine learning to perform the role of bomb technician (well, the data-science-y part!) from her laptop.
In this post we cover the following topics:
- Favorite Dask features
- Dask pain points
- The rest of Lindsey’s computational stack
- Machine learning meets Dask (with explosives!)
This interview was lightly edited for clarity. Many thanks to David Venturi for his editorial help on this post.
FAVORITE DASK FEATURES
HBA: Can you tell me a bit about your favorite Dask features?
LH: The dashboard is extremely useful for getting an understanding of how Dask is operating. I also really like Dask-Jobqueue. And I think a really empowering step for me was realizing that it actually is this easy to use HPC. Like, with Dask, I didn’t have to take my interactive workflow and completely rewrite everything. I could basically just drop in the high-powered piece exactly where I needed it. No more, no less. And that made me rethink how I use some of these resources.
HBA: That’s great. So you’re taking advantage of Dask being a nice abstraction over NumPy, pandas, and other popular Python libraries.
LH: It’s a more natural way of interacting with HPC, I think. I can prototype an example on my laptop to make sure things are running. But then I can drop the same notebook onto an HPC center and then just basically say, “Okay, this cell is expensive. I want this chunk sent to the HPC nodes.” That’s really easy to do with Dask. Whereas if I actually had to rewrite my notebook into a script that I then connect to Slurm with a batch job, that’s a lot of steps to take. All I actually needed was, “This chunk of my computation is expensive. Handle it.” And Dask lets me do that.
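The workflow described above, where the notebook stays as it is and only the expensive cell goes to workers, can be sketched with a futures-style interface. The sketch below uses Python's standard-library `concurrent.futures` as a local stand-in so that it runs anywhere; `dask.distributed.Client` offers a similar `submit`/`result` pattern, and on a Slurm system the workers would typically be provided through Dask-Jobqueue. The function `run_simulation`, its toy workload, and the parameter values are hypothetical placeholders, not Lindsey's code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_simulation(conductivity):
    """Hypothetical stand-in for one expensive, independent forward simulation."""
    # Toy workload so the example runs quickly.
    return sum(conductivity * k for k in range(1000))


# The cheap, interactive part of the notebook stays ordinary Python.
conductivities = [0.01, 0.1, 1.0, 10.0]

# "This cell is expensive": hand only this part to a pool of workers.
# With Dask, the executor would be swapped for a distributed Client
# connected to a local or HPC cluster (an assumption about your setup).
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_simulation, c) for c in conductivities]
    results = [f.result() for f in futures]

print(results)
```

The rest of the notebook does not need to know where `results` was computed, which is what makes this kind of drop-in offloading attractive.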
DASK PAIN POINTS
HBA: So Dask reduces friction in a lot of ways for you. Speaking of friction, are there any pain points that you have with Dask?
LH: There are a few that we’ve encountered. Some are a bit more obscure.
When interfacing with codes that have their own parallelization, you have to be really careful. So for example, we use linear algebra solvers that are shipped with MKL. But they have their own parallelization under the hood. You need to make sure that you set the threads for that under the hood to only be single threaded. It’s a simple solution once you know what’s going on but it’s an easy thing to miss. It’s dangerous because the resulting matrix is the right size but it’s often just complete nonsense, which is a really scary thing.
But I think this is a bigger problem than Dask, right? Each algorithm might have its own strategy for parallelizing code. Dask lets us parallelize operations in Python, but in many applications, we want to call those algorithms which are written in lower-level code like C or Fortran. If we aren’t careful, these two sources of parallelization can end up running over each other without an error message being thrown. I mean, parallel computing in general is hard to debug. The trick to finding if and when things have gone awry is getting a sense of the memory and CPU load of your job and getting the right number of parallel threads going at once or on a single node. Figuring these things out as a user just takes a bit of fiddling.
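One common way to keep a solver's internal threading from colliding with an outer layer of parallelism is to pin the numerical library to a single thread per worker. The following is a minimal sketch, assuming the solver honors the standard `MKL_NUM_THREADS`, `OMP_NUM_THREADS`, and `OPENBLAS_NUM_THREADS` environment variables and that they are set before the numerical libraries are imported. The helper `max_workers` is a hypothetical name used only for illustration.

```python
import os

# Assumption: these variables are read when the numerical library is first
# loaded, so they must be set before importing NumPy/SciPy in each worker.
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"


def max_workers(cores, threads_per_worker):
    """Largest worker count that keeps total threads within the core count."""
    return max(1, cores // threads_per_worker)


cores = os.cpu_count() or 1
print("workers:", max_workers(cores, 1),
      "solver threads per worker:", os.environ["MKL_NUM_THREADS"])
```

The bookkeeping is simply that workers times threads per worker should not exceed the cores available on a node. The right split still depends on the memory and CPU load of each job.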
And then I think the other thing is that for some of these longer, more CPU-intensive computations, the current dashboard where you see things happening quickly isn’t necessarily the most natural to interact with. Like, if you have something that’s going to be running for a couple of days, maybe there’s another reporting mechanism that makes sense for those styles of computation.
THE REST OF THE COMPUTATIONAL STACK
HBA: That’s helpful and makes sense. For the rest of your computational stack, what other software do you use?
LH: We pretty much stick to core Python. Basically NumPy, SciPy, matplotlib. And then, if you like, C wrappers of solvers and things like that.
I’m more recently on the machine learning side of things so I’ve been using a lot of PyTorch and TensorFlow. scikit-learn, as well. I’m starting to get more familiar with xarray because I think there are a lot of spaces where it could be quite useful.
MACHINE LEARNING MEETS DASK (WITH EXPLOSIVES!)
HBA: Can you remind me which of the things you work on that you use machine learning for?
LH: Right now, the neural network example is that I’m trying to identify unexploded ordnance from electromagnetic data. These are basically bombs or munitions that didn’t go off and are buried in the ground somewhere. For example, places where there had been military training grounds or wars, and where they want to remediate the land. But there’s often also a ton of scrap metal at these sites. We want to detect those metallic objects underground. But then if you go to a site and you detect 300,000 objects and only 100 of those are ordnance, the cost of going and digging up every item that you suspect is ordnance is extremely high. You basically send somebody down with like a tiny shovel and a toothbrush, and they have to be a bomb technician and each time they need to ask, “Is this hazardous?”
Our goal is to discriminate or classify what is ordnance and what is not. And so, in the past, folks have been doing this with inversion, where they try to estimate parameters that are intrinsic to the object. Parameters like polarizabilities, how strong of a dipole is this thing in three dimensions. And so then they match that to a library of known objects. So you go through this step, where we’re now gonna solve Maxwell’s equations to try and identify some parameters and then library-match that.
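The library-matching step can be illustrated with a nearest-neighbor lookup. This is a minimal sketch, not the method used in practice: the library entries, their polarizability values, and the function `library_match` are all made up for illustration, and comparing in log space is a design choice of this sketch so that relative rather than absolute differences drive the match.

```python
import math

# Hypothetical library: three principal polarizabilities per known object.
# Names and numbers are invented for illustration only.
LIBRARY = {
    "ordnance_type_a": (9.0, 2.0, 2.0),
    "ordnance_type_b": (5.0, 1.5, 1.5),
    "scrap_plate": (3.0, 3.0, 0.5),
}


def library_match(estimate, library=LIBRARY):
    """Return the closest library entry and its misfit to the estimate."""
    def misfit(reference):
        # Euclidean distance between log10 polarizabilities.
        return math.sqrt(sum((math.log10(e) - math.log10(r)) ** 2
                             for e, r in zip(estimate, reference)))

    best = min(library, key=lambda name: misfit(library[name]))
    return best, misfit(library[best])


# Polarizabilities as they might come out of an inversion (assumed values).
name, distance = library_match((8.5, 2.1, 1.9))
print(name, round(distance, 3))
```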
What we’re trying to do with machine learning is saying, “Okay, can we train a neural network on a bunch of data that we can simulate?” And then say, “Can we directly input data to the neural network and get back a probability that this thing is an ordnance object or not?” So the goal then is to produce a probability map over your site that says, we think there’s something here, here, and here, and the rest looks like it’s clutter or background. That’s where I’ve been using more PyTorch at this point.
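To make the idea of a probability map concrete, here is a small sketch that turns raw classifier scores into per-cell probabilities and flags likely ordnance. The scores, the grid, and the 0.5 threshold are hypothetical; in the workflow described above the scores would come from a trained neural network rather than a hard-coded list.

```python
import math


def sigmoid(score):
    """Map a raw classifier score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores, one per grid cell of the survey site.
scores = [
    [-4.0, -3.5, 2.2],
    [-5.0, 0.3, 3.1],
    [-4.2, -3.9, -4.8],
]

probability_map = [[sigmoid(s) for s in row] for row in scores]

# Flag cells whose probability exceeds an analyst-chosen threshold (assumed 0.5).
threshold = 0.5
flagged = [(i, j) for i, row in enumerate(probability_map)
           for j, p in enumerate(row) if p > threshold]
print(flagged)
```

Cells that are not flagged would be treated as clutter or background, which is what reduces the number of items someone has to dig up by hand.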
HBA: That’s really cool. So you use Dask to parallelize these large-scale simulations, and then the data output by the simulations is used to train machine learning models. You don’t have Dask interacting with PyTorch per se, as that’s decoupled.
LH: Yes. At this stage, to generate training data, Dask can be a powerful tool for running many simulations at once. I can also see opportunities for exploring neural network architectures and exploring parameter spaces, but at the moment, we are very much in the exploratory phase, and not quite ready to scale.
HBA: Thank you for your time, Lindsey.
LH: My pleasure.