Nicholas Sofroniew, Imaging Tech Lead at Chan Zuckerberg Initiative, and Talley Lambert, Microscopist and Lecturer at Harvard Medical School, recently joined us to chat about viewing and processing large datasets, with examples from the bio-imaging world. They’re experts in this area as developers of the napari package, and they showed us how to get the most out of it.
“This is a cell undergoing mitosis. Preprocessing and viewing [this] dataset interactively is unique, and Dask makes it easy.”
In this post, we’ll summarize the key takeaways from the stream. We cover:
- napari basics
- napari for small image data
- napari + Dask for big image data!
- Simplifying with dask-image
- Simplifying even more with napari plugins
You can find the code for the session in this GitHub repo and watch the video here:
Introducing Talley and Nick
Understanding what Talley and Nick do helps contextualize the technology they used in the live stream.
- Talley works in Harvard’s core facility helping people design and execute experiments.
- Nick’s imaging tech team provides reproducible bioimage analysis to scientists.
napari is a multidimensional image viewer for Python.
According to Nick, “napari is a multidimensional image viewer for Python. If you’re doing image analysis, often these images might be three-dimensional or four-dimensional with time, or they might have multiple resolutions. There haven’t been many solutions for interactively exploring images that go well with Python data analysis and machine learning tools. We started napari to provide some of that functionality.”
Talley dove into a notebook showing off how Dask and napari are used to process and view large image datasets. First, he imported napari and a utility to help him inside a Jupyter notebook.
Talley continued, “The main offering in napari is usually accessed through napari.Viewer().” There are a handful of file readers built into this object, which Talley assigned to a variable called viewer. He then opened two .tif files with viewer.open().
“This is a single cell, acquired on a lattice light-sheet microscope, that is undergoing mitosis. So in magenta there you’re seeing a chromatin condensed into the little chromosomes and they’re about to divide. And in this channel, we’ve got microtubule tips, these are sort of structural elements for lack of a better term that are going to form the mitotic spindle and pull apart the chromosomes.”
In the image above, we see how napari has a slider for each dimension beyond the ones being displayed, alongside layer controls like opacity and contrast. Talley noted, “[napari] also [has] the concept of layers where we can look at multiple different datasets on top of each other blended in various different ways. In this case, we’re just looking at a couple of image layers, but we can overlay analyses like points, vectors, surfaces, shapes, etc.”
Viewing a plain NumPy array with napari
Behind the scenes, we’re ultimately getting the [image] data into some sort of NumPy interface, like a NumPy array.
Like most image viewers, napari accepts multidimensional NumPy arrays.
Talley contextualized what he’s about to do. “So that was just loading an image. But most of the time, behind the scenes, we’re ultimately getting the data into some sort of NumPy interface, like a NumPy array. For instance, I can read that .tif into a NumPy array using scikit-image’s imread function.”
“So the stack variable here is just a 3D NumPy array, and I can then use napari’s view_image function and I get a viewer.”
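Here is a minimal sketch of the two ways of getting data into napari described so far. It assumes napari is installed and a GUI is available. The synthetic stack stands in for the array that imread would return, and the helper names show_stack and open_files are ours, not from the session.

```python
import numpy as np

# Synthetic stand-in for the 3D array that skimage.io.imread would return
# for a single .tif volume, with axes ordered (z, y, x).
rng = np.random.default_rng(0)
stack = rng.integers(0, 4096, size=(16, 64, 64), dtype=np.uint16)


def show_stack(array):
    """View an in-memory array; napari adds a slider for the extra dimension."""
    import napari  # imported lazily so the array code above runs without a GUI

    return napari.view_image(array, name="stack")


def open_files(paths):
    """Let napari's built-in readers load one or more files from disk."""
    import napari

    viewer = napari.Viewer()
    viewer.open(paths)
    return viewer


# In a notebook with a GUI event loop you might then run:
#   viewer = show_stack(stack)
```

Either route ends in the same place: the viewer is holding something that behaves like a NumPy array.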
This image needs some preprocessing before viewing
The bottom line is that we’re going to need to do some preprocessing before we can even view the data.
These images were taken on a stage-scanning light sheet microscope, which acquires raw 3D data in a skewed (sheared) coordinate space relative to the real world. You can watch this video for more details, but the bottom line is that they need to do some preprocessing before they can even view the data, which is common for a lot of scientific datasets. Talley described the preprocessing required. “In this case, the preprocessing is an affine transform and very often we’ll also do some deconvolution to increase the contrast of the image or cropping, etc.”
He first showed us the result of a deconvolution, noting that he’s not showing us 3D here yet.
He then drew our attention to two functions that he used to get this image: deskew_gpu and decon. “I’ve taken my original stack variable, I’ve applied this deskewing, then I take that deskewed stack and I apply deconvolution.”
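To get a feel for what deskewing does geometrically, here is a rough CPU-only sketch. It is not Talley's deskew_gpu: deskew_cpu is a hypothetical stand-in that shifts each z-plane by a whole number of pixels, whereas a real implementation applies an affine transform with interpolation. Deconvolution is left out entirely.

```python
import numpy as np


def deskew_cpu(stack, shift_per_plane=1):
    """Shear a (z, y, x) stack by shifting plane z by z * shift_per_plane pixels in x.

    Hypothetical, simplified stand-in for a GPU affine deskew.
    """
    nz, ny, nx = stack.shape
    out_nx = nx + shift_per_plane * max(nz - 1, 0)
    out = np.zeros((nz, ny, out_nx), dtype=stack.dtype)
    for z in range(nz):
        offset = z * shift_per_plane
        out[z, :, offset:offset + nx] = stack[z]
    return out


raw = np.arange(24).reshape(3, 2, 4)   # tiny fake volume
deskewed = deskew_cpu(raw, shift_per_plane=1)
print(raw.shape, "->", deskewed.shape)
```

Note that the output is wider than the input, which is one reason saving every intermediate result adds up so quickly.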
Note: deskew_gpu and decon are non-napari functions. Though napari may have these functionalities in the future, it’s awesome that Python can be used for deskewing and deconvolution in the meantime. That’s the advantage of building on top of the Python stack: you can use the whole ecosystem!
Talley took a step back and looked at the problem from a distance. “This is just a single image, with a small field of view. But if I saved the raw data, then saved the deskewed so I didn’t have to re-deskew it every time, then saved the deconvolved data (which everybody likes to look at and is the easiest one to do segmentation on and processing), it quickly adds up.”
It gets worse. “And then we generally have two or three different channels and often we have thousands of time points. So if we’re doing a time lapse experiment that will quickly get up to hundreds of GB that we can collect in 30-60 minutes. And we’ll do that multiple times a day, and we could do that every day.”
Talley summarized the problem. “It basically becomes impossible for me to jump around views to answer the question, ‘How did this experiment end?’ ‘Did this cell die?’ ‘Did it divide?’ I need to evaluate my data and yet simply looking at it is hard.”
How do we solve this?
Lazy “just-in-time” IO with Dask
You can see Dask doing work in the background. I can go through time, see the whole cell dividing, and I’m only loading what I need to.
Matt: “If only there were a project that could give you array-like functionality on top of large datasets.”
Talley: “Right. If only. This is where I’m going to start using Dask.” Talley described why and how he’s going to use Dask to read in massive image data. Note, he’s not going to do any preprocessing yet.
“Since this data likely won’t fit into local RAM, we need a way to load images only upon request. We can use dask.delayed to convert our skimage.io.imread function into a lazy version. We then make a dask.array of these lazy arrays.”
Here’s where napari comes in. “napari can then call .compute() only on the necessary frames depending on how the user interacts with the dimension sliders.”
Let’s break down that code. “I wrap scikit-image’s imread function in dask.delayed and now I have a version of that function that is lazy, so it only gets called on demand. And I’m going to take a whole folder of dataset files here, wrap all of those in this lazy reader, and then make a Dask array from those delayed objects. This cell executes immediately because I’m just declaring the shape of this experiment: it’s 100 time points, each of which is a 3D dataset.”
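The pattern Talley describes can be sketched as follows. To keep it runnable without any files on disk, we swap skimage.io.imread for a hypothetical read_volume function that fabricates a small volume from the filename; the filenames, shape and dtype are all made up for illustration.

```python
import numpy as np
import dask
import dask.array as da

SHAPE = (4, 8, 8)      # assumed (z, y, x) shape of every file
DTYPE = np.uint16
calls = []             # records which "files" actually get read


def read_volume(path):
    """Hypothetical stand-in for skimage.io.imread(path)."""
    calls.append(path)
    timepoint = int(path.split("_")[-1].split(".")[0])
    return np.full(SHAPE, timepoint, dtype=DTYPE)


filenames = [f"timepoint_{t:03d}.tif" for t in range(100)]

lazy_read = dask.delayed(read_volume)            # lazy version of the reader
lazy_volumes = [lazy_read(fn) for fn in filenames]
arrays = [da.from_delayed(v, shape=SHAPE, dtype=DTYPE) for v in lazy_volumes]
stack = da.stack(arrays, axis=0)                 # (time, z, y, x), nothing read yet

calls_before = len(calls)
frame = stack[42].compute()                      # roughly what a viewer does per slider move
print(stack.shape, calls_before, calls)
```

Because from_delayed is told the shape and dtype up front, building the stack never touches the data; only the time point that is actually requested gets read.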
Talley excitedly showed us how Dask and napari work together. “And if I view that, it opens immediately, and I can fly around and you can see Dask doing work in the background. I can go through time and see the whole cell dividing. I’m only loading what I need to!”
Hugo asked, “So you’re loading each image as we need to view it? As you scroll, right? So stuff to the right hand part of the slider, we’re not loading that until you go over there.”
Talley answered, “Exactly right!”
Lazy pre-processing with dask.map_blocks
As I move around the napari slider, [Dask] will do that [preprocessing] on demand.
Now, we need to solve the preprocessing-at-scale problem. Talley’s tool of choice is again Dask, specifically its map_blocks function (dask.array.map_blocks, also available as a method on Dask arrays).
Talley continued, “That was just lazy reading, but it’s collected in this skewed geometry, so I can’t really review it yet. And so getting back to the deskewing and deconvolution, I can use dask.map_blocks, which lets me map a function across all of the chunks in my Dask array. So I can, in a lazy fashion again, take my delayed Dask array of commands and say, ‘Not only should you open the file and view it, but you should open it, deskew it and deconvolve it and then show it to me.’ And as I move around the napari slider, it will do that [preprocessing] on demand.”
Talley described the situation in more technical detail. “I’ve got these two functions, deskew_gpu and decon. I can take my Dask stack of delayed arrays, and map this function (deskew) and get out a new lazy array on which no computation has happened. I can now take that again and map another function (decon), and what I get out is yet again no computation. I can even do more — I can then crop.” These are three common preprocessing tasks.
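Here is a minimal sketch of chaining lazy steps with map_blocks, under loud assumptions: deskew_block is a simplified integer-pixel shear and fake_decon is only a placeholder that rescales values, so neither is Talley's deskew_gpu or decon. The array sizes are made up. Because the deskew changes the block shape, we pass chunks= to tell Dask what to expect.

```python
import numpy as np
import dask.array as da


def deskew_block(block, shift=1):
    """Hypothetical deskew on a (t, z, y, x) block: shift plane z by z * shift in x."""
    nt, nz, ny, nx = block.shape
    out = np.zeros((nt, nz, ny, nx + shift * max(nz - 1, 0)), dtype=block.dtype)
    for z in range(nz):
        out[:, z, :, z * shift:z * shift + nx] = block[:, z]
    return out


def fake_decon(block):
    """Placeholder for deconvolution: just converts to float32 and rescales."""
    return block.astype(np.float32) * 2


# Five fake time points, one chunk per time point (as with one file per time point).
data = np.stack([np.full((3, 4, 6), t, dtype=np.uint16) for t in range(5)])
stack = da.from_array(data, chunks=(1, 3, 4, 6))

deskewed = stack.map_blocks(deskew_block, dtype=stack.dtype, chunks=(1, 3, 4, 8))
deconvolved = deskewed.map_blocks(fake_decon, dtype=np.float32)
cropped = deconvolved[:, :, 1:3, 2:6]            # still lazy

frame = cropped[3].compute()                     # only time point 3 is processed
print(cropped.shape, frame.shape)
```

Each step returns a new lazy array, so the whole pipeline runs only for the chunk that is eventually requested.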
Here’s where napari comes in again. Talley throws the output of his lazy Dask code (the cropped variable) into napari to view it.
And here’s what napari gives us.
Talley described what’s happening. “This is now calling the GPU in the background. When I go to a different time point, over on the right you can see Dask doing the work. In 3D, now the cell looks like how a cell should look. It’s been deskewed, the contrast has been bumped up by the deconvolution, and I can go to any point now here.”
Hugo framed the technical accomplishment achieved here. “I love that I’ve never used napari, but the code you wrote is code that I know and love. And it’s very Pythonic, and there’s a relatively small amount that you have to do to generate all of this stuff.”
A lot of this can be wrapped with dask-image.
Talley shared something with us. “I kind of showed you the hard way for education purposes. A lot of this can be wrapped with dask-image. This lazy imread functionality, etc.”
He showed us what the code above looks like when simplified with dask-image.
It’s much more concise! It’s still important to know what’s happening in the background, especially if you need to tinker with dask-image’s functionality for your specific image reading problem.
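As a rough idea of what the shorter version might look like, here is a sketch that assumes the dask-image package is installed and that the pattern matches image files on disk. The helper name load_stack and the example pattern are ours; how the resulting axes are laid out depends on your files, so check the shape before relying on it.

```python
def load_stack(pattern):
    """Lazily read every file matching a glob pattern into one Dask array."""
    from dask_image.imread import imread  # imported lazily; needs dask-image

    return imread(pattern)


# Hypothetical usage, with a made-up path:
#   stack = load_stack("/path/to/experiment/*.tif")
#   print(stack.shape, stack.chunks)
```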
A napari plugin on the inside is pretty much just Python and Dask.
Could it get any easier than what Talley showed us with dask-image? Apparently, yes.
“Once you get something that takes a path and gives back a Dask Array or a NumPy array, you can wrap all of that functionality in a napari plugin, which allows someone to literally just drag and drop a folder or zip file onto the napari viewer and it will take care of you.”
Nick emphasized that the code inside a napari plugin stays reusable outside napari. “A napari plugin on the inside is pretty much just Python and Dask that you can use without any GUI, so it doesn’t depend on napari at all. That’s been one of our goals, to integrate with the rest of the Python ecosystem so that things can be easily reused.”
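To illustrate Nick's point, here is a sketch of the core of a reader plugin that never imports napari. It follows the reader convention as we understand it: a function that takes a path and returns either None or a reader, which in turn returns a list of (data, metadata, layer_type) tuples. The function names and the choice of .npy files are ours, and registering the reader with napari (via the plugin's manifest) is not shown.

```python
import os
from glob import glob

import numpy as np
import dask
import dask.array as da


def read_folder(path):
    """Stack every .npy volume in a folder into one lazy (t, z, y, x) array."""
    files = sorted(glob(os.path.join(path, "*.npy")))
    sample = np.load(files[0])  # read one file eagerly to learn shape and dtype
    lazy_load = dask.delayed(np.load)
    volumes = [
        da.from_delayed(lazy_load(f), shape=sample.shape, dtype=sample.dtype)
        for f in files
    ]
    stack = da.stack(volumes, axis=0)
    name = os.path.basename(os.path.normpath(path))
    return [(stack, {"name": name}, "image")]


def get_reader(path):
    """Return read_folder if path is a folder containing .npy files, else None."""
    if isinstance(path, str) and os.path.isdir(path):
        if glob(os.path.join(path, "*.npy")):
            return read_folder
    return None
```

Since these are plain functions, you can call get_reader on a folder from a script or a test and work with the returned Dask array directly, with no viewer involved.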
Faster Data Science
The Python ecosystem is powerful, and a meaningful part of that is napari’s ability to provide fast, interactive, multi-dimensional image viewing. Thank you to Talley and Nick for sharing their time and expertise with us.
You too can accelerate your data science today on a Coiled cluster. Coiled also handles security, conda/docker environments, and team management, so you can get back to doing data science. Get started for free today on Coiled Cloud.