Who Uses Dask?
• January 4, 2021
Dask is a free and open-source library for parallel computing in Python. Dask helps you scale your data science and machine learning workflows. Dask makes it easy to work with Numpy, pandas, and Scikit-Learn, but that’s just the beginning. Dask is a framework to build distributed applications that has since been used with dozens of other systems like XGBoost, PyTorch, Prefect, Airflow, RAPIDS, and more. It’s a full distributed computing toolbox that fits comfortably in your hand.
As we’ll see in this post, researchers, data scientists, banking firms, and even government agencies use Dask across various domains, including:
- Retail: Walmart, JDA, Grubhub
- Life Sciences: Harvard Medical School, Chan Zuckerberg Initiative, Novartis
- Finance industry: Barclays, Capital One
- Geophysical facilities: NASA, LANL, UK Met Office
- Softwares: RAPIDS, Pangeo, PyTorch, XGBoost, XArray
This post was inspired by Matthew Rocklin’s “Who uses Dask?” talk, which you can check out below:
If you use Dask or know of use cases we haven’t covered, please let us know at firstname.lastname@example.org!
Dask is used by retail corporations, from massive chains to small startups. Dask is used at
- Walmart, for demand forecasting
- Blue Yonder, to analyze terabyte-scale data
- Grubhub, as the Pythonic solution for scalable data science across their entire pipeline
Walmart uses Dask for forecasting the demand for 500,000,000 store-item combinations. Their goal is to make sure an in-demand item is available in sufficient quantities at all their outlets. This is a huge computational problem, but they got a 100x acceleration using RAPIDS and XGBoost, softwares that use Dask under-the-hood. According to John Bowman, the director of Data Science Labs at Walmart,
“We have a 12-hr window to develop the features, train the models, and push the forecasts to the downstream system. The acceleration we’re seeing with random forests and gradient boosting is on the order of 100 to 1.”
Blue Yonder uses Dask to process terabytes of data on a daily basis. They can write pandas-like code in Dask, which can then be pushed directly to production. This helps keep their feedback cycles short and waste low.
Grubhub uses Dask alongside Tensorflow for pre-processing and ETL. In a Python-based stack, it gets challenging for them to work with Spark/JVM. Swapping out Spark with Dask allows them to continue working in Python and still get the functionalities they need. Alex Egg, Data Scientist at Grubhub says:
“Dask helps us iterate faster because it allows us to consolidate onto one stack (python vs Spark/JVM). We no longer need a piecemeal approach where ETL is done on a Spark/JVM cluster and then development switches over to a GPU machine. Most tasks can now be done on one machine with Dask, or with a Dask cluster”
Source: SIM AT HMS
Researchers choose Python for data science today, thanks to Python’s readability, rich ecosystem, and active community. Dask fits perfectly with the Python Open Science ecosystem, making it the scalable computing library of choice for scientists at Harvard Medical School, Novartis Institute for Biomedical Research, and other research labs.
Dask is used for high resolution, 4-dimensional, cellular imagery by Harvard Medical School, Howard Hughes Medical Institute, Chan Zuckerberg Initiative, UC Berkeley Advanced Bioimaging Center, and more. They record the evolution and movements of a 3-dimensional cell over time, in maximum detail. This generates large amounts of data that is difficult to analyze with traditional methods. Dask helps them scale their data analysis workflows with its familiar API that resembles NumPy, pandas, and scikit-learn code. According to Talley Lambert, a microscopy researcher at Harvard:
“Dask lets me prototype pipelines on my laptop and scale easily to my institution’s compute cluster. The fact that it mimics common APIs made adopting it nearly effortless… Before Napari and Dask we typically turned to cherry-picking 2D planes”
Dask is also used at the Novartis Institute for Biomedical Research to scale machine learning prototypes. Eric Ma, from Novartis says
“It was painless to get setup since all of our data were on-premise and HPC accessible. I did not have to jump out of my existing workflow. Workflow is the most important piece for a data scientist, it is the culmination of our habits and systems that keep us efficient. Dask lets Python-writing data scientists preserve idiomatic and widely-used workflows, while giving us easy scalability.”
The finance sector is a major user of Dask today, it’s used by
- Capital One for accelerating ETL and ML pipelines
- Barclays for financial system modeling
The Machine Learning team at Capital One goes through each section of their workflow and accelerates them, oftentimes using Dask. For instance, they went from training a pipeline in 2.5 weeks to a matter of days using Dask. They got another 10x the speedup using RAPIDS – a GPU acceleration tool built with Dask.
Barclays uses Dask for financial system modeling, also known as credit-risk modeling. The focus here isn’t on big data, but on complex systems. As shown in the above figure, financial modeling combines hundreds of individual models with complex dependencies. Dask is well equipped to handle this complexity. Softwares like Spark and MapReduce require regular data, while Dask thrives in this messiness, making it the best option for this job. In their own words:
“We love that Dask lets us write these down in Python. It’s great that we can log every transaction. Dask’s pluggability helps us meet our regulatory requirements”
Many other banking firms and hedge funds use Dask. Your credit card probably runs on Dask too!
Dask is used in Climate Science, Energy, Hydrology, Meteorology, and Satellite Imaging.
Oceanographers produce massive simulated datasets of the earth’s oceans. They couldn’t even look at the outputs before Dask. Dask is allowing scientists to become scientists again, instead of thinking about data engineering and IT concerns. Chelle Gentemann, from the Farallon Institute says:
“I use Dask to decrease the time spent on computation and programming and increase the time I have to analyse results.”
Researchers at the Los Alamos National Labs look at large seismology data from sensors around the world. They had to use C++ and MPI to analyze this large time series data. Dask allows them to work with these large datasets in a language they are comfortable in, Python. As Jonathan MacCarthy says:
“Before Dask, large-scale research questions were too “expensive” to ask, because we would’ve had to abandon the research code written in Python. Now, we’re able to rapidly gain insights into these questions with minimal changes to our research tooling”
The UK Meteorology Office is the national weather service for the United Kingdom, and their Dask story is fascinating!
The UK Met Office collects a large number of observations from satellites and weather stations around the world, and runs big simulations. They have around 500 analysts that analyze this data every day because weather predictions are critical, knowing about the next hurricane or storm can save many lives.
Before Dask, they used an in-house software called Iris. They had legacy systems running on Iris and a lot of in-house expertise with Iris. They wanted Dask’s flexibility and scalability, so they worked with the open-source community to replace the internals of Iris with Dask. The employees could continue using Iris which now has Dask running internally. They’re also transitioning to a cloud architecture and Dask remains helpful because of its flexibility — it can run anywhere, even the cloud!
The UK Met Office has also influenced other national weather services to use Dask, like the NCA, NASA and USGS.
Many people don’t even know that they are using Dask because Dask has become infrastructural and runs underneath the systems they interact with. Dask is integrated into many libraries that are more efficient at doing a particular task. For example, geospatial software like Pangeo, and xarray; timeseries software like Prophet, and tsfresh; ETL/ML software like scikit-learn, RAPIDS, and XGBoost; workflow management tools like Apache Airflow, and Prefect; all use Dask!
Tens of thousands of individuals use Dask across scientific research labs to industries. At Coiled, we want to make Dask accessible to everyone, everywhere. We’re developing products around Dask, helping more people learn Dask, and building a community of Dask enthusiasts. Try Coiled today by clicking the link below!