Dask, the open source package for scalable data science, was developed to meet the needs of modern data professionals. This post describes the evolution of the Dask project and how it meets the needs of people working with medium-to-large datasets across industries (such as energy, finance, and the geosciences) and basic research (such as astronomy, biomedical imaging, and hydrology). It also reveals how the Dask story is the story of the scientific Python ecosystem as a whole, both in terms of the community of developers and how Dask interoperates with packages such as NumPy, Pandas, Scikit-Learn, XGBoost, and Xarray, among others. It describes the evolution of Dask for single-machine parallelism, the integral task scheduler, distributed execution across a cluster, and how Dask has been increasingly adopted at a grassroots level and in the enterprise.
Imaging of a cell made possible by Dask, Talley Lambert, Harvard Medical School
The Anaconda Years 2014-2015
DARPA and Blaze, the proto-Dask
Dask was originally developed at Continuum Analytics, a for-profit Python consulting company that eventually became Anaconda Inc, the creator of many open source packages, and the popular Anaconda Python distribution.
Dask grew out of the Blaze project, a DARPA funded project to accelerate computation in open source. Blaze was an ambitious project that tried to redefine computation, storage, compression, and data science APIs for Python, led originally by Travis Oliphant and Peter Wang, the co-founders of Anaconda. However, Blaze’s approach of being an ecosystem-in-a-package meant that it was harder for new users to easily adopt. We learned that rewriting a whole software ecosystem is hard. This is both for the obvious technical challenges, as well as cultural ones (data scientists adopt tools one-at-a-time on an as-needed basis). As a result, we started to intentionally develop new components of Blaze outside the project (blosc, dask, dynd, etc.). This made the software easier to adopt in a broad variety of settings, and generally helped to enable greater community adoption.
We learned that augmenting the existing ecosystem with just the right component had far more impact than trying to move people to a whole new ecosystem. With this principle of minimalism in mind, Dask was born. We all had general, wide-ranging conversations about the direction we wanted the work to go in. It was the end of 2014, I was back home for the Christmas holidays sitting under the California sun, and had some time to relax and think. I wanted to come up with the simplest way to do parallel NumPy operations. Dask’s first commit was on December 22, 2014. It was a small and simple commit, just 16 lines of code, that would still run Dask computations today, albeit very slowly. Starting simple paid off because it allowed other projects and other developers to engage quickly and use Dask in all sorts of unexpected situations. The simpler the tool, the broader the use cases. But this begs the question: why did we start with NumPy?
Dask Parallelizes NumPy
At the time, Spark was getting some traction, but there wasn’t a good story in Python to handle parallel computing. So the question was “how do we parallelize the existing SciPy stack? All the libraries?”
More specifically, one pain point we heard time and time again was that people worked with data that fit comfortably on disk but that was too big for RAM, and accelerating NumPy was a common feature request to Continuum/Anaconda at the time. To this end, the purpose of Dask was originally to parallelize NumPy so that it could harness one full workstation computer, which were common in finance shops at the time. There were two goals technical goals, and a social goal:
- Technical: Harness the power of all of the cores on the laptop/workstation in parallel;
- Technical: Support larger-than-memory computation, allowing datasets that fit on disk, but not in RAM;
- Social: Invent nothing. We wanted to be as familiar as possible to what users already knew in the PyData stack.
Failure of MapReduce, and the rise of Task Scheduling
The first versions of Dask tried to use map-reduce style parallelism with Spark behind the scenes. However it quickly became obvious that this paradigm for parallelism wasn’t sufficiently flexible for the kinds of algorithms that Python NumPy developers used on an everyday basis. There had been many attempts of NumPy-on-Spark and none of them had really succeeded. We had to make our own engine for parallelism.
This led us to the parallel task scheduler, the beating heart of Dask. To make parallel NumPy, we actually had to make a fairly sophisticated task scheduler that could handle arbitrary data dependencies, and was lightweight enough to be adopted within the Python Data science ecosystem.
Then building parallel NumPy on top of that scheduler was fairly straightforward.
Early Dask Community Adoption 2015-2017
The first audience Dask captured was other library maintainers.
Xarray and Scikit-Image
The first projects to really adopt Dask were Xarray (commonly used in geo sciences) and Scikit-Image (commonly used in image processing). The authors of these libraries knew NumPy well, and were excited by a project that looked almost exactly like NumPy, but solved one of their biggest pain points: data that fit comfortably on disk, but was too big for RAM. Dask’s sweet spot in the early days was the 10GB-100GB range, which is still a very common range for practical applications. Dask was lightweight enough to adopt and depend on, and familiar enough with the Python maintainers to work with comfortably.
Dask was integrated into Xarray within a few months of being born. That gave it its first user community, which remains strong to this day. It’s important to make clear how key community is to all of these projects: for example, at this time Stephan Hoyer, the main contributor to Xarray, and I lived quite close to one another. He took a two week sabbatical, during which we hooked Dask and Xarray up. So from day one, it was clear that the Dask team was more than the core Dask team. It’s the Dask team plus the broader scientific Python developer community, a theme we see play out time and time again, to this day.
Many of our clients at the time were thrilled with the parallel NumPy implementation, however a surprising number of them saw the task scheduler and said
“Hey, we might have some uses for this task scheduler in some of our other workloads.”
This usage of the internal task scheduler was originally really surprising to the Dask maintainers, but quickly became natural. The flexibility of Dask’s task scheduler was a surprising but amazing feature that quickly defined the project.
Pandas, Scikit-Learn, and others
Image from Jake VanderPlas’ keynote, PyCon 2017
Once Dask was working properly with NumPy, it became clear that there was huge demand for a lightweight parallelism solution for Pandas DataFrames and machine learning tools, such as Scikit-Learn. Dask then evolved quickly to support these other projects where appropriate. Maintainers of those libraries were excited by the prospect of a lightweight parallelism solution that let them add parallelism around their projects without having to buy into a huge piece of machinery. Dask was rapidly adopted because it was pure-Python, and relatively simple. To this day, the single-machine task scheduler in Dask is only about a thousand lines of Python code.
The Scikit-Learn developers, in particular, pushed Dask to expose the task scheduler not only to library authors, but to traditional ML users. And so we took the joblib.delayed function decorator and created dask.delayed, Dask’s first API for general parallelism outside of any “big data” abstraction like NumPy or Pandas. This came out of a long debate with Gael Varoquaux (a Scikit-Learn maintainer) outside of a brasserie in Lille, just before the ICML conference that year. I was actually fully against dask.delayed at the start, but Gael showed up the next morning with a working implementation and proved me wrong. Anyone could use Dask to parallelize pretty much anything in Python. It revolutionized how Dask was used ever since.
- Complex workloads like cross-validated GridSearch across pipelines,
- Larger than memory workflows.
Their efforts helped to forge the development of custom algorithms, and planted the seeds for Dask-ML. Similarly, other members in the Python community, like Min Ragan-Kelley from Project Jupyter, and the author of IPython parallel (the best distributed Python solution at the time) helped us to design the first dask distributed scheduler for execution across multiple machines. These interactions speak to how the Dask community is the broader PyData community.
During the next few years, Dask slowly grew into dozens of libraries across hundreds of sectors and disciplines. Users, now numbering in the thousands, gave talks at community events, conferences, and their workplaces and shared it with their colleagues. I recently found an image of me speaking about Dask at PyData DC in 2016, which you can see below (oh how working in OSS development has aged me!).
While Continuum/Anaconda was still generously paying for Dask maintenance, it remained hands-off in terms of governance and direction. This allowed Dask to evolve organically to support the PyData ecosystem and community. Along with libraries like Bokeh, Dask was somewhat unique in being supported, but still operating as a community governed and community led project.
Distributed Execution: Dask on a Cluster
For the first year of Dask’s life it was focused on single-machine parallelism. This was (and still is) a highly pragmatic space. You can fit a lot of CPU cores and memory into a single box, and not worry about networking, security, authentication, or any of a hundred other problems that arise when you switch to clusters of machines.
But inevitably, Dask was used on problems that didn’t fit on a single machine. This led us to develop a distributed-memory scheduler for Dask that supported the same API as the existing single-machine scheduler. For Dask users this was like magic. Suddenly their existing workloads on 50GB datasets could scale comfortably to 5TB (and then 50TB a bit later).
This took a few months and several iterations. The current dask-distributed scheduler is actually the fourth scheduler that we built. We were helped along the way by Min Ragan-Kelly from IPython parallel and Olivier Grisel from Scikit-Learn/Joblib. It’s key to note that, while the impetus for Dask’s distributed-memory scheduler came from the open source community, the resulting technology began to meet the needs of teams and businesses and thus paved the way for institutional and enterprise adoption of Dask.
Institutional Development of Dask
Starting around early 2018, the Dask community started to see a lot more activity not just among grassroots community users, but also among teams and institutions in academia, tech companies, and large corporations. For example:
- Geoscience consortiums like Pangeo aggregated developers from geophysical groups such as NASA, the UK Met Office, NCAR, USGS, and others to support their domain with Dask;
- Logistics firms like Blue Yonder/JDA and various financial trading companies improved stability and observability for long-term production workloads;
- Hardware companies like NVIDIA started investing in large internal teams to ensure that Dask worked well on their hardware;
- Large users like Capital One, started to architect large production systems internally, and so invested more time in developing and maintaining the open source project.
This brought an influx of development power beyond volunteer time and the few paid maintainers at Anaconda. As a result, the original developers started spending most of their time coordinating development of other engineering teams, and figuring out how to smoothly on-board and assimilate a rapidly growing set of developers.
NVIDIA in particular was excited about investing millions of dollars in Dask development to scale out their new effort RAPIDS (see keynote from NVIDIA CEO on the PyData stack), but were more comfortable doing so if they had a Dask maintainer in-house to kick-start the effort (hence my personal year-long stint at the company).
The foundations of Dask today are strong and stable. Today we see two directions of growth:
- Expansion to new verticals: We see the Python data science community integrating Dask into a broad variety of domain-specific libraries. Because Dask is so easy to work with (lightweight, pure python, scales up and down), and because it is a well-trusted, community governed, and multi-institution project, library maintainers are comfortable depending on it to serve downstream communities. Such applications include the geo sciences, biomedical imaging, GIS (geo-spatial and urban environments), and time series processing;
- Core Performance: as Dask is used on larger institutions and datasets with more advanced hardware we’re getting to the point where it makes sense to invest again in core performance. We see groups like NVIDIA and Pangeo leading this charge.
This brings us pretty much up today and I’m excited to be a part of what the future holds for both Dask and the broader PyData ecosystem. To learn more about Dask and what we’re working on at Coiled to help meet the needs of teams and the enterprise in scaling data science, analytics, and machine learning, sign up to our email list below.