What is Dask?
• December 10, 2020
Glad you asked! Dask is a free and open-source library for parallel computing in Python. Dask helps you scale your data science and machine learning workflows. Dask makes it easy to work with Numpy, pandas, and Scikit-Learn, but that’s just the beginning. Dask is a framework to build distributed applications that has since been used with dozens of other systems like XGBoost, PyTorch, Prefect, Airflow, RAPIDS, and more. It’s a full distributed computing toolbox that fits comfortably in your hand.
If you have larger-than-memory data, you can use Dask to scale up your workflow to leverage all the cores of your local workstation, or even scale out to the cloud.
This article will discuss:
- How Dask can help parallelize your data science computations,
- Behind-the-scenes workings of Dask with schedulers and workers, and
- Dask’s diagnostics dashboard and resources for scaling to the cloud.
A prominent feature of Dask is its familiar API. As Matthew Rocklin, the creator of Dask wrote in A Brief History of Dask:
“One pain point we heard time and time again was that people worked with data that fit comfortably on disk but that was too big for RAM, and accelerating NumPy was a common feature request to Continuum/Anaconda at the time. To this end, the purpose of Dask was originally to parallelize NumPy so that it could harness one full workstation computer, which was common in finance shops at the time. There were two technical goals, and a social goal:
- Technical: Harness the power of all of the cores on the laptop/workstation in parallel;
- Technical: Support larger-than-memory computation, allowing datasets that fit on disk, but not in RAM;
- Social: Invent nothing. We wanted to be as familiar as possible to what users already knew in the PyData stack.”
Dask DataFrames for Parallel pandas
Let’s see an example using the TLC Yellow Taxi Trips dataset.
The pandas code to read the data for January-2019 and compute the average tip amount as a function of passenger count is given below.
This is fairly straightforward, right? Now, let’s try to read the entire Taxi Trips dataset and perform the same operation. You will notice the complete 10+ GB dataset doesn’t fit in a typical laptop RAM storage. This is where Dask shines through: you can use Dask DataFrame to parallelize this operation with the following code.
Note the Dask version is almost identical to the previous pandas code, except where you need to call compute() on the Dask DataFrame object to get the output of your computation. This is required because Dask DataFrames are “lazy” in nature. It holds all the required functions and relationships to compute the result and performs the computation only when necessary.
Dask Arrays and Dask-ML
Similar to Dask DataFrames for parallelizing pandas code, you can use Dask Arrays for executing NumPy code in parallel.
And, Dask-ML for distributed machine learning with Scikit-Learn.
Under the Hood – Collections, Schedulers, and Workers
Dask has two main components, as shown in the diagrams:
- Collections, which set up the steps to do a computation in parallel, and
- Schedulers, which manage the actual computation work.
Dask collections provide the API that you use to write Dask code. For example, Dask DataFrame used in the previous section is a Collection.
Collections create task graphs that define how to perform the computation in parallel. Each node in the task graph is a normal Python function and edges between nodes are normal Python objects. You can view the graph by calling visualize() on any collection object.
Dask has several collections:
- High-Level collections:
- Low-Level collections:
All of these collections have the same underlying parallel computing machinery, but each provides a different set of parallel algorithms and programming style.
After Collections generate the task graphs, Dask needs to execute them on parallel hardware. The actual computation is performed on a “cluster”, which consists of
- A scheduler, that manages the flow of work and sends the tasks to the workers, and
- workers, that compute the tasks given to them by the scheduler.
Finally, at the very beginning of the process, there is a “client”. The client lives wherever you write your Python code. It is the user-facing entry point that passes on the tasks to the scheduler.
Oh, there’s more – Dashboards!
Dask provides an interactive dashboard to help you diagnose the state of your cluster. The dashboard contains multiple plots and tables with live information, which can be used to inspect the processes and optimize your code. In parallel and distributed computing, there are new costs to be aware of, so your old intuition may not work. The dashboards can help you relearn what is fast and slow and how to deal with it.
The diagnostics dashboard uses Bokeh plots, so you can interact with the plot objects using the Bokeh tools like hover, zoom, tap, pan, etc. The dashboard typically lives at your `localhost:8787`. You can also access the dashboard plots directly in JupyterLab using the dask-labextension. We recommend always using the dashboard while using Dask.
The different plots and tables include status indicators, system information, logs, etc. Let’s take a deeper look at some of them.
The cluster map, frequently called the “pew-pew map” because of the laser-like lines shooting out, is one of the coolest visualizations in the Dask dashboard. It shows the task scheduler in the center with the workers around it, and helps visualize the interactions among them. The workers change color corresponding to the task they are computing at any given moment.
The task stream is a real-time visualization of the tasks being executed by each worker. The task stream saves the state of the workers, showing a history of the tasks performed at the end of the computation. Blank spaces in the graph denote idle time and the red blocks denote communication. The plot is interactive, so you can zoom in and hover over the blocks for more details.
The progress bar shows the progress being made on each task during parallel execution. Each task has a different color associated with it. These task colors remain consistent throughout the dashboard to help you make inferences quickly.
The Cloud and Beyond
Dask lets you scale out your computation to cloud platforms like AWS, Azure, and Google Cloud Platform. It does this by connecting to services like Kubernetes and Yarn. Dask has an active community for Kubernetes and Helm integration, Dask-Yarn, and Dask Cloud provider. To learn more, you can go through the project documentation on dask.org and check out tons of awesome Dask use cases here.
If you want to focus on your analyses and delegate the DevOps part of scaling, you can check out Coiled. Coiled does exactly that, while also giving you a helpful interface, team management tools, and full control over your cloud instance.