Dask Tutorial: How to Learn Dask in 2021

James Bourbeau November 16, 2020

, , , , , ,

Dask is the open-source software tool for scalable data science and machine learning. Dask was developed to scale computational libraries like Numpy, Pandas, and Scikit-Learn and work with the existing Python ecosystem to scale to multi-core machines and distributed clusters. As Matthew Rocklin, a creator of Dask and CEO of Coiled has written, “Dask’s story is the story of the scientific Python ecosystem as a whole, both in terms of the community of developers and how Dask interoperates with packages such as NumPy, Pandas, Scikit-Learn, XGBoost, and Xarray, among others.” 

As is the guiding philosophy behind OSS, Dask is a community-driven project, and the content in this Dask tutorial follows suit. The open-source curriculum below pulls from diverse resources, experts, and platforms to guide you in learning Dask in 2021 via the most straightforward path possible. Enjoy!

dask tutorial

Prerequisites

We recommend that learners are comfortable with Python, NumPy, and pandas before they start with Dask. It will also be helpful to have machine learning experience for the brief Dask-ML portion of the curriculum.

Time commitment

We estimate that the speediest learners can complete this content in five hours, though we recommend setting aside an extra hour or two to fully explore the coding exercises.

Introduction to Dask (7 min)

In this short YouTube video, Matt Rocklin provides a general overview of the Dask project.

Why Dask? (12 min)

This page on dask.org gives high-level motivation on why people choose to adopt Dask.

Read here.

Dask Use Cases (2 min)

This short YouTube video demonstrates a few prominent Dask use cases.

YouTube + SciPy 2020 Dask Tutorial (Part 1)

Here’s where you’ll dive deeper into Dask’s most prominent features and how to apply them. We combine short videos from the Dask YouTube channel and content from a SciPy 2020 Tutorial to create a clear learning path.

Note: SciPy 2020 was the 19th annual Scientific Computing with Python conference. The Parallel and Distributed Computing in Python with Dask tutorial was hosted by James Bourbeau (Software Engineer at Coiled!), Mike McCarty, and Dharhas Pothina​. The materials from that tutorial are now available for free online.

Before you begin, remember that you are welcome (and encouraged) to take breaks as you work through the tutorial. While SciPy attendees had to learn this content in one four-hour chunk due to conference time constraints, you have the luxury of on-demand learning.

Set up your environment or open the Binder link (10 min)

To code along in the SciPy tutorial, you’ll need to either set up your environment locally or use the online environment they provide via Binder.

To set up your environment locally, follow the instructions in the Prepare section in the tutorial’s GitHub repository.

To use their Binder environment, click the launch binder logo at the top of the Dask Tutorial section.

Tutorial overview (10 min)

  • Open 00_overview.ipynb in your chosen environment
  • Watch from 0:00:00 to 0:09:55

Dask Delayed (48 min)

Dask Array (68 min)

Dask DataFrame (49 min)

Best Practices (15 min)

As you’ll experience in the first half of the SciPy Tutorial, it’s easy to get started with Dask’s APIs. To transfer that knowledge into real-world scenarios you’ll encounter, you need a game plan. This page contains suggestions for best practices and includes solutions to common problems (primarily focusing on the Array, DataFrame, and Delayed APIs).

Read here.

YouTube + SciPy 2020 Dask Tutorial (Part 2)

Dask Distributed (29 min)

  • Open 05_distributed.ipynb in your chosen environment
  • Watch and code along from 2:39:27 to 3:08:10

Dask Distributed advanced (30 min)

  • Open 06_distributed_advanced.ipynb in your chosen environment
  • Watch and code along from 3:09:01 to 3:39:11

Dask-ML (23 min)

The machine learning section of the SciPy 2020 Tutorial was cut short due to time constraints. Fortunately, the Coiled team covered similar material in a Data Umbrella & PyLadies NYC Tutorial in October 2020. To set up your environment locally or open the provided Binder link (recommended), visit the Getting set up computationally section of this GitHub repository.

  • Open 04-scalable-machine-learning.ipynb in your chosen environment
  • Watch and code along from 46:48 to 1:09:54 (note: the name of the notebook in this video is different than the name of the notebook above but the content is the same)

That’s our open-source curriculum! If you’d like to explore more applications of Dask, we encourage you to check out examples.dask.org.

Plus, Coiled offers dedicated Dask training sessions. Courses are taught on-site or delivered virtually in small groups of 10-20 people. We mix short lecturettes with hands-on Jupyter notebooks with instructors guiding progress and providing direct help. Visit our training page to learn more and view our upcoming sessions.

Don’t hesitate to reach out to the Coiled team with any questions or feedback you may have now that you’ve worked through the Dask tutorial. In the meantime, we’d love for you to take Coiled Cloud for a spin and learn more about our mission to provide scalable analytics for Python and Dask. Try it out for free when you click below.

Frequently Asked Questions


PyData libraries like NumPy, pandas, and scikit-learn are widely popular for data science today, both with individual data professionals and enterprises. These libraries are great, however, they are limited to single-core usage and do not scale beyond the available RAM. This is where you would need Dask. It is Python-native, builds on top of these familiar PyData libraries, and allows you to leverage the distributed and parallel computing potential of your computer.

Dask DataFrame is a high-level Dask API that extends the pandas API for parallel and distributed computing in Dask. It allows you to work with larger-than-memory data on your local machine, or even with TB-scale data on distributed clusters on the cloud, all while following a syntax similar to pandas.

A pandas DataFrame is a data structure that can store multidimensional, labeled, arrays along with some metadata. A Dask DataFrame consists of multiple pandas DataFrames, called partitions. Hence, a Dask DataFrame computation can be understood as performing relevant computations on all of its partitions (pandas DataFrames), in parallel fashion.

There are three main ways to implement parallel processing with Dask:
  1. Using the high-level Dask APIs: Dask provides parallel alternatives for common PyData libraries like NumPy, pandas, and scikit-learn, that have familiar syntax you can use directly.
  2. Using the low-level Dask APIs: Dask also allows you to write custom code, both parallel and distributed, with low-level APIs that access the Dask Engine.
  3. Using the tools built on Dask: There are numerous libraries that are built using Dask like Prefect, PyTorch, RAPIDS, and more, that you can use for specialized use-cases.

Dask Bag can be used to work with large-scale unstructured data formats like JSON. You can use the Bag API to perform standard operations like map, filter, and fold, and then convert the data into other structures like a Dask DataFrame object for dedicated analysis.

Dask Delayed is a low-level Dask API that allows you to write custom parallel Python code. You can apply parallelizable operations to any general Python code and access the Dask Engine directly for your computations.


Ready to get started?

Create your first cluster in minutes.