How to learn Dask in 2021
• November 16, 2020
Dask is the open-source software tool for scalable data science and machine learning. Dask was developed to scale computational libraries like Numpy, Pandas, and Scikit-Learn and work with the existing Python ecosystem to scale to multi-core machines and distributed clusters. As Matthew Rocklin, a creator of Dask and CEO of Coiled has written, “Dask’s story is the story of the scientific Python ecosystem as a whole, both in terms of the community of developers and how Dask interoperates with packages such as NumPy, Pandas, Scikit-Learn, XGBoost, and Xarray, among others.”
As is the guiding philosophy behind OSS, Dask is a community-driven project, and the content in this post follows suit. The open-source curriculum below pulls from diverse resources, experts, and platforms to guide you in learning Dask in 2021 via the most straightforward path possible. Enjoy!
We recommend that learners are comfortable with Python, NumPy, and pandas before they start with Dask. It will also be helpful to have machine learning experience for the brief Dask-ML portion of the curriculum.
We estimate that the speediest learners can complete this content in five hours, though we recommend setting aside an extra hour or two to fully explore the coding exercises.
Introduction to Dask (7 min)
In this short YouTube video, Matt Rocklin provides a general overview of the Dask project.
Why Dask? (12 min)
This page on dask.org gives high-level motivation on why people choose to adopt Dask.
Dask Use Cases (2 min)
This short YouTube video demonstrates a few prominent Dask use cases.
Dask YouTube + SciPy 2020 Tutorial (Part 1)
Here’s where you’ll dive deeper into Dask’s most prominent features and how to apply them. We combine short videos from the Dask YouTube channel and content from a SciPy 2020 Tutorial to create a clear learning path.
Note: SciPy 2020 was the 19th annual Scientific Computing with Python conference. The Parallel and Distributed Computing in Python with Dask tutorial was hosted by James Bourbeau (Software Engineer at Coiled!), Mike McCarty, and Dharhas Pothina. The materials from that tutorial are now available for free online.
Before you begin, remember that you are welcome (and encouraged) to take breaks as you work through the tutorial. While SciPy attendees had to learn this content in one four-hour chunk due to conference time constraints, you have the luxury of on-demand learning.
Set up your environment or open the Binder link (10 min)
To code along in the SciPy tutorial, you’ll need to either set up your environment locally or use the online environment they provide via Binder.
To set up your environment locally, follow the instructions in the Prepare section in the tutorial’s GitHub repository.
To use their Binder environment, click the launch binder logo at the top of the Dask Tutorial section.
Tutorial overview (10 min)
- Open 00_overview.ipynb in your chosen environment
- Watch from 0:00:00 to 0:09:55
Dask Delayed (48 min)
- Watch Matt’s Dask Delayed introduction video
- Open 01_dask.delayed.ipynb in your chosen environment
- Watch and code along from 0:09:56 to 0:52:21
Dask Array (68 min)
- Watch Matt’s Dask Array introduction video
- Open 03_array.ipynb in your chosen environment
- Watch and code along from 0:52:22 to 1:57:21
Dask DataFrame (49 min)
- Watch Matt’s Dask DataFrame introduction video
- Open 04_dataframe.ipynb in your chosen environment
- Watch and code along from 1:57:22 to 2:39:26
Best Practices (15 min)
As you’ll experience in the first half of the SciPy Tutorial, it’s easy to get started with Dask’s APIs. To transfer that knowledge into real-world scenarios you’ll encounter, you need a game plan. This page contains suggestions for best practices and includes solutions to common problems (primarily focusing on the Array, DataFrame, and Delayed APIs).
Dask YouTube + SciPy 2020 Tutorial (Part 2)
Dask Distributed (29 min)
- Open 05_distributed.ipynb in your chosen environment
- Watch and code along from 2:39:27 to 3:08:10
Dask Distributed advanced (30 min)
- Open 06_distributed_advanced.ipynb in your chosen environment
- Watch and code along from 3:09:01 to 3:39:11
Dask-ML (23 min)
The machine learning section of the SciPy 2020 Tutorial was cut short due to time constraints. Fortunately, the Coiled team covered similar material in a Data Umbrella & PyLadies NYC Tutorial in October 2020. To set up your environment locally or open the provided Binder link (recommended), visit the Getting set up computationally section of this GitHub repository.
- Open 04-scalable-machine-learning.ipynb in your chosen environment
- Watch and code along from 46:48 to 1:09:54 (note: the name of the notebook in this video is different than the name of the notebook above but the content is the same)
That’s our open-source curriculum! If you’d like to explore more applications of Dask, we encourage you to check out examples.dask.org.
Plus, Coiled offers dedicated Dask training sessions. Courses are taught on-site or delivered virtually in small groups of 10-20 people. We mix short lecturettes with hands-on Jupyter notebooks with instructors guiding progress and providing direct help. Visit our training page to learn more and view our upcoming sessions.
Don’t hesitate to reach out to the Coiled team with any questions or feedback you may have now that you’ve worked through the curriculum. In the meantime, we’d love for you to take Coiled Cloud for a spin and learn more about our mission to provide scalable analytics for Python and Dask. Try it out for free when you click below.