Dask JupyterLab Workflow

Matt Powers, September 14, 2021


This post explains how to set up your machine with a Dask development workflow that’s identical to the one Dask’s original author uses.

It’s hard to understand how Dask is executing computations when running commands on a cluster.  It’s even harder when the cluster visualization tools are in a different window.

The Dask JupyterLab extension gives you cluster insights in the same screen where you run your notebook code.  The following screenshot shows this setup with notebook code on the left and the cluster metrics on the right.

With this setup, you can run commands and see how they’re executed on the cluster in real-time.

This blog post is organized into two sections:

  1. Running commands on localhost with the conda base environment
  2. Running commands on a cloud cluster with a separate conda environment

Simple setup

Run the following Terminal command to install some packages in your base conda environment.

conda install -c conda-forge dask jupyterlab dask-labextension s3fs

Run the jupyter lab command in your Terminal, and it should automatically open a browser window.  This is what your Terminal will look like after running jupyter lab.

Create a notebook that’ll run some code on your local machine.  You should start by running these three commands in your notebook.

import dask.dataframe as dd
from dask.distributed import Client

# With no arguments, Client() starts a local cluster on this machine
client = Client()

Now click on the Dask icon in the leftmost sidebar, which will open the dask-labextension options. 

The icons are grey, which means your notebook isn’t connected with the cluster yet.

Click on the magnifying glass to connect your notebook with the cluster.  The icons turn orange when they’re connected with the cluster.

Click on “Task Stream” and “Progress” and then drag the windows so they’re all displayed on your screen at the same time.

Run some computations, and you’ll see the real-time localhost cluster execution statistics in the windows to the right.  You can hover over the visualizations on the right to get more information about the computation.
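If you don’t have data handy, Dask’s built-in demo dataset is an easy way to generate some work for the local cluster.  The snippet below is just one example; the groupby itself is arbitrary and only serves to populate the dashboards.

import dask.datasets

# A synthetic timeseries DataFrame that ships with Dask
ddf = dask.datasets.timeseries()

# An arbitrary aggregation -- enough work to light up the Task Stream and Progress panes
ddf.groupby("name").x.mean().compute()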

You can study the charts and inspect the computations Dask executes.

Multi-environment setup

You can also create multiple conda environments, which is better for users who need to manage different sets of dependencies for different projects.

You can create a conda environment from the envs/standard-coiled.yml file in the coiled-resources repo.  Clone the coiled-resources repo and run the following command in the project root directory.

conda env create -f envs/standard-coiled.yml

This will create a standard-coiled environment that you can see when running conda env list in your Terminal.

Notice that the YAML file contains the following dependencies that allow for this nice JupyterLab workflow:

  • ipykernel
  • nb_conda
  • jupyterlab
  • dask-labextension

Run jupyter lab in your Terminal to open the browser editor.  You can now create notebooks that use the newly created standard-coiled environment.

Create a notebook and connect it with the standard-coiled conda environment.

Run a few commands to provision a Coiled cluster and then attach the notebook to the cluster.  You need to get your machine set up with Coiled before you can run this code.
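The exact provisioning code isn’t reproduced here, but a typical sketch looks like the following.  The cluster name and worker count are purely illustrative; coiled.Cluster and dask.distributed.Client are the actual entry points.

import coiled
from dask.distributed import Client

# Provision a cluster on Coiled (name and size are illustrative)
cluster = coiled.Cluster(name="demo-cluster", n_workers=10)

# Attach this notebook to the remote cluster
client = Client(cluster)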

Click the Dask icon in the left sidebar and then click the magnifying glass to automagically connect your notebook with the cluster.  You need to run the first three commands to create a client in the active notebook before clicking the magnifying glass.

Click “Task Stream” and “Progress” in the Dask sidebar and drag the windows so they’re arranged as follows:

Click on the Dask icon again to hide the sidebar, so it doesn’t take up screen space anymore.

Run a query, and the Task Stream and Progress will show you exactly how the computation is being executed on the cluster.
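As an example, an aggregation over a Parquet dataset in S3 will keep both panes busy.  The S3 path and column names below are placeholders; substitute any dataset your cluster can reach.

import dask.dataframe as dd

# Placeholder path and columns -- point this at real Parquet data in your bucket
ddf = dd.read_parquet("s3://your-bucket/your-dataset/")

# Watch the Task Stream and Progress panes while this aggregation runs
ddf.groupby("id").value.mean().compute()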

Comparison with Spark

Spark has some rudimentary cluster visualization tools, but nothing like what JupyterLab’s Dask extension provides.

Python has a rich data science ecosystem with lots of tools.  The Dask JupyterLab extension was made possible because of the extensible nature of JupyterLab.  Building this from scratch would have been prohibitively time-consuming.

Dask’s integration with the PyData ecosystem is a major advantage.  There are tons of libraries/tools that play nicely with Dask out of the box.

Conclusion

It’s easy to set up your local machine to run computations and visualize cluster execution at the same time.

This workflow lets you see how Dask processes data under the hood.  It’s a great way to level-up your Dask skills.

This post only shows 2 of the 20+ possible cluster visualization tools.  Future posts will do deep dives into interpreting the various visualizations.

Thanks for reading. If you’re interested in trying out Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today.
