Scaling Python with Dask

If you’ve taken your data skills from zero to one with PyData (Pandas, Scikit-Learn, and friends), this class will help you work with data sets too large to fit in memory and distribute your workloads with Dask to accelerate your code.

During the first two sessions, you’ll learn to use Python skills you already have to query and transform data, build models, and scale your custom code. During the last two sessions, we’ll peek under the hood to learn how Dask works and “look inside” with real-time animated dashboards. We’ll cover options for deploying clusters, troubleshooting, sample use cases, and best practices.

This entire class is delivered through interactive, web-based JupyterLab notebooks that you can keep and refer to whenever you need them. You’ll also receive 10,000 Coiled Cloud credits for 3 months so you can continue your learning journey without limitations.

Learn Ideas and Gain Skills

  • Learn how Dask fits into the Python data landscape, scaling from laptops to clusters
  • Analyze and transform data and train ML models with Dask
  • Parallelize and scale your own code
  • Learn the components that make up a distributed Dask cluster
  • See how to configure resources to meet your workload needs
  • Identify problems, debug, and troubleshoot successfully

Duration: Four Half-Days

Price: $1,600

  • June 28 – July 1: 10AM – 2PM PDT (UTC-7)
  • September 20 – 23: 10AM – 2PM PDT (UTC-7)
  • October 18 – 21: 10AM – 2PM CEST (UTC+2)
  • November 8 – 11: 10AM – 2PM PST (UTC-8)

Prerequisites

  • Python, basic level
  • PyData stack (Pandas, NumPy, scikit-learn), basic level

Topics

Introduction

  • About Dask – what it is, where it came from, what problems it solves
  • Examples: one-line AutoML, Dask DataFrame, and custom parallelization

Parallelize Python Code

  • Fundamentals of parallelism in Python
  • concurrent.futures, Dask Delayed, Futures
  • Example: building a parallel DataFrame
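As a taste of the Dask Delayed pattern covered here, the minimal sketch below builds a task graph lazily and then runs it in parallel on the local scheduler (the functions are toy placeholders):

    import dask
    from dask import delayed

    @delayed
    def square(x):
        # A stand-in for any expensive, independent piece of work
        return x * x

    # Nothing runs yet: each call just adds a node to the task graph
    squares = [square(i) for i in range(10)]
    total = delayed(sum)(squares)

    # compute() executes the whole graph in parallel
    print(total.compute())  # 285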

Dask DataFrame

  • How Dask DataFrame works
  • Pandas-style analytics with Dask DataFrame
  • Applying custom computations to a DataFrame
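A minimal sketch of the Pandas-style workflow this module covers; the file pattern and column names are placeholders:

    import dask.dataframe as dd

    # One logical DataFrame backed by many Pandas partitions, read lazily
    df = dd.read_csv("data/*.csv")

    # Familiar Pandas-style analytics; nothing runs until compute()
    means = df.groupby("category")["amount"].mean()
    print(means.compute())

    # Custom per-partition logic runs on ordinary Pandas DataFrames
    cleaned = df.map_partitions(lambda part: part.dropna())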

Dask Array

  • How Dask Array relates to NumPy and its ndarray
  • Operations on Dask Array
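For example, a small sketch of a chunked, NumPy-style computation (the array sizes are arbitrary):

    import dask.array as da

    # A 10,000 x 10,000 array stored as 100 NumPy chunks of 1,000 x 1,000
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # NumPy-style operations build a lazy task graph over the chunks
    y = (x + x.T).mean(axis=0)

    print(y.compute()[:5])  # compute() returns a plain NumPy array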

Scaling Your Own Code

  • Basic scheduling
  • Example: Monte Carlo simulation
  • Visualizing task graphs
  • Dynamic scheduling
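A rough sketch of the kind of Monte Carlo example used here, estimating π with independent Dask Delayed tasks (the task counts are arbitrary):

    import random
    from dask import delayed

    @delayed
    def sample(n):
        # Count points that land inside the unit quarter-circle
        return sum(1 for _ in range(n)
                   if random.random() ** 2 + random.random() ** 2 <= 1)

    n_tasks, n_per_task = 100, 100_000
    counts = [sample(n_per_task) for _ in range(n_tasks)]
    total = delayed(sum)(counts)

    print(4 * total.compute() / (n_tasks * n_per_task))  # roughly 3.14
    # total.visualize("graph.png") draws the task graph (requires graphviz)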

Graphical User Interfaces

  • Monitoring workers, tasks, and memory
  • Principal performance and troubleshooting challenges with big data
  • Using Dask’s dashboards to understand performance
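The dashboards come for free once a distributed Client is running; a minimal local example:

    from dask.distributed import Client

    # Starting a Client with no arguments launches a local cluster
    # and the diagnostic dashboard alongside it.
    client = Client()

    # Open this URL (typically http://localhost:8787) to watch tasks,
    # worker memory, and the task stream in real time.
    print(client.dashboard_link)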

Machine Learning

  • Scikit-Learn-style featurization with Dask
  • Algorithm support and integration
  • Modeling
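As one possible flavor of this module, the sketch below uses the optional dask-ml package for scikit-learn-style preprocessing and model fitting on chunked data (the dataset sizes are arbitrary):

    from dask_ml.datasets import make_classification
    from dask_ml.preprocessing import StandardScaler
    from dask_ml.linear_model import LogisticRegression

    # A chunked dataset that never needs to fit in memory all at once
    X, y = make_classification(n_samples=100_000, n_features=20,
                               chunks=10_000)

    X = StandardScaler().fit_transform(X)  # scikit-learn-style featurization
    model = LogisticRegression()
    model.fit(X, y)
    print(model.coef_)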

Thinking About Distributed Deployment

  • About Dask and Coiled Computing: Making scale-out computing easy
  • Simplest distributed cluster: manual setup
  • Changes when transitioning to a distributed environment
  • Implications for users (developers) and admins (IT)

Distributed Dask: Cast of Characters

  • Client, Scheduler, Nanny, Worker
  • Where these services are located, their relationships and roles
  • Supporting Players: cluster resource managers (e.g., Kubernetes, Coiled Cloud, YARN)
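To make the cast concrete, a local cluster puts all of these services on one machine; a minimal sketch:

    from dask.distributed import LocalCluster, Client

    # A Scheduler plus two Workers, each supervised by a Nanny process
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)

    # The Client is how your code talks to the Scheduler
    client = Client(cluster)
    print(client.scheduler_info()["workers"].keys())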

Basic Operation of Dask Clusters

  • Creating clusters with helper tools: Dask Cloud Provider, Coiled Cloud, etc.
  • Cluster API
  • Sizing and scaling your cluster
  • Admin perspective
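A sketch of the cluster API using Coiled Cloud as the helper tool (requires a Coiled account; the worker counts are arbitrary):

    import coiled
    from dask.distributed import Client

    # Coiled provisions the cloud machines and wires up the cluster
    cluster = coiled.Cluster(n_workers=10)
    client = Client(cluster)

    cluster.scale(20)                      # resize to 20 workers
    cluster.adapt(minimum=2, maximum=40)   # or let Dask autoscale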

Tasks

  • Submitting tasks and directing output
  • Scheduling policy
  • Finding your tasks and data (programmatically)
  • Seeing your tasks and data: the Dask Dashboard
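A minimal sketch of submitting tasks with the futures interface and finding where their results live:

    from dask.distributed import Client

    client = Client()  # local cluster for illustration

    def inc(x):
        return x + 1

    futures = client.map(inc, range(100))   # one task per input
    total = client.submit(sum, futures)     # futures can feed other tasks

    print(total.result())                   # 5050
    print(client.who_has(futures[:3]))      # which workers hold which results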

Distributed Data

  • Sourcing data via tasks
  • Sourcing data with scatter
  • Storing data worker-local
  • Handling output (result) data, direct parallel write vs. gather/result
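For example, a sketch contrasting scatter and gather (the array is a placeholder for real source data):

    import numpy as np
    from dask.distributed import Client

    client = Client()

    # scatter() ships local data to the workers once, returning futures
    data = np.arange(1_000_000)
    [remote] = client.scatter([data])

    # Tasks can then use the scattered data without re-sending it
    fut = client.submit(np.mean, remote)
    print(client.gather(fut))  # gather pulls the result back to the client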

Resource Usage and Resilience

  • Output spill location and resource management
  • Work stealing
  • Loss of processes
  • Loss of storage on workers
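Spill location and memory limits are ordinary worker settings; a local sketch (the path and limit are illustrative):

    from dask.distributed import LocalCluster, Client

    # Cap each worker's memory; data over the threshold spills to local disk
    cluster = LocalCluster(n_workers=2,
                           memory_limit="2GB",
                           local_directory="/tmp/dask-spill")
    client = Client(cluster)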

Debugging

  • Additional GUIs (e.g., profiler)
  • Remote debugging
  • client.run
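For instance, client.run() executes a function on every worker outside the task system, which is handy for inspecting worker state while debugging:

    import os
    from dask.distributed import Client

    client = Client()

    # Runs os.getpid on each worker and returns {worker_address: pid, ...}
    print(client.run(os.getpid))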

Use Case Example: Orchestrating Batch ML Scoring [optional, time permitting]

  • Source data on disk
  • ML model
  • Options for inference and their pros/cons
  • Supplying dependencies via code or container image
  • Basic workflow
  • Improvements and optimizations (e.g., batch size)
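One possible shape for this workflow, heavily simplified; the model file, input paths, and feature handling are all hypothetical:

    import joblib
    import pandas as pd
    from dask.distributed import Client

    client = Client()

    model = joblib.load("model.joblib")      # hypothetical saved model
    paths = ["data/part-000.parquet",        # hypothetical input files
             "data/part-001.parquet"]

    def score_file(path):
        part = pd.read_parquet(path)
        part["prediction"] = model.predict(part)  # assumes the model accepts these columns
        part.to_parquet(path.replace("data/", "scored/"))  # direct parallel write
        return len(part)

    futures = client.map(score_file, paths)  # one task per file
    print(sum(client.gather(futures)), "rows scored")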

Best Practices

  • Managing partitions and tasks
  • File formats
  • Caching
  • Integrating with more Python (and non-Python) tools like xgboost, plotting libraries, and GPUs
  • Q&A
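A few of these practices in code form (the file pattern and sizes are illustrative):

    import dask.dataframe as dd

    # Columnar formats like Parquet are far cheaper to read than CSV
    df = dd.read_parquet("data/*.parquet")

    # Aim for partitions of roughly 100 MB; thousands of tiny tasks add overhead
    df = df.repartition(partition_size="100MB")

    # persist() caches intermediate results in cluster memory for reuse
    df = df.persist()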