Dask Overview

This technical, hands-on class gives a brief but precise overview of deploying and programming Dask. It begins with the fundamentals of parallelism and Dask DataFrame, a model for large-scale tabular data that extends pandas. The class also explores visualizing the system's behavior with Dask’s real-time, animated dashboards.

Dask Overview is recommended as a standalone class or as the first day of a deeper, multi-day exploration of Dask.

Learn Ideas and Gain Skills

  • How Dask fits into the Python and big data landscape
  • How Dask can help you process more data faster, from a laptop up to a big cluster
  • How to get started coding with Dask
  • How to analyze data and train ML models with Dask


Prerequisites

  • Python, basic level
  • PyData stack (pandas, NumPy, scikit-learn), basic level

Topics

Introduction

  • About Dask – what it is, where it came from, what problems it solves
  • Options for setting up and deploying Dask

Parallelize Python Code

  • Fundamentals of parallelism in Python
  • concurrent.futures
  • Dask Delayed, Futures
  • Example: building a parallel DataFrame
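
As a rough illustration of the concurrent.futures topic above, the following standard-library sketch submits a few independent tasks to a thread pool and collects their results. The function word_count and the sample chunks are hypothetical and are not taken from the course materials. Dask's own futures interface is described in its documentation as extending this same pattern, so the shape of the code should carry over.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text: str) -> int:
    """Count whitespace-separated words in one chunk of text."""
    return len(text.split())


# Hypothetical input: each string stands in for one chunk of a larger dataset.
chunks = ["dask scales python", "pandas numpy", "parallel tasks run together"]

# submit() returns a Future immediately; result() blocks until that task finishes.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(word_count, chunk) for chunk in chunks]
    counts = [future.result() for future in futures]

# Combine the per-chunk results into a single answer.
total = sum(counts)
print(counts, total)
```

Each chunk is processed independently and the partial results are combined at the end, which is the same split-then-combine idea that the parallel DataFrame example builds on.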

Dask DataFrame

  • How Dask DataFrame works
  • Partitions, reading and writing data
  • pandas-style analytics with Dask DataFrame
  • Applying custom computations to a DataFrame

Dask Array

  • How Dask Array is related to the NumPy ndarray
  • Operations on Dask Array

Dask Graphical User Interfaces

  • Monitoring workers, tasks, and memory
  • Principal performance and troubleshooting challenges with big data
  • Using Dask’s dashboards to understand performance
  • Debugging and profiling user code

Machine Learning

  • scikit-learn-style featurization with Dask
  • Algorithm support and integration
  • Modeling

Best Practices

  • Managing partitions and tasks
  • File formats
  • Caching
  • Integrating with other Python (and non-Python!) tools, such as XGBoost and plotting libraries, and with GPUs
  • Q & A