Scaling Out: Effective Cluster Computing with Distributed Dask

This class addresses the transition from working on a single server, or experimenting with a minimal cluster, to reliable, repeatable use of larger Dask compute clusters. We take a deep dive into the critical components of a distributed Dask cluster, how they work together, and how to configure them to maximize throughput and minimize costs.

Learn Ideas and Gain Skills

  • What components make up a distributed Dask cluster and what purposes they serve
  • How to configure cluster resources to meet your workload needs
  • How to identify problems, debug, and troubleshoot successfully

Tags: Cluster Computing, Clusters, Dask, Distributed Dask


Prerequisites

  • Python, basic level
  • Dask programming, basic level

Topics

Introduction

  • About Dask and Coiled Computing: Making scale-out computing easier
  • Simplest distributed cluster: manual setup (sketched after this list)
  • Changes in transitioning to distributed environment
    • Storage, fast universal memory access, and a single shared executable are no longer givens
  • Implications for users (developers) and administrators (IT)
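
A minimal sketch of the manual setup item above: start dask-scheduler on one machine and dask-worker tcp://<scheduler-host>:8786 on each worker machine, then connect from Python. The address below is a placeholder.

    from dask.distributed import Client

    # Attach to a scheduler started manually with the dask-scheduler CLI;
    # 10.0.0.1 is a placeholder for the scheduler host (default port 8786).
    client = Client("tcp://10.0.0.1:8786")
    print(client)   # reports the scheduler address and connected workers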

Distributed Dask: Cast of Characters

  • Client, Scheduler, Nanny, Worker (roles illustrated in the sketch after this list)
  • Where these services are located, their relationships and roles
  • Supporting Players: cluster resource manager (e.g., Kubernetes, Coiled Cloud, YARN)
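
A minimal sketch of these roles on one machine, using LocalCluster purely for illustration (it starts a scheduler plus nanny-supervised worker processes):

    from dask.distributed import LocalCluster, Client

    # In production the scheduler and workers usually run on separate machines;
    # LocalCluster stands in for a real cluster resource manager here.
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)              # the Client talks to the Scheduler

    info = client.scheduler_info()        # the Scheduler's view of the cluster
    print(info["address"])                # scheduler address
    print(list(info["workers"]))          # one entry per Worker process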

Basic Operation of Dask Clusters

  • User perspective
    • Creating clusters with helper tools: Dask Cloud Provider, Coiled Cloud, etc.
    • Cluster API
    • Sizing your cluster
    • Scaling your cluster: manual and automatic (see the sketch after this list)
  • Admin perspective
    • CLI: dask-scheduler and dask-worker
    • Managing the worker environment
    • Additional admin concerns (security, tagging, and costs)
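
A sketch of the Cluster API and of manual vs. automatic scaling, shown with LocalCluster for simplicity; deployment-specific cluster classes (cloud provider, Coiled Cloud, etc.) expose the same methods:

    from dask.distributed import LocalCluster, Client

    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)

    cluster.scale(4)                      # manual scaling: request 4 workers
    cluster.adapt(minimum=2, maximum=8)   # automatic scaling within bounds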

Tasks

  • Submitting tasks and directing output (sketched after this list)
  • Scheduling policy
  • Finding your tasks and data (programmatically)
  • Seeing your tasks and data: the Dask Dashboard
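
A sketch of submitting tasks and locating tasks, data, and the dashboard, assuming a local cluster created purely for illustration:

    from dask.distributed import Client

    client = Client()                        # local cluster, for illustration

    def double(x):
        return 2 * x

    future = client.submit(double, 21)       # submit a single task
    print(future.result())                   # 42, fetched back to the client

    futures = client.map(double, range(10))  # submit many tasks
    print(client.gather(futures))            # collect all results
    print(client.who_has(futures))           # which workers hold which keys
    print(client.dashboard_link)             # URL of the Dask Dashboard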

Distributed Data

  • Source data via tasks
  • Source data scatter (see the sketch after this list)
  • Storing data worker-local
  • Handling output (result) data, direct parallel write vs. gather/result
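
A sketch contrasting scattered source data with gathered results, again on a local cluster for illustration; for large outputs, having each task write its own file in parallel usually beats gathering everything back to the client:

    from dask.distributed import Client

    client = Client()                        # local cluster, for illustration

    # Scatter one large object to the workers once; tasks then reference the
    # worker-resident copy instead of re-sending the data with every call.
    data = list(range(1_000))
    [remote_data] = client.scatter([data])

    def summarize(chunk):
        return sum(chunk)

    future = client.submit(summarize, remote_data)
    print(future.result())                   # small result: fetch to the client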

Resource Usage and Resilience

  • Output spill location and resource management (configuration sketched after this list)
  • Work stealing
  • Loss of processes
  • Loss of storage on workers
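
A sketch of the worker memory-management settings that govern spilling, pausing, and restarts; the fractions are illustrative, and the spill location itself is the worker's local directory (for example, the --local-directory option of dask-worker):

    import dask
    from dask.distributed import LocalCluster, Client

    # Thresholds are fractions of the per-worker memory limit.
    dask.config.set({
        "distributed.worker.memory.target": 0.60,     # start spilling to disk
        "distributed.worker.memory.spill": 0.70,      # spill more aggressively
        "distributed.worker.memory.pause": 0.80,      # stop starting new tasks
        "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
    })

    cluster = LocalCluster(n_workers=2, memory_limit="2GiB")
    client = Client(cluster)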

Best Practices, Debugging

  • Dashboard information pages
  • Additional GUIs (e.g., profiler)
  • Review of best practices
  • Remote debugging
  • client.run
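
A sketch of client.run, which executes a function directly on every worker (outside the normal task system) and is handy for remote inspection while debugging:

    import os
    from dask.distributed import Client

    client = Client()                    # assumes an existing cluster

    def worker_status(dask_worker):
        # client.run injects the Worker object when a parameter is named dask_worker
        return {"pid": os.getpid(),
                "address": dask_worker.address,
                "keys_in_memory": len(dask_worker.data)}

    print(client.run(worker_status))     # {worker_address: status_dict, ...}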

Use Case Example: Orchestrating Batch ML Scoring

  • Source data on disk
  • ML model
  • Options for inference, pros/cons
  • Supplying dependencies via code or container image
  • Basic workflow (sketched after this list)
  • Improvements and optimizations (e.g., batch size)
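
A toy sketch of the basic workflow, with a placeholder scoring function standing in for real model inference; in practice the model is loaded on the workers and results are written out in parallel rather than returned to the client:

    from dask.distributed import Client, as_completed

    client = Client()       # assumes workers already have the model's dependencies

    def score_batch(batch):
        # Placeholder "model": real code would apply a loaded ML model here.
        return [x * 0.5 for x in batch]

    # Placeholder source data split into batches; batch size is a key tuning knob.
    batches = [list(range(i, i + 100)) for i in range(0, 1_000, 100)]

    futures = client.map(score_batch, batches)
    for future in as_completed(futures):     # handle each batch as it finishes
        scores = future.result()             # real code would write these in parallel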

Q & A, Discussion