Architecting Large-Scale Data Systems with Dask

This class explores the best ways to leverage Dask within enterprise data architectures.

Most enterprise workloads combine activities core to Dask (e.g., data manipulation and machine learning), activities external to Dask (e.g., SQL-based reporting and data extraction), and activities orthogonal to Dask but still critical to the success of the overall system (e.g., data storage). Moreover, staff and skill sets often differ across these areas.

We explore options and patterns for getting the best out of both Dask and non-Dask elements of the system.

Learn Ideas and Gain Skills

  • Where Dask excels, and where additional tools are needed
  • Integrating Dask with JVM-based data processing systems
  • Best practices to allow your data team to excel with the skills they know best

Tags: Dask, Data Systems, Python


Prerequisites

  • Python, basic level
  • JVM/Hadoop/Spark/Kafka ecosystem, basic level
  • Large-scale data storage patterns, basic level
  • Understanding of ML concepts and workflow, basic level

Topics

Introduction

  • About Dask and Coiled Computing: Making scale-out computing easier
  • Dask, the 30,000-foot view: scheduler, APIs, infrastructure support components
  • Photographic negative: what’s not in Dask
  • Overview of an integration flow from data warehouse to reports or ML models
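The "30,000-foot view" above can be made concrete with a minimal sketch of the Dask execution model: high-level APIs build a task graph, and a scheduler executes it. This is an illustrative example, not course material.

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

# Nothing runs yet -- these calls only build a task graph
total = add(inc(1), inc(2))

# A scheduler (threads, processes, or a distributed cluster) executes the graph
result = total.compute()   # → 5
```

The same graph-then-schedule model underlies the higher-level DataFrame, Array, and Bag APIs.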

Integrating Data

  • Finding your data: Hive metastore and alternatives
  • Ingesting data: formats and locations
  • Options for SQL access to data
  • Distributed caching
  • Consuming streaming data

Data Processing, ETL, and Feature Engineering

  • Dask support
  • Shuffling and other data transfer
  • External Python integrations
  • Custom functions / business logic
  • Checking compatibility vs. ANSI SQL, SparkSQL, HiveQL
  • Index pros/cons
  • ML Modeling – orientation/overview
  • Implementing custom algorithms

Data Output

  • Output artifacts
  • ML Models
  • Reports for human or business system consumption
  • ETL writes into another datastore
  • Output to a streaming or message-oriented middleware system
  • Transactional/safe writes – present and future

Additional Goals, Challenges, and Opportunities

  • Orchestration
  • Resilient streaming systems
  • ML serving/scoring systems
  • GPU / heterogeneous compute integration
  • Monitoring, management, and debugging interfaces
  • End-user notebook integration
  • Q & A