This class explores the best ways to leverage Dask within enterprise data architectures.
Most enterprise systems combine activities core to Dask (e.g., data manipulation and machine learning), activities external to Dask (e.g., using SQL for reporting and data extraction), and activities orthogonal to Dask but still critical to the success of the overall system (e.g., data storage). Moreover, staff and skill sets often differ across these areas.
We explore options and patterns for getting the best out of both Dask and non-Dask elements of the system.
Learn Ideas and Gain Skills
Where is Dask great, and where do we need additional tools?
Integrating Dask with JVM-based data processing systems
Best practices to allow your data team to excel with the skills they know best
Duration: half-day or full day
Prerequisites
Python, basic level
JVM/Hadoop/Spark/Kafka ecosystem, basic level
Large-scale data storage patterns, basic level
Understanding of ML concepts and workflow, basic level
Outline
About Dask and Coiled Computing: Making scale-out computing easier
Dask, the 30,000-foot view: scheduler, APIs, and infrastructure support components (see the first sketch after this list)
Photographic negative: what’s not in Dask
Overview of an integration flow from data warehouse to reports or ML models (see the second sketch after this list)
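To ground the 30,000-foot view, here is a minimal sketch of the client/scheduler relationship; the worker count and array sizes are arbitrary examples.

```python
from dask.distributed import Client, LocalCluster
import dask.array as da

# Start a local scheduler plus workers; the same Client API also attaches to
# clusters on Kubernetes, YARN/Hadoop, or managed services such as Coiled
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Any Dask collection (array, dataframe, delayed) submits its task graph
# to the scheduler, which farms the work out to the workers
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())
```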
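And a minimal sketch of the warehouse-to-report flow itself; the bucket, column, and file names are hypothetical stand-ins for your own warehouse exports.

```python
import dask.dataframe as dd

# Ingest: read a curated warehouse export (Parquet on S3; path is hypothetical)
orders = dd.read_parquet("s3://warehouse-exports/orders/")

# Transform: pandas-style feature engineering, evaluated lazily
orders["order_value"] = orders["quantity"] * orders["unit_price"]

# Aggregate: a report-style rollup; the same frame could instead feed dask-ml
report = orders.groupby("segment")["order_value"].sum()

# Publish: materialize the (now small) result and hand it to reporting tools
report.compute().to_csv("segment_revenue.csv")
```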
Data Access
Finding your data: Hive metastore and alternatives
Ingesting data: formats and locations
Options for SQL access to data (see the first sketch after this list)
Consuming streaming data (see the second sketch after this list)
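One SQL option we look at is dask-sql, which compiles SQL into Dask operations. A minimal sketch, assuming a hypothetical events table in a data lake:

```python
import dask.dataframe as dd
from dask_sql import Context

# Lazily read a hypothetical Parquet dataset from the data lake
events = dd.read_parquet("s3://datalake/events/")

# Register the dataframe as a SQL table and query it
c = Context()
c.create_table("events", events)
counts = c.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")

# The result is itself a lazy Dask dataframe
print(counts.compute())
```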
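For streaming consumption, one library we discuss is streamz. The sketch below uses an in-memory source for brevity; a Kafka-backed source (e.g., Stream.from_kafka, which needs broker configuration not shown here) slots into the same pipeline.

```python
from streamz import Stream

# An in-memory source; Kafka or other middleware sources plug in the same way
source = Stream()

# Build the pipeline: normalize, filter, and keep a running error count
counts = (source
          .map(str.strip)
          .filter(lambda line: line.startswith("ERROR"))
          .accumulate(lambda total, _: total + 1, start=0))
counts.sink(print)

# Simulate a few incoming messages
for line in ["ERROR disk full", "INFO all good", "ERROR timeout"]:
    source.emit(line)   # prints 1, then 2
```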
Data Processing, ETL, and Feature Engineering
Shuffling and other data transfer (see the first sketch after this list)
External Python integrations
Custom functions / business logic (see the second sketch after this list)
Checking compatibility vs. ANSI SQL, SparkSQL, and HiveQL
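Shuffles are the most expensive data-movement pattern Dask performs, so we examine when they occur and how to contain them. A small sketch, with a toy dataframe standing in for real data:

```python
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(
    pd.DataFrame({"customer": ["a", "b", "a", "c"] * 250,
                  "spend": range(1000)}),
    npartitions=8)

# set_index repartitions by key: every row for a given customer lands in the
# same partition, which forces a cluster-wide shuffle of the data
by_customer = df.set_index("customer")

# With rows co-located and divisions known, key lookups and per-key
# aggregations avoid further data transfer
print(by_customer.loc["a"].head())
```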
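Existing business logic often drops in unchanged via map_partitions, which runs ordinary pandas code on each partition in parallel. A minimal sketch with a hypothetical discount rule:

```python
import dask.dataframe as dd
import pandas as pd

def apply_discount(part: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas code: this is where existing business logic lives
    part = part.copy()
    part["net"] = part["gross"] * 0.9  # hypothetical 10% discount rule
    return part

df = dd.from_pandas(
    pd.DataFrame({"gross": [100.0, 250.0, 80.0]}), npartitions=2)

# map_partitions applies the function to every pandas partition in parallel
result = df.map_partitions(apply_discount)
print(result.compute())
```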
ML Modeling: orientation and overview
Implementing custom algorithms (see the sketch below)
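When no off-the-shelf estimator fits, dask.delayed can express a custom map/reduce-style algorithm directly. A toy sketch computing a global mean from per-chunk partial results:

```python
import dask

@dask.delayed
def partial_stats(chunk):
    # Map step: compute a partial sum and count on one chunk
    return sum(chunk), len(chunk)

@dask.delayed
def combine(stats):
    # Reduce step: merge per-chunk results into a global mean
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
mean = combine([partial_stats(c) for c in chunks])
print(mean.compute())  # 5.0
```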
Outputs
Reports for human or business-system consumption
ETL writes into another datastore (see the sketch after this list)
Output to a streaming or message-oriented middleware system
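Writing results out is often a one-liner; this sketch shows Parquet for analytics consumers and to_sql for relational stores (the bucket path and connection string are hypothetical placeholders):

```python
import dask.dataframe as dd
import pandas as pd

report = dd.from_pandas(
    pd.DataFrame({"region": ["east", "west"], "revenue": [10.5, 12.0]}),
    npartitions=1)

# Columnar files for downstream analytics and BI tools (hypothetical bucket)
report.to_parquet("s3://reports/revenue/")

# Or load into a relational datastore; Dask's to_sql writes via SQLAlchemy
report.to_sql("revenue_report",
              uri="postgresql://user:pass@dbhost/reports",
              if_exists="replace")
```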