Tabular Data Processing with Dask Dataframe

This module focuses on Dask Dataframe, a simple model for working with tabular data that may be too large to fit in memory or to process on a single machine.

This class is recommended for engineers and data scientists who typically work with tabular (row/column) data and related tools, such as SQL databases.

Learn Ideas and Gain Skills

  • How Dask Dataframe extends Pandas to larger datasets
  • How to select, filter, transform, and join data
  • How partitioning and indexes affect performance


Prerequisites

  • Python, basic level
  • Pandas and/or SQL, basic level

Topics

Introduction

  • Python and Pandas for tabular data
  • Limitations of Pandas
  • Dask Dataframe model
  • Key similarities/differences compared to Pandas

Core Dataframe APIs and Operations

  • Reading data
  • Selecting records and columns
  • Using indexing to select records
  • Filtering datasets
  • Combining datasets (joins, unions)
  • Custom functions
  • Aggregations and sorting (groupby, sort)
  • Custom aggregation
  • Window (rolling) operations

Data Access

  • Read CSV and Parquet data and best practices for performant reading
  • Read JSON and text data with Dask Bag
  • Read custom formats with Dask Delayed
  • Writing data efficiently for future access

Best Practices

  • General query improvement patterns
  • Minimizing expensive data transfer
  • Launching work and preserving results with “persist”
  • Partition sizing