Processing Unstructured Data and Dask Bag

This class module focuses on Dask Bag, a parallel collection that applies a functional programming style to distributed computation over unstructured or heterogeneous data.

Dask Bag is useful for the initial processing of unstructured text, large collections of heterogeneous business records that require special handling, and images or diagrams. The class covers functional style, the Bag API, and best practices.

Learn Ideas and Gain Skills

  • How Dask Bag applies your Python code to large data collections
  • Transforming, filtering, combining, aggregating and matching objects
  • Addressing performance concerns

Prerequisites

  • Python, basic to intermediate level
  • Some knowledge of functional programming is helpful but not required

Topics

Introduction

  • Python functional constructs in the standard library
  • Why use a functional model for “big data”?
  • Dask Bag vs. local Python collections
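
As a small illustration of the standard-library functional constructs mentioned above, the following sketch applies map, filter, and reduce to a local list. The records and field names are hypothetical and exist only for this example.

```python
from functools import reduce
from operator import add, itemgetter

# Hypothetical sample records, used only for illustration.
records = [
    {"name": "alice", "amount": 120},
    {"name": "bob", "amount": 45},
    {"name": "alice", "amount": 30},
]

# map: project each record down to a single field.
amounts = list(map(itemgetter("amount"), records))

# filter: keep only the records that satisfy a predicate.
large = list(filter(lambda r: r["amount"] >= 50, records))

# reduce: fold the sequence of amounts into a single total.
total = reduce(add, amounts, 0)

print(amounts, large, total)
```

Dask Bag exposes operations with the same shape (map, filter, fold), but runs them over partitions of a collection that may not fit in memory on one machine.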

Core Bag APIs and Operations

  • Ingesting data and creating Bags
  • Understanding partitions
  • Transform and project data with map
  • Understanding execution: compute, persist, and visualize
  • Filter data
  • Built-in aggregations: math/stats, conditionals, counting, and sorting
  • Aggregate data with groupby and fold
  • Combine data with zip and join
  • Writing or retrieving output
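
Several of the operations listed above can be sketched in a few lines. This is a minimal example, assuming dask is installed; the records are hypothetical in-memory data, and the synchronous scheduler is chosen here only to keep the example simple and single-process.

```python
import dask.bag as db

# Hypothetical records; real workloads would more likely start from files,
# for example with db.read_text.
records = [
    {"name": "alice", "amount": 120},
    {"name": "bob", "amount": 45},
    {"name": "alice", "amount": 30},
    {"name": "carol", "amount": 75},
]

# Create a Bag split into two partitions.
bag = db.from_sequence(records, npartitions=2)

# filter and map are lazy: they only build a task graph.
large_names = bag.filter(lambda r: r["amount"] >= 50).map(lambda r: r["name"])

# compute() executes the graph and returns ordinary Python objects.
result = large_names.compute(scheduler="synchronous")

# Built-in aggregations: sum of one field, and counts per distinct value.
total = bag.pluck("amount").sum().compute(scheduler="synchronous")
counts = dict(bag.pluck("name").frequencies().compute(scheduler="synchronous"))

print(bag.npartitions, result, total, counts)
```

Nothing is evaluated until compute() is called, which is why the execution model appears as its own topic in the list.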

Best Practices

  • General query improvement patterns
  • Minimizing expensive data transfer
  • Integration with from_delayed, to_delayed, to_dataframe
  • Partition sizing
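
One common way to reduce expensive data transfer is to aggregate per key with foldby rather than groupby, since groupby has to shuffle whole records between partitions while foldby reduces within each partition first and then combines the partial results. The sketch below assumes dask is installed and uses hypothetical records; it is an illustration of the idea, not a tuned query.

```python
import dask.bag as db

# Hypothetical records, used only for illustration.
records = [
    {"name": "alice", "amount": 120},
    {"name": "bob", "amount": 45},
    {"name": "alice", "amount": 30},
    {"name": "carol", "amount": 75},
]

bag = db.from_sequence(records, npartitions=2)

# Sum amounts per name. binop folds one record into a running total inside
# a partition; combine merges the per-partition totals for the same key.
totals = bag.foldby(
    key=lambda r: r["name"],
    binop=lambda acc, r: acc + r["amount"],
    initial=0,
    combine=lambda a, b: a + b,
    combine_initial=0,
)

# The result is a Bag of (key, value) pairs.
totals_by_name = dict(totals.compute(scheduler="synchronous"))

print(totals_by_name)
```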