This class module focuses on Dask Bag, a parallel collection that applies a functional programming model to distributed computation over unstructured or heterogeneous data.
Dask Bag is useful for the initial processing of unstructured text, large collections of heterogeneous business records that require special processing, images or diagrams, and similar data. The class covers functional style, the Bag API, and best practices.
Learn Ideas and Gain Skills
How Dask Bag applies your Python code to large data collections
Transforming, filtering, combining, aggregating and matching objects
Addressing performance concerns
Duration: half day or full day
Prerequisites: Python, basic to intermediate level
Some knowledge of functional programming is helpful but not required
Python functional constructs in the standard library
Why use a functional model for “big data”?
Dask Bag vs. local Python collections
Core Bag APIs and Operations
Ingesting data and creating Bags
Transform and project data with map
Understanding execution: Compute, Persist and Visualize
Built-in aggregations: math/stats, conditionals, counting, and sorting
Aggregate data with group and fold
Combine data with zip and join
Writing or retrieving output
General query improvement patterns
Minimizing expensive data transfer
Integration with from_delayed, to_delayed, to_dataframe
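As a small preview of the functional style named in the outline above, the following sketch uses only standard-library constructs on a local list. The sample records and the fold_by helper are hypothetical, invented for illustration; the comments point to the Dask Bag methods (filter, map, fold, foldby) that follow the same shape on distributed data.

```python
from functools import reduce
from operator import itemgetter

# Hypothetical sample records, invented for illustration only.
records = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 80},
    {"region": "east", "amount": 45},
    {"region": "west", "amount": 200},
    {"region": "north", "amount": 15},
]

# filter: keep only records with amount >= 50 (compare Bag.filter).
large = filter(lambda r: r["amount"] >= 50, records)

# map: project each record to a (region, amount) pair (compare Bag.map).
pairs = list(map(lambda r: (r["region"], r["amount"]), large))

# reduce: fold all pairs into a single total (compare Bag.fold).
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)


def fold_by(key, binop, items, initial):
    """Hypothetical helper: fold items per key in a single pass.

    This imitates the idea behind Bag.foldby on a local iterable;
    it is not Dask's implementation.
    """
    result = {}
    for item in items:
        k = key(item)
        result[k] = binop(result.get(k, initial), item)
    return result


# Per-region totals without first materializing groups.
totals_by_region = fold_by(
    itemgetter(0), lambda acc, pair: acc + pair[1], pairs, 0
)

print(pairs)
print(total)
print(totals_by_region)
```

Because each step is a pure function applied to items or to an accumulator, the same pipeline can in principle be split across partitions and recombined, which is the property the class builds on when moving from local collections to Bags.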