Dask Heartbeat by Coiled: September 2021
• September 17, 2021
The Dask community is highly distributed with different teams working independently. This is powerful but sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a monthly publication intended to centralize and broadcast Dask news over the previous month.
Performance of Parquet with Dask
Recently, the community noticed some challenges with processing large datasets stored in the Parquet file format on Amazon S3 buckets. These include issues around crossing the S3 request rate limits, difficulty parsing large metadata files, etc. James Bourbeau, Joris Van den Bossche, Martin Durant, and Richard (Rick) Zamora, are leading the effort to reproduce and fix these issues. You can follow the conversations here.
Dask + Google Summer of Code 2021
As a part of GSoC’21, Freyam Mehta improved task graph visualizations and extended the HTML representations for Dask high-level graphs. Task graphs now show the size of each layer and can be color-coded according to layer type. Dask also has new HTML representations in Jupyter Notebooks for the Process Interface and Security classes. Check out the project outcomes and more details in this blog post.
Optimizing High-Level Expressions
Aaron Meurer and Jim Crist-Harif are exploring how to optimize high-level queries and expressions. As Matthew Rocklin mentioned in the corresponding issue, the idea here is to understand user intent and rewrite their code to improve performance with Dask.
Dask DataFrame ORC API
Richard (Rick) Zamora worked on refactoring and expanding the Dask DataFrame ORC API. Dask DataFrame now has better support for reading and writing to ORC files, including support for working with partitioned datasets.
Update on HTML Representations
In the ongoing effort to improve and add more HTML Representations to Dask, Jacob Tomlinson migrated HTML representations to use jinja2 templates. This migration makes it easier to create and maintain these representations. It also makes this tooling more accessible to projects that are downstream of Dask.
Dask Monthly Community Meeting
Some highlights from the September Dask community meeting:
- The NVIDIA team released a new package CUDA Python last month. It “provides Cython bindings and Python wrappers for the driver and runtime API for existing toolkits and libraries to simplify GPU-based accelerated processing.” ~ What is CUDA Python?
Full meeting notes are available here.
You’re All Caught Up On Dask
That’s it. Thanks for reading.
If you’re interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.