Dask Heartbeat by Coiled: December 2021
December 23, 2021
The Dask community is highly distributed, with different teams working independently. This is powerful but sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a monthly publication intended to centralize and broadcast Dask news over the previous month.
Improved Ordering Diagnostics
Erik Welch added Dask diagnostic visualizations that help explain the task ordering chosen by dask.order. This is related to Erik’s work to make dask.order compute dependent tasks eagerly, which allows parent tasks to be released from memory sooner and improves performance. Learn more about the visualizations in this detailed example and look out for a blog post by Erik!
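For readers who have not looked at dask.order before, here is a minimal sketch of what it produces. The tiny graph and the helper functions `inc` and `add` are made up for illustration and are not taken from Erik's example. The only Dask call used is `dask.order.order`, which maps each task key to an integer priority (lower runs first).

```python
from dask.order import order


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Hypothetical graph: two independent chains that are joined by a final task.
dsk = {
    "a1": (inc, 1),
    "a2": (inc, "a1"),
    "b1": (inc, 10),
    "b2": (inc, "b1"),
    "total": (add, "a2", "b2"),
}

# order() returns a dict of {key: priority}; every task is ranked after
# the tasks it depends on.
priorities = order(dsk)

for key in sorted(priorities, key=priorities.get):
    print(priorities[key], key)
```

The exact interleaving of the two chains is up to the ordering heuristics, which is the part the new visualizations are meant to help inspect.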
Cluster Memory Visualization
More diagnostics! Guido Imperiale has been working on the Active Memory Manager for Dask Distributed for the past few months. In line with this work, Guido added a new visualization called MemorySampler to compare the cluster-wide memory usage over time. Learn how to use it in the documentation.
High Level Graph Optimizations
There’s an ongoing long-term effort to add High Level Graph optimizations to Dask APIs. As part of this work:
- Ian Rose and James Bourbeau implemented these optimizations for Dask’s Delayed API;
- Gabe Joseph updated some Dask DataFrame join methods to use the high-level map_partitions function.
Reduced Memory Use while Merging Subframes
In another memory-management improvement, Gabe Joseph changed how Dask DataFrame “subframes” are merged internally before deserialization. Subframes are now merged without copying whenever possible, i.e., whenever they share a buffer, which reduces overall memory consumption.
Earlier this month, the Dask community held an Accessibility Sprint to improve the alt text for images in the Dask documentation. This was organized by Sarah Johnson and Ian Rose in collaboration with Quansight’s Isabela Presedo-Floyd and Tony Fast, and the alt text for nearly 30 images was improved during the sprint!
Improved Support for Parquet
- read_parquet() now handles Spark-generated output better when using the fastparquet engine, and has an improved implementation of the require_extension argument.
- to_parquet() now has a name_function argument which can be used to give custom filenames to output files.
Over the month of November, the following Dask packages were released:
- Dask and Distributed versions 2021.11.0, 2021.11.1, and 2021.11.2 (Note: Version 2021.11.0 included a critical regression for UCX comms, which was reverted in the 2021.11.1 release);
- Dask-ML versions 2021.11.16 and 2021.11.30;
- Dask Gateway version 0.9.0.
Dask Monthly Community Meeting
Some highlights from the December Dask community meeting:
- The RAPIDS team will help improve the dask-sql project and add integrations with cuML and XGBoost.
- Coiled has a new “Community team” that will focus on supporting the Dask community. Look out for an official announcement soon!
Full meeting notes are available here.
You’re All Caught Up On Dask
That’s it. Thanks for reading.