Dask Heartbeat by Coiled
November 30, 2020
The Dask community is highly distributed, with different teams working independently. This is powerful, but it sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a bi-weekly publication that collects and broadcasts Dask news from the previous two weeks.
COVID-19 Research with Dask
Dask-Cloudprovider gets Azure support
Previously Dask-Cloudprovider offered AzureML support, which was somewhat slow and involved a lot of machinery. The newer raw Azure implementation is speedier and involves less infrastructure to manage. Credit to Jacob Tomlinson (NVIDIA) for this work.
RAPIDS Benchmarks accelerated with new A100 hardware
Anaconda, birthplace of Dask, is hiring
(Coiled is too: https://coiled.io/careers/)
Dask’s YouTube channel hits 1000 subscribers
There is educational content, including many short videos on features that you probably didn’t know existed, on Dask’s YouTube channel.
Spanish Language BlazingSQL Webinar
Micro-optimizing the Distributed Scheduler
John Kirkham (NVIDIA) has started profiling and micro-optimizing the scheduler in an effort to improve task throughput. So far he has picked up some low-hanging fruit around hashable objects.
Now that that is done, he is starting to move towards Cythonization, ideally without leaving pure Python. An early proof-of-concept PR is here:
New minimum Pandas/NumPy versions
There are open PRs for bumping Dask’s minimum supported version of Pandas to 0.25 and NumPy to 1.15.
Credit to Julia Signell (Saturn Cloud) for this work.
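If you want to know whether an environment would still be supported after these bumps, a quick check along the following lines may help. This is only a sketch: the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are made up for illustration and are not part of Dask, and the comparison looks only at the leading numeric parts of each version, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata

# Minimums proposed in the open PRs mentioned above.
PROPOSED_MINIMUMS = {"pandas": "0.25", "numpy": "1.15"}


def version_tuple(version, width=3):
    """Turn '1.15.4' or '0.25.0rc1' into a tuple of integers, padded to `width`."""
    parts = []
    for piece in version.split(".")[:width]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at the first non-digit, e.g. the 'rc1' in '0rc1'
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=PROPOSED_MINIMUMS):
    """Return {package: (installed_version_or_None, ok)} for this environment."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
        else:
            report[package] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_environment().items():
        status = "ok" if ok else "below proposed minimum or not installed"
        print(package, installed, status)
```

For anything stricter, such as correct ordering of pre-releases, a full version parser would be a better fit than this tuple comparison.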
Fast Graph Submission (without fusion)
Submission of large graphs from the client to the scheduler can result in uncomfortable delays after you’ve called `compute()` but before anything shows up on the dashboard. This should now be resolved for DataFrames and arrays if you set the following configuration:
Beware, though: this may have other performance implications for your code, which we’re working on. Thanks to Mads Kristensen (NVIDIA) as well as many others.
Visualizing Large Workloads
The Dask community is seeking feedback on how to best visualize cluster activity for large workloads with many tasks and workers. Do you have thoughts on data visualization? Feel free to chime in at https://github.com/dask/distributed/issues/4260.
That’s all. Thanks for reading.
If you’re interested in taking Coiled Cloud for a spin, you can do so for free today.