Dask Heartbeat by Coiled

Matt Rocklin November 30, 2020

Introduction

The Dask community is highly distributed, with different teams working independently. This is powerful, but it sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a bi-weekly publication intended to centralize and broadcast Dask news from the previous two weeks.

If you want something added to this list, either send an e-mail to info@coiled.io or tweet and tag @dask_dev, and we’ll try to include it.

COVID-19 Research with Dask

Dask-Cloudprovider gets Azure support

https://cloudprovider.dask.org/en/latest/azure.html

Previously Dask-Cloudprovider offered AzureML support, which was somewhat slow and involved a lot of machinery.  The newer raw Azure implementation is speedier and involves less infrastructure to manage. Credit to Jacob Tomlinson (NVIDIA) for this work.

RAPIDS Benchmarks accelerated with new A100 hardware

NVIDIA’s RAPIDS/Dask benchmarks are amazing.  They became more amazing recently with the new A100 cards. Josh Patterson (NVIDIA) outlines the growth in this blogpost.

Anaconda, birthplace of Dask, is hiring

(Coiled is too: https://coiled.io/careers/)

Dask’s YouTube channel hits 1000 subscribers

Dask’s YouTube channel hosts educational content, including many short videos on features that you probably didn’t know existed.

Spanish Language BlazingSQL Webinar 

Micro-optimizing the Distributed Scheduler

John Kirkham (NVIDIA) has started profiling and micro-optimizing the scheduler in an effort to improve task throughput. So far he has picked up some low-hanging fruit around hashable objects.

Now that that is done he is starting to move towards Cythonization, ideally without leaving pure Python.  An early proof of concept PR is here:

New minimum Pandas/NumPy versions

There are open PRs for bumping Dask’s minimum supported versions of Pandas to 0.25 and NumPy to 1.15.

Credit to Julia Signell (Saturn Cloud) for this work.
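
If you want to know whether your environment would still be supported after these bumps, a quick check along these lines may help. This is only a rough sketch: the helper names (`release_tuple`, `meets_minimum`, `check_environment`) are made up for illustration, and the comparison looks only at the leading numeric parts of a version string, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimums proposed in the open PRs mentioned above.
PROPOSED_MINIMUMS = {"pandas": "0.25", "numpy": "1.15"}


def release_tuple(v):
    """Return the leading numeric components, e.g. '1.20.0rc1' -> (1, 20, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:
            break  # stop at a suffix such as 'rc1'
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=PROPOSED_MINIMUMS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report


print(check_environment())
```

A value of `False` for either package suggests an upgrade would be needed once the PRs land.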

Fast Graph Submission (without fusion)

Submission of large graphs from the client to the scheduler can result in uncomfortable delays after you’ve called `compute()` but before anything shows up on the dashboard.  This should now be resolved for DataFrames and arrays if you set the following configuration:

import dask
dask.config.set(optimization__fuse__active=False)

Beware, though: this may have other performance implications for your code (which we’re working on). Thanks to Mads Kristensen (NVIDIA) as well as many others.
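
If you would rather not change the setting globally, one option is to scope it with a context manager so that only the computations inside the block are affected. The sketch below assumes the dotted key `optimization.fuse.active` is the equivalent of the double-underscore keyword form shown above; the array and its sizes are just a toy workload for illustration.

```python
import dask
import dask.array as da

# Temporarily disable low-level task fusion; the previous value is
# restored when the block exits.
with dask.config.set({"optimization.fuse.active": False}):
    fuse_inside = dask.config.get("optimization.fuse.active")
    x = da.ones((1000, 1000), chunks=(250, 250))  # toy array
    total = float((x + 1).sum().compute())

print(fuse_inside, total)
```

Scoping the setting this way makes it easier to compare timings with and without fusion on the same workload before deciding to turn it off everywhere.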

Visualizing Large Workloads

The Dask community is seeking feedback on how to best visualize cluster activity for large workloads with many tasks and workers. Do you have thoughts on data visualization? Feel free to chime in at https://github.com/dask/distributed/issues/4260.

Wrapping Up

That’s it. Thanks for reading, everyone.

If you’re interested in taking Coiled Cloud for a spin, you can do so for free today.