Who uses Dask?

by Matthew Rocklin

Motivation

People new to Dask often ask “Who uses Dask?”

They typically mean one of two questions:

  1. Do many people use Dask?
  2. Do people in my field use Dask?

The answer to both questions is “definitely”. Dask users are numerous and varied. The user community spans a wide range of applications and professions. This post tries to paint a very rough sketch of Dask’s current user base.

Scalable Data Science in Python

The majority of Dask users are professionals who …

  1. Have scalability problems, such as having more data than they can process on their own computer or wanting to build bigger models, and
  2. Use Python.

Dask can be a good fit for anyone who meets these criteria. As we’ll see, this includes data folk working in finance, geosciences, biomedical research, urban planning, and astrophysics, among many other verticals.

Do many people use Dask?

Download Counts

Yes. Dask has grown in an organic, grass-roots way over the last five years. As a result the Dask community is far-reaching, and has roots in most data communities today. Quantifying the adoption of a community-driven open source project is notoriously hard. Last February we looked at download counts and page views and decided that even though Dask has 100k daily downloads, the truer user count is probably closer to 15,000 (honest metrics are hard to do well).

Judging from questions on Stack Overflow and issues posted on GitHub issue trackers, these users come from every conceivable industry and scientific discipline where Python is heavily used.

Python Survey

Additionally, about 5% of Python users use Dask (according to the Python user survey). This is less than a project like Apache Spark, but more than a project like Apache Hive.

Do people in my field use Dask?

If your field solves data-intensive problems and prefers Python, then the answer is “almost surely”.

The easiest way to break up who uses Dask is to talk about how people use Dask, and list sectors and companies within each common usage pattern.

Data Science with Pandas and Scikit-Learn

Many traditional data science sectors use Dask with data science libraries like Pandas, Scikit-Learn, NLTK, spaCy, and others. These include companies in banking and finance (like Capital One and JP Morgan Chase), insurance (like Allstate and State Farm), and government and city planning (like the US Government and the NYPD).

Timeseries analysis

Other groups in finance (like BlackRock and TD Ameritrade), heavy industry (like Caterpillar), automotive (like Tesla), and logistics (like JDA/Blue Yonder) use Dask to analyze large volumes of telemetry to reduce costs and plan daily activities.

Array analytics with NumPy

Dask is unique among data analytics tooling in that it can operate at scale on gridded data structures in 2D (like satellite or microscope images), 3D (like fluid dynamics simulations or MRI scans), and 4D (like climate simulations).

This has brought in a huge set of users in:

  • Remote sensing, like NASA/ESA
  • Meteorology, like NCAR, the UK Met Office, the European Centre for Medium-Range Weather Forecasts (and pretty much every other European and Australian weather service)
  • Hydrology, like USGS, or the UK Hydrographic Office
  • The various technology companies that surround this space, like Element84 and Radiant Earth
  • Microscopy, like the Howard Hughes Medical Institute and the Chan Zuckerberg Initiative
  • Large energy companies, like Shell and Schlumberger
  • Department of Energy National Labs, like Los Alamos, Brookhaven, and NERSC

And many, many more.
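
For readers who have not seen the array interface these groups rely on, here is a minimal sketch using `dask.array`. The 4D shape (time, level, latitude, longitude), the chunk sizes, and the all-ones data are assumptions made purely for illustration; real workloads would load data from files instead.

```python
import dask.array as da

# Hypothetical 4D dataset shaped (time, level, lat, lon), filled with ones.
# The chunks argument splits it into blocks that can be processed independently.
data = da.ones((8, 4, 100, 100), chunks=(2, 4, 50, 50))

# NumPy-style operations only build a lazy task graph at this point.
time_mean = data.mean(axis=0)

# compute() executes the graph block by block and returns a NumPy array.
result = time_mean.compute()
print(result.shape)
```

The same NumPy-style code applies whether the array fits in memory or is much larger, because each block is processed separately.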

Machine Learning

Through a multi-year collaboration with the Scikit-Learn developers at Columbia University, Inria, and other institutions, as well as substantial backing from NVIDIA (a sizable number of Dask developers are employed by NVIDIA), Dask has established itself as a pragmatic choice for traditional machine learning. In practice, this kind of scalable machine learning is only really necessary for larger institutions with very large datasets, like Capital One and Walmart, for whom the native integration with XGBoost has been particularly helpful.

Bespoke parallelism

The fact that Dask exposes its internal task scheduler to users has made it incredibly valuable for institutions with highly complex workloads, especially those that change rapidly. This makes Dask a good fit for quantitative trading firms, as well as credit lenders like Capital One and Barclays.
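
One common way to hand custom work to the task scheduler is `dask.delayed`. The sketch below is illustrative only: the functions `load`, `clean`, and `combine` are hypothetical stand-ins for whatever steps a real workload would contain.

```python
from dask import delayed


@delayed
def load(n):
    # Hypothetical loader: pretend to read n records.
    return list(range(n))


@delayed
def clean(records):
    # Hypothetical cleaning step: keep only the even values.
    return [r for r in records if r % 2 == 0]


@delayed
def combine(parts):
    # Hypothetical reduction: add up everything that survived cleaning.
    return sum(sum(p) for p in parts)


# Calling delayed functions builds a task graph; no work runs yet.
parts = [clean(load(n)) for n in (4, 6, 8)]
total = combine(parts)

# compute() passes the graph to the scheduler, which can run
# independent branches in parallel.
result = total.compute()
print(result)
```

Because the graph is built from ordinary Python function calls, it can take whatever shape the workload needs, not only the shapes that fit a dataframe or array.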

Domain Specific Solutions

Bespoke parallelism also enables a number of other projects that have spun up around Dask, each of which brings in its own set of users. So to answer the question “who uses Dask?” we also have to ask “who uses …”

  • Prefect (workflow management)
  • Napari (image viewer)
  • FeatureTools (ML feature engineering)
  • Stumpy (time series)
  • TSFresh (time series)
  • SatPy (remote sensing)
  • BlazingSQL (SQL on GPUs)

This is the really exciting trend we see in Dask today. Increasingly, people are building more domain-specific systems on top of Dask, and that brings new verticals into the fold. Practitioners from a specific domain are able to use Dask to build scalability directly into a software solution that is just right for their audience. They understand how their users think and what they want much better than the Dask core team ever could.

Learn More

I’ll be speaking more about this at AnacondaCon in June. Sign up for the Coiled email list below and we’ll share the video with you once it’s live!

