What is Dask?

Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. ~ Dask.org

Dask is a free and open source library that helps scale your data science workflows and provides a complete framework for distributed computing in Python. It is used by researchers, data scientists, large corporations, and even government agencies. Notable Dask users include NASA, Harvard Medical School, Walmart, and Capital One.

Scaling Data Science Workflows

Dask integrates seamlessly with the PyData ecosystem, making it easy to scale your NumPy, pandas, and scikit-learn code. Dask is purely Python-based and has a familiar API that makes onboarding quick and adoption simple.

[Image: scikit-learn ML algorithms being scaled with Dask]
[Image: logos of projects that work with Dask, including Apache Airflow, Pangeo, scikit-learn, Featuretools, Prophet, Iris, Prefect, PyTROLL, xarray, RAPIDS, and dmlc XGBoost]

Distributed Computing Toolbox

Dask can help scale up your computation to use all the cores of your workstation, or scale out to leverage cloud services like AWS, Azure, and GCP. Dask is a distributed computing toolbox and has been used with many other libraries, including XGBoost, Prefect, RAPIDS, and more.
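As a minimal sketch of scaling up on a single workstation: a Dask Array splits a large array into chunks, and Dask's default threaded scheduler spreads the chunk-wise work across your local cores. The array shape and chunk size below are arbitrary choices for the example.

```python
import dask.array as da

# A 2,000 x 2,000 array of uniform random numbers, split into 500 x 500 chunks.
x = da.random.random((2_000, 2_000), chunks=(500, 500))

# .compute() executes the task graph; the chunk-level operations
# run in parallel across the machine's cores.
result = (x + x.T).mean().compute()
print(result)  # close to 1.0 for uniform [0, 1) data
```

The same code can later run on a multi-machine cluster by connecting a distributed client, without changing the array logic itself.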

Collections, Schedulers, and Workers

Dask collections provide the API used to write Dask code. Collections create task graphs that define how to perform the computation in parallel. The actual computation is performed on a “cluster”, which consists of:

  • A scheduler, which manages the flow of work and sends tasks to the workers, and
  • Workers, which compute the tasks given to them by the scheduler.

At the very beginning of the process, there is a “client” that lives where you write your Python code. It is the user-facing entry point that passes the tasks on to the scheduler.

Collections (Dask Array, Dask DataFrame, Dask Bag, Dask Delayed, Futures) create task graphs. Schedulers execute task graphs.
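The task-graph idea can be sketched with Dask Delayed, the lowest-level collection. Each decorated call below only records a node in the graph; the scheduler executes the graph when .compute() is called, running independent tasks in parallel. The functions `inc` and `add` are toy examples, not part of Dask.

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Nothing runs yet: each call just adds a node to the task graph.
a = inc(1)    # will produce 2
b = inc(10)   # will produce 11
total = add(a, b)

# The scheduler walks the graph; inc(1) and inc(10) are independent
# and can run in parallel before add() combines them.
print(total.compute())  # 13
```

Dask Array, DataFrame, and Bag build exactly these kinds of graphs for you behind their NumPy- and pandas-like APIs.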

Dask Dashboards

Dask provides a live, interactive dashboard containing multiple plots and tables to help diagnose the state of your cluster. It includes:

  • A cluster map, which visualizes interactions between the scheduler and the workers
  • A task stream, which shows the activity of each worker in real time
  • A progress bar, which displays the progress made on each group of tasks
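A quick sketch of reaching the dashboard, assuming the `distributed` package is installed (it ships with `pip install "dask[distributed]"`): creating a Client starts a local cluster, and the scheduler serves the dashboard at the URL exposed as `dashboard_link`.

```python
from dask.distributed import Client

# Start a lightweight in-process cluster; the Client is the
# user-facing entry point that sends tasks to the scheduler.
client = Client(processes=False)

# The scheduler serves the dashboard; open this URL in a browser
# (typically something like http://127.0.0.1:8787/status).
print(client.dashboard_link)

# Any work submitted through the client shows up in the dashboard's
# task stream and progress plots.
print(client.submit(sum, [1, 2, 3]).result())  # 6

client.close()
```

Note that rendering the dashboard itself also requires Bokeh to be installed; without it the cluster still runs, but the plots are unavailable.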

Coiled: Dask in the Cloud

Coiled makes it easy to scale your data science workflows to the cloud using Dask.

Coiled handles the DevOps so that you can focus on data science. It takes care of deploying containers, securely configuring networking, managing Docker images, and more!
