What is Dask?

Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. ~ Dask.org

Dask is a free and open source library that helps scale your data science workflows and provides a complete framework for distributed computing in Python. It is used by researchers, data scientists, large corporations, and even government agencies. Notable Dask users include NASA, Harvard Medical School, Walmart, and Capital One.

Scaling Data Science Workflows

Dask integrates seamlessly with the PyData ecosystem, making it easy to scale your NumPy, pandas, and scikit-learn code. Dask is written purely in Python and has a familiar API that makes onboarding quick and adoption simple.

Figure: scikit-learn ML algorithms being scaled.
Figure: logos of Apache Airflow, Pangeo, scikit-learn, Featuretools, Prophet, Iris, Prefect, PyTROLL, xarray, RAPIDS, and dmlc XGBoost.
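
To make the "familiar API" point concrete, here is a minimal sketch that computes the same quantity with NumPy and with Dask's array collection. The array size and chunk size are arbitrary values chosen for illustration, not recommendations.

```python
import numpy as np
import dask.array as da

# NumPy: the whole array lives in memory and is evaluated eagerly.
x_np = np.arange(1_000_000, dtype="float64")
np_mean = (x_np ** 2).mean()

# Dask: the same expression, but the array is split into 10 chunks
# and nothing is computed until .compute() is called.
x_da = da.arange(1_000_000, chunks=100_000, dtype="float64")
lazy_mean = (x_da ** 2).mean()   # builds a task graph only
da_mean = lazy_mean.compute()    # runs the graph, chunk by chunk
```

The Dask version differs from the NumPy version only in the `chunks` argument and the final `.compute()` call.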

Distributed Computing Toolbox

Dask can help you scale up your computation to use all the cores of your workstation, or scale out to cloud services such as AWS, Azure, and GCP. Dask is a distributed computing toolbox and has been used together with other software, including XGBoost, Prefect, and RAPIDS.
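
As a rough illustration of scaling up on a single machine, the sketch below runs the same small task graph with two of Dask's built-in single-machine schedulers. The `square` function is a hypothetical stand-in for real work.

```python
from dask import delayed

@delayed
def square(n):
    # Hypothetical placeholder for an expensive computation.
    return n * n

# Eight independent tasks plus one task that combines their results.
tasks = [square(i) for i in range(8)]
total = delayed(sum)(tasks)

# Thread pool on the local machine; independent tasks can run concurrently.
result_threads = total.compute(scheduler="threads")

# Single-threaded execution, handy for debugging.
result_sync = total.compute(scheduler="synchronous")
```

Both calls execute the same graph, so they return the same value; only the way tasks are run changes.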

Collections, Schedulers, and Workers

Dask collections provide the API used to write Dask code. Collections create task graphs that define how to perform the computation in parallel. The actual computation is performed on a “cluster”, which consists of:

  • a scheduler, which manages the flow of work and sends tasks to the workers, and
  • workers, which compute the tasks given to them by the scheduler.

At the very beginning of the process is a “client”, which lives where you write your Python code. It is the user-facing entry point that passes tasks on to the scheduler.
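
The pieces described above can be sketched on a single machine using `dask.distributed` (this assumes the `distributed` package is installed). The worker count, array shape, and chunk shape are illustrative assumptions.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# A local "cluster": one scheduler plus two workers on this machine.
# processes=False runs the workers as threads in the current process,
# which keeps this demonstration lightweight (an assumption made here,
# not a general recommendation).
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,  # disable the diagnostics dashboard
)

# The client lives alongside your code and talks to the scheduler.
client = Client(cluster)

# A collection: building the expression only creates a task graph.
x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum()

# compute() sends the graph to the scheduler, which assigns tasks to workers.
result = total.compute()

# Ask the scheduler which workers it knows about.
n_workers = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```

Once a `Client` is created, it becomes the default place where `.compute()` runs, so the collection code itself does not change when you move from a laptop to a larger cluster.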