What is Dask?
Dask is a free, open-source Python library that helps scale your data science workflows and provides a framework for distributed computing.
It is used by researchers, data scientists, large corporations, and government agencies. Notably, Dask users include NASA, Harvard Medical School, Walmart, and Capital One.
Scaling Data Science Workflows
Dask integrates with the PyData ecosystem, making it easy to scale your NumPy, pandas, and scikit-learn code. Dask is written in pure Python and has a familiar API, which makes onboarding quick and adoption simple.
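To illustrate the familiar API, here is a minimal sketch that runs the same expression with NumPy and with a Dask array. The array size and the chunk size of 100,000 are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import dask.array as da

# NumPy: the whole array lives in memory and is computed eagerly.
x_np = np.arange(1_000_000, dtype="float64")
np_mean = (x_np ** 2).mean()

# Dask: the same expression, but the array is split into 10 chunks.
# The chunk size here is an arbitrary choice for this example.
x_da = da.arange(1_000_000, dtype="float64", chunks=100_000)
lazy_mean = (x_da ** 2).mean()  # builds a task graph; nothing runs yet
da_mean = lazy_mean.compute()   # executes the graph and returns a number

print(np.isclose(np_mean, da_mean))
```

The only visible differences are the `chunks` argument and the explicit call to `compute()`; the arithmetic itself is written exactly as it would be in NumPy.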
Distributed Computing Toolbox
Dask can scale up your computation to use all the cores of your workstation, or scale out to cloud services on AWS, Azure, and GCP. As a distributed computing toolbox, Dask has been used with many other Python libraries, including XGBoost, Prefect, and RAPIDS.
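As a rough sketch of scaling up on a single machine, the example below wraps ordinary functions with `dask.delayed` and runs them on Dask's local threaded scheduler. The functions `square` and `total` are hypothetical names made up for this illustration. Threads are used here only for simplicity; CPU-bound pure-Python work generally benefits more from processes or a distributed cluster.

```python
from dask import delayed

@delayed
def square(n):
    # Hypothetical unit of work; each call becomes one task.
    return n * n

@delayed
def total(values):
    # Runs only after every square() task has finished.
    return sum(values)

# Calling delayed functions builds a task graph instead of running them.
lazy_result = total([square(n) for n in range(10)])

# Execute the graph on the local thread pool.
result = lazy_result.compute(scheduler="threads")
print(result)  # 285
```

The ten `square` tasks do not depend on each other, so the scheduler is free to run them in parallel.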
Collections, Schedulers, and Workers
Dask collections provide the API used to write Dask code. Collections create task graphs that define how to perform the computation in parallel. The actual computation is performed on a cluster, which consists of:
- A scheduler that manages the flow of work and sends tasks to the workers
- Workers that compute the tasks given to them by the scheduler
At the start of the process is the client, which lives where you write your Python code. It is the user-facing entry point that passes tasks on to the scheduler.
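The flow from client to scheduler to workers can be sketched with the standard library alone. This is a toy model, not Dask's scheduler: the graph is a plain dictionary in the style of a Dask task graph (a key maps to a literal value or to a tuple of a function and its arguments), and the helper names `is_task`, `dependencies`, and `run_graph` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# Task graph: key -> literal value, or (function, *arguments).
# String arguments that match a key refer to that key's result.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", 10),
}

def is_task(value):
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])

def dependencies(graph, task):
    return [arg for arg in task[1:] if isinstance(arg, str) and arg in graph]

def run_graph(graph, target, max_workers=4):
    """Toy scheduler: submit each task once its dependencies are done.

    For simplicity it computes the whole graph, not only what target needs.
    """
    results = {k: v for k, v in graph.items() if not is_task(v)}
    pending = {k: v for k, v in graph.items() if is_task(v)}
    with ThreadPoolExecutor(max_workers=max_workers) as workers:
        while pending:
            ready = [k for k, t in pending.items()
                     if all(d in results for d in dependencies(graph, t))]
            if not ready:
                raise ValueError("cycle or missing dependency in task graph")
            futures = {}
            for key in ready:
                func, *args = pending.pop(key)
                resolved = [results[a] if isinstance(a, str) and a in graph else a
                            for a in args]
                # The "scheduler" hands the task to a "worker" thread.
                futures[key] = workers.submit(func, *resolved)
            for key, future in futures.items():
                results[key] = future.result()
    return results[target]

# The "client" side: hand the graph over and ask for one result.
answer = run_graph(graph, "product")
print(answer)  # 30
```

In this model the thread pool plays the role of the workers, the loop in `run_graph` plays the scheduler, and the final call is what the client does. A real Dask cluster adds networking, data locality, and fault handling that this sketch leaves out.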
Dask provides a live interactive dashboard containing multiple plots and tables to help diagnose the state of your cluster. It includes:
- A cluster map that visualizes interactions between the scheduler and the workers
- A task stream that shows, in real time, the activity of each worker
- A progress bar that displays the progress being made on each task
[Figure: Dask task stream]
Coiled Cloud: Dask in the Cloud
Coiled Cloud makes it easy to scale your data science workflows to the cloud using Dask. Coiled Cloud handles the DevOps so that you can focus on data science. It takes care of deploying containers, securely hooking up networking, managing Docker images, and more.