Is Dask production ready? (You bet it is.)
April 7, 2021
You may have heard data scientists extol the virtues of scaling data science in Python with Dask. Their praises ring true: Dask works with the Python tools and libraries you love, providing an easy-to-use framework for parallel computations. With minimal code changes, you can process your data in parallel, which means less time waiting on execution and more time for analysis.
That’s great for data analytics at scale, particularly when you are a data scientist whose analysis requires significant compute and you’re working on a single laptop. But what about a system in production? Is Dask production ready?
Development and Production in Data Science
First, let’s take a look at how we define development and production. In data science, a model or system in development is one that you are building, validating, and testing in a highly iterative fashion. A system in development is more flexible, and you can expect it to crash often.
A system in production is a model, or set of models, deployed and maintained in its reliable, final configuration. Production workloads typically mean large, frequent jobs (think every hour) or jobs that run when a large dataset enters the system. An organization might deploy that system to provide a product or service. In production, a system must operate reliably to meet customer requirements and your organization’s SLA (service-level agreement) standards.
Why use Dask in production?
Dask is useful in both development and production environments because it lets you scale the same computations you would typically process in memory to much larger workloads.
Dask accelerates your existing workflow with little to no code changes. It works with your favorite tools in the Python ecosystem to create an easy-to-use framework for parallel computations, whether across multiple cores on a single workstation or across multiple nodes in a cluster.
Dask is highly scalable, allowing you to design distributed computing architectures from a single node to thousand-node clusters, making it ideal for accelerating large-scale data analysis with parallel computing power. This range matters because production environments have different requirements than development environments.
Who uses Dask in production?
Data scientists and machine learning engineers across industries use Dask and other tools in the PyData ecosystem to process massive datasets, both in development and in production.
Dask gives them the power to scale analysis and predictions to data that exceeds the capacity of their machines or systems. For example, they may be working with Python packages like pandas or NumPy, which are especially useful for data science but are not designed to process the massive data used in enterprise-level production environments.
Here are a few enterprise organizations that are using production deployments of Dask, Python, and other tools in the PyData ecosystem:
|WHO’S USING IT|DASK USE CASE|OUTCOME|
|Harvard Medical School (medical)|Researchers used Dask to do interactive image processing at scale.|They were able to view massive datasets in unique ways, including viewing the process of a cell undergoing mitosis.|
|Capital One (finance)|Uses Dask to scale Python workloads and reduce model training times.|Early implementations of Dask reduced training times by 91% within a few months of development effort.|
|NASA (science)|Uses Dask and the PyData stack to accelerate data analysis.|They used Dask and Xarray to develop an advanced Python API that applies scientific tools for preprocessing, regridding, machine learning, and visualization.|
|Walmart (retail)|Uses Python, Dask, and XGBoost to conduct weekly product forecasting of more than 500 million store-item combinations.|They are solving challenges like supply chain management and last-mile delivery to serve 265 million customers.|
5 Reliable Architectures for Production Workloads in Dask
How you deploy reliable architectures for production workloads in Dask will vary based on the nature of the requests you are sending to workers (i.e., machines) in Dask. In a September blog article for Coiled, Dask creator Matt Rocklin explored Dask in production and the key axes to consider in production workloads: request scale, request concurrency, response time, and resilience.
He outlined five different and highly reliable architectures for production workloads in Dask:
For each kind of workload, your best bet is to build the architecture listed after it:
- Few, large requests: a Dask cluster per request
- Many small requests: a single-machine scheduler
- Many large requests: a Dask cluster per request
- Many large requests: a shared cluster
- Many large requests: a multi-scheduler
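To make two of these patterns concrete, here is a rough sketch of "a Dask cluster per request" versus "a shared cluster" using `dask.distributed`. It is not taken from the original article: `square`, `run_on_fresh_cluster`, and `run_on_shared_cluster` are hypothetical names, and `LocalCluster` stands in for whatever cluster manager a real deployment would use.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    """Stand-in for real per-item work (hypothetical)."""
    return x * x


def run_on_fresh_cluster(values):
    """Cluster per request: start a cluster, run one request, shut it down."""
    # processes=False keeps the workers in-process so the sketch runs on a
    # laptop; dashboard_address=None disables the diagnostics dashboard.
    with LocalCluster(n_workers=2, threads_per_worker=1,
                      processes=False, dashboard_address=None) as cluster:
        with Client(cluster) as client:
            futures = client.map(square, values)
            return sum(client.gather(futures))


def run_on_shared_cluster(client, values):
    """Shared cluster: many requests reuse one long-lived client."""
    futures = client.map(square, values)
    return sum(client.gather(futures))


if __name__ == "__main__":
    # One isolated cluster for a single large request.
    print(run_on_fresh_cluster([1, 2, 3]))

    # One cluster kept alive and shared by several requests.
    with LocalCluster(n_workers=2, threads_per_worker=1,
                      processes=False, dashboard_address=None) as cluster:
        with Client(cluster) as shared:
            for request in ([1, 2], [3, 4], [5, 6]):
                print(run_on_shared_cluster(shared, request))
```

The trade-off the sketch hints at: a fresh cluster isolates each request but pays startup cost every time, while a shared cluster avoids that cost at the price of requests competing for the same workers.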
It is important to consider these variables as you design the architecture for the system you are deploying. Tools like Coiled can manage Dask for you, making it easier to spin up Dask clusters in the cloud.
Brought to you by the creators of Dask, Coiled scales your existing Python workflows and collaboration-ready Dask projects across diverse use cases, all in the Python-native environment you know and love.