The Coiled Team May 4, 2021

5 Ways to Use Dask and Python to Scale Data Analysis



Scaling data analysis with Python has never been easier. The Python ecosystem, with pandas, Scikit-learn, NumPy, and other libraries, makes analysis straightforward. However, as you apply that analysis to larger datasets, you'll encounter challenges, because these tools were not designed to scale beyond a single machine. In the past, you'd have to learn and adopt new tools to scale your work.

Dask changed all of that. It scales workflows built on the Python tools you already use, integrating natively with their familiar APIs and data structures, so it requires minimal code changes on your part. Dask also evolves alongside these libraries, giving you a seamless transition from your local machine to a multi-core workstation, and on to a distributed cluster if your workload requires it.
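For example, here is a minimal sketch of how a pandas-style workflow carries over to Dask (the file pattern and column names are hypothetical):

```python
import dask.dataframe as dd

# Read a hypothetical directory of CSVs; dd.read_csv accepts glob
# patterns and splits the files into partitions
ddf = dd.read_csv("data/2021-*.csv")

# Familiar pandas-style operations build a lazy task graph...
result = ddf.groupby("user_id").amount.mean()

# ...which runs, in parallel, only when you ask for the answer
print(result.compute())
```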

Here are five ways you can use Dask and Python to scale data analysis:

1) Iterate faster

Experimenting is important when you're doing science with data because it is an iterative process, and you can discover new insights along the way. Of course, ultimately, you'll put your model into production, and so you look for that just-right combination of data, features, and rules that solves your target problem in the real world. The faster you can iterate and evolve your approach, the better your outcomes are likely to be, and the sooner you'll reach them.

Dask accelerates your existing workflow, which speeds up your discovery and iteration process, so you spend less time waiting on execution and more time reviewing your model's results and tuning it for quality.

2) Perform complex parallel computations

Dask can handle more advanced algorithms for statistics, machine learning, and time-series or local operations. It exposes low-level APIs to its internal task scheduler, which can execute these sophisticated computations in parallel.

You can build your own parallel computing system using the same machinery that powers Dask's arrays, DataFrames, and machine learning algorithms. You control the complex business logic, while Dask handles network communication, load balancing, diagnostics, and other plumbing.
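As a rough sketch, dask.delayed is one of these low-level APIs: it wires arbitrary Python functions into a parallel task graph. The functions below are illustrative stand-ins for real business logic:

```python
import dask

@dask.delayed
def load(i):
    # stand-in for reading one chunk of data
    return list(range(i, i + 3))

@dask.delayed
def process(chunk):
    # stand-in for custom business logic applied to each chunk
    return sum(chunk)

@dask.delayed
def combine(results):
    # reduce the per-chunk results into one answer
    return sum(results)

# Build the task graph lazily, then execute it in parallel
chunks = [load(i) for i in range(0, 9, 3)]
processed = [process(c) for c in chunks]
total = combine(processed)
print(total.compute())  # 36
```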

3) Track diagnostics at scale

With interactive parallel computing, it can be frustrating when you don't know how far your computations have progressed, what is causing problems, or where to focus improvement efforts. Identifying and resolving bugs and performance issues also takes longer when processes run on machines beyond your local one.

Dask pulls back the curtain on the status of your computations with helpful diagnostic and investigative tools. A dashboard shows real-time progress, communication costs, memory use, and other metrics of interest. Its statistical profiler, installed on every worker, polls each thread so you can see which lines of your code take the most time, across the entire computation.

You can see the state of your computation with a pop-up terminal, and Dask gives you the power to use the traditional debugging tools you’re used to, even if the error happens remotely.
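As a minimal sketch, starting a distributed Client serves the dashboard locally; the array computation below is just a stand-in workload to watch on it:

```python
import dask.array as da
from dask.distributed import Client

# Starting a client also launches the diagnostic dashboard
client = Client()  # defaults to a cluster of local workers
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status

# Any computation run through the client shows up live on the
# dashboard: task stream, progress, worker memory use, and more
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())
```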

[Screenshot: a Dask dashboard with four panels: Dask Graph, Dask Cluster Map, Dask Task Stream, and Dask Workers.]
Dask helps you track diagnostics at scale. Dask dashboards allow you to view and diagnose the state of your cluster, workers, tasks, and progress. Source: Coiled.io

4) Customize complex workloads

If your Python workloads are complex, Dask offers options to support them. Dask collections are composed of ordinary Python objects, so you can map custom functions across the partitions of a dataset with few changes, as the sketch below shows. Dask also includes methods for more complex functions that require more advanced communication, for example, passing information from one partition to another or building a custom aggregation.
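A minimal sketch of that partition mapping, using map_partitions with made-up example data and column names:

```python
import dask.dataframe as dd
import pandas as pd

# A small Dask DataFrame built from pandas, split into 2 partitions
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})
ddf = dd.from_pandas(df, npartitions=2)

# A custom function applied independently to each pandas partition
def add_ratio(part: pd.DataFrame) -> pd.DataFrame:
    part = part.copy()
    part["ratio"] = part["x"] / part["y"]
    return part

result = ddf.map_partitions(add_ratio)
print(result.compute())
```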

If your workloads are particularly complex, you can convert your collections into individual blocks and use Dask Delayed to arrange them to your specifications, as sketched below. The Dask documentation covers more customization techniques.
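A sketch of that blocks-plus-Delayed approach, converting a DataFrame into per-partition delayed objects and arranging them with a hypothetical pairwise combiner:

```python
import dask
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"x": range(8)})
ddf = dd.from_pandas(df, npartitions=4)

# Break the collection into one delayed object per partition
parts = ddf.to_delayed()

@dask.delayed
def combine(a: pd.DataFrame, b: pd.DataFrame) -> int:
    # hypothetical logic that needs two partitions at once
    return len(a) + len(b)

# Arrange the blocks to your specification: here, neighboring
# partitions are processed in pairs
pairs = [combine(a, b) for a, b in zip(parts[::2], parts[1::2])]
print(dask.compute(*pairs))  # (4, 4)
```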

5) Manage Dask in the cloud with Coiled

Coiled manages Dask for you in the cloud, making it easier to spin up Dask clusters and further accelerate your computations. With small code changes, you can run your models much faster. Coiled cluster tables allow you to track important cluster details, such as status, number of workers, configuration, and cost.

[Screenshot: a Coiled cluster table.]
A Coiled cluster table allows you to see and track critical cluster details, such as status (running or stopped), number of workers, configuration, and cost. Source: Coiled.io
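As a sketch, spinning up a managed cluster looks roughly like this (the worker count is illustrative, and the exact coiled.Cluster parameters may differ across versions):

```python
import coiled
from dask.distributed import Client

# Launch a managed Dask cluster in the cloud
cluster = coiled.Cluster(n_workers=10)

# Connect a Dask client to it; the rest of your code stays the
# same as when you run locally
client = Client(cluster)
print(client.dashboard_link)
```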

If you are an individual, Coiled makes it possible to seamlessly scale your workflows to the cloud. As part of a team, you can use Coiled to collaborate, share software environments, and track your costs. For enterprises, Coiled delivers the power of a distributed architecture while observing best practices for security and compliance.

Supercharge the power of Dask with Coiled

Coiled makes Dask available to everyone, everywhere. Brought to you by the creators of Dask, Coiled scales your existing Python workflows and collaboration-ready Dask projects across diverse use cases, all in the Python-native environment you know and love.

Ready to learn more? Contact us and we will get in touch.