Coiled Runtime: Toward Using Dask in Production
• September 8, 2022
The popularity of the PyData ecosystem in data science and scientific computing is well known. Dask was created to help free these tools from their original constraints so they could be applied more easily to big data problems. Since its inception, Dask has developed at a rapid pace; for the last several years, new versions have been released roughly twice a month. This aligns with the mantra of “move fast and break things,” and it has helped drive user engagement. As a Dask user, it’s satisfying to see bugs fixed quickly and new features landing in the project.
With their rising popularity, these tools have also made their way into production systems, where breaking things is itself a deal-breaker. Maintainers of mission-critical systems demand stability and reliability; they don’t want calls in the middle of the night because a workload has failed. They expect things to simply work, which creates an understandable tension between the need for rapid iteration and production-grade stability. It was this tension that prompted conversations in the Dask community about how to reconcile the needs of users who want to run Python and Dask in production with the desire to keep development velocity high.
One suggestion for resolving this tension was to create “blessed” versions of Dask. Though the idea garnered input from a wide cross-section of Dask users, several challenges hindered progress: What criteria would qualify a release as blessed? Is a formal validation process even needed? Should bug fixes be backported? The community has also raised the possibility of introducing integration tests and release candidates as ways to identify stable versions of Dask, but none of these ideas have gained traction.
The Next Step: Integration & System Testing
Both Dask and Distributed have robust suites of unit tests, with more than 90% coverage. These tests serve their intended purpose: verifying that separable units of each project behave correctly in isolation, independent of the larger system. This is foundational to test-driven development, and one step toward building production-grade solutions. However, it is not the only component.
Circling back to the mantra of “move fast and break things,” the State of DevOps Report 2021 teaches us that elite software engineering teams deploy 973x more frequently than low-performing teams. Clearly these teams are moving fast. But changes made by these teams are also one-third as likely to fail, and they recover from incidents 6,570x faster than their low-performing counterparts. While they move fast, they are also less likely to break things. There are surely many reasons for this difference, but at least one contributor is that elite teams are over 4x more likely to have implemented practices around monitoring and observability.
This suggests that, while unit tests are a great start, they are not sufficient for assessing production-grade deployments. Software runs on hardware that must be configured properly, and software components must operate well together to give the best user experience; if we never exercise them together, we can’t understand how they behave. Hence, traditional software testing also includes evaluating subsets of the overall system in combination (integration tests) and assessing the functionality of all components of the system, including hardware, when deployed together (system tests). Putting these practices into place allows developers to evaluate the impact of hardware configuration on performance, detect regressions that are only apparent when running at scale, catch unintended breaking changes, and assure all of the components of the platform work together to provide the best possible experience for new and experienced users alike.
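The distinction can be sketched with a toy pytest-style example (the function and workload here are illustrative, not part of Dask’s actual test suites). The unit test checks one function in isolation; the integration test pushes the same logic through Dask’s chunked arrays and task scheduler, where chunking or scheduling bugs invisible to the unit test would surface.

```python
import numpy as np
import dask.array as da


def normalize(x):
    # The "unit" under test: shift an array to zero mean.
    return x - x.mean()


def test_normalize_unit():
    # Unit test: a tiny in-memory input, no scheduler involved.
    out = normalize(np.array([1.0, 2.0, 3.0]))
    assert abs(out.mean()) < 1e-9


def test_normalize_integration():
    # Integration test: the same logic run through Dask's chunked
    # arrays and task scheduler.
    x = da.random.random((500, 500), chunks=(100, 100))
    assert abs(float(normalize(x).mean().compute())) < 1e-9
```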
The Impact Of Improved Observability
One pleasant aspect of getting started with Dask is the ease of setup on a single machine. New users can get up and running with a sophisticated distributed compute platform easily, particularly if they’re using data stored locally in traditional file formats. But getting the most out of Dask for large-scale, big data problems typically requires running it in combination with other tools from the PyData ecosystem. Assuring that users have a curated Python environment whose performance has been validated on well-defined hardware is one step along the path to production readiness.
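For illustration, a minimal single-machine setup might look like the sketch below (worker counts and array sizes are arbitrary, and `processes=False` keeps the example lightweight). The key point is that the same API later scales out to a multi-machine cluster without code changes.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster


def quickstart_mean():
    # A full scheduler/worker setup on one machine.
    cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=False)
    client = Client(cluster)
    try:
        # A simple parallel computation: the mean of a chunked random array.
        x = da.random.random((4_000, 4_000), chunks=(500, 500))
        return float(x.mean().compute())
    finally:
        client.close()
        cluster.close()
```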
This fact is reflected in the composition of the Coiled Runtime, the Python environment the Coiled team uses for integration and system testing of Dask. To assure users who want to run Dask in production have a great experience, we pin versions of packages like s3fs, Zarr, PyArrow, Xarray, and others, and validate that they work as expected with Dask. Subsystems of Dask and representative workloads are run on infrastructure provisioned by Coiled, allowing Dask developers to assess how changes to the full compute environment affect performance across a variety of workloads. By monitoring performance in this way, we can quickly detect regressions, track Dask’s performance over time, and evaluate the impact of changes on a broad array of workloads.
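A curated environment of this kind is typically expressed as a pinned environment file. The sketch below is only in the spirit of the Coiled Runtime; the version pins shown are illustrative (the Dask pin matches the release discussed later in this post), not the runtime’s actual contents.

```yaml
# environment.yml -- an illustrative pinned environment, not the real
# Coiled Runtime specification
name: curated-dask
channels:
  - conda-forge
dependencies:
  - python=3.10
  - dask=2022.8.1
  - distributed=2022.8.1
  - s3fs
  - zarr
  - pyarrow
  - xarray
```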
Users interested in learning more about this work are invited to check out the publicly available Benchmarks page of the Coiled Runtime repository. Over the last several months, as the Dask Engineering team at Coiled has adopted these practices, we’ve started adding tests like the stability test to protect against unintentionally introducing deadlocks, tracking CI failures back to upstream packages that inadvertently break Dask, and validating that popular packages like XGBoost work well with Dask, thereby assuring data scientists have a good experience when getting started.
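A deadlock-protection test in this spirit might look like the following sketch (the shapes and graph structure are illustrative, not the actual Coiled stability test). The assertion is deliberately loose; the real check is that a representative graph finishes at all rather than hanging, which in CI would be paired with a suite-level timeout.

```python
import dask.array as da


def test_graph_completes():
    # A representative graph: random data, a rechunk, and a reduction,
    # exercising scheduler paths where past deadlocks have appeared.
    x = da.random.random((1_000, 1_000), chunks=(100, 100))
    result = float(x.rechunk((250, 250)).sum().compute())
    # Loose sanity bound on the sum of one million uniform values.
    assert 0 < result < 1_000_000
```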
Integration tests for features like adaptive scaling, which itself depends on work-stealing, provide both visibility into potential performance changes and confidence that the two systems will work together as expected. We’ve also added tests for critical subsystems like Dask’s spill/unspill functionality, which helped pinpoint an error in how the amount of data spilled to disk is calculated. Completing this work enabled fixing how memory limits were set on Coiled, which in turn gives users a much better experience.
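An adaptive-scaling check along these lines might be sketched as follows (the scaling bounds and workload are illustrative). The cluster is allowed to scale itself between a minimum and maximum number of workers while a computation runs, and the test only asserts on the final result: if adaptivity and work-stealing interact badly, the workload fails or stalls.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster


def run_adaptive_workload(minimum=1, maximum=4):
    # Let the cluster scale between `minimum` and `maximum` workers
    # while the computation runs; rebalancing work across new workers
    # relies on work-stealing under the hood.
    cluster = LocalCluster(n_workers=minimum, processes=False)
    cluster.adapt(minimum=minimum, maximum=maximum)
    client = Client(cluster)
    try:
        x = da.random.random((2_000, 2_000), chunks=(250, 250))
        return float((x + x.T).sum().compute())
    finally:
        client.close()
        cluster.close()
```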
In the spirit of embracing system tests, the team has also started running benchmarks for common geospatial and DataFrame operations at scale, to validate Dask’s performance against a variety of common workloads. A few examples of how this is paying dividends for users can be found below.
The team planned to release a new version of the Coiled Runtime alongside the Dask 2022.8.1 release. During the initial stages of the release process, we detected a spike in peak memory usage while running Parquet I/O tests, as shown in the plot below. Given this information, we were able to quickly identify the offending PR and fix the regression, ensuring Dask users continue to have a good experience when working with Parquet datasets.
Identifying Breaking Changes
In another example of how this work has benefited Dask users, we can see the impact of properly configuring MALLOC_TRIM_THRESHOLD_. In July of this year, this value was set to 65536 in an attempt to improve Dask worker memory usage, but the change had the unintended consequence of breaking dask-cuda. We reverted the change and observed memory usage increase, as shown in the graph below. A few weeks later, we merged an alternative way to set MALLOC_TRIM_THRESHOLD_ that did not impact dask-cuda, and memory usage again declined. The plot below clearly shows the impact of these changes on cluster-wide memory usage through these transitions. In late August, we merged a change that allows fusing of compatible annotations when optimizing Blockwise layers in the task graph, further reducing cluster-wide memory usage when working with Parquet data.
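For reference, the trim threshold can also be set by hand through Dask’s configuration. The exact config key has moved between distributed versions, and recent releases set a sensible default themselves, so treat this as an illustration rather than the fix that was merged:

```python
import dask

# Ask glibc to return freed memory to the OS more eagerly on each worker.
# 65536 is the value discussed above; the key shown here is the nanny's
# worker-environment mapping (key names vary across distributed versions).
dask.config.set({"distributed.nanny.environ": {"MALLOC_TRIM_THRESHOLD_": 65536}})
```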
While working on scheduling enhancements to address root task overproduction, engineers identified several geospatial workloads that benefited significantly from the change. These promising results prompted the team to go forward with merging the improvement, which provided solid gains for Dask Array users. Shortly afterward, system tests flagged a substantial regression in a common DataFrame operation as an unintended consequence of the same change. We were able to revert the commit, limiting the impact on users while a more robust solution is identified.
Focusing On User Outcomes
As the Dask ecosystem sees increasing use in production workloads, expectations around stability and consistency of performance will necessarily rise. In the instances described above, the ability to evaluate Dask’s performance across a broad set of use cases meant performance regressions and breaking changes were detected quickly, before users were impacted. This work provides a better overall experience for the entire community and improves confidence in the platform. Moving forward, we plan to embrace data-driven development by assuring that significant changes are evaluated holistically.