Getting Started with Coiled: 7 User Tips for Success
July 16, 2021
Over the past few months, I’ve had the chance to work directly with a number of different customers and onboard them into Coiled. Some are small teams working with Dask dataframes on exploratory data analysis workflows. Others are larger teams working with custom parallel code using Dask Delayed for production data engineering pipelines and model training workflows.
We’ve learned a lot together about how teams can move from their current state (slow ETL or data transformations, complex infrastructure to maintain, high cloud computing costs) to their ideal state with Dask and Coiled (faster analysis and iteration on their projects, on-demand clusters, cost savings, and easy scaling up and down).
This post reflects on common, repeatable lessons that we’ve learned along the way: how customers onboard into Coiled and validate that it’s a good fit for their needs, and how the people and tools at Coiled can help with this process. If you’re wondering whether Dask and Coiled can speed up your pipelines and workflows and where to start, you’ll be in good company if you follow some or all of these tips when taking Coiled for a test drive.
Begin with the quick start
The quick start steps in the Coiled dashboard or the Getting Started guide in the Coiled documentation are great places to start with Coiled. These examples run through the basic steps of installing the Coiled Python client, creating a cluster, and running a sample calculation. Once you work through the quick start steps yourself, you’ll have all of the fundamentals and tools necessary to scale up your Dask workloads easily as described in the following tips.
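As a rough sketch of those quick start steps, the snippet below assumes you have already installed the Coiled client and Dask (for example with `pip install coiled "dask[complete]"`) and authenticated the client with your Coiled account. The function name `run_quickstart`, the worker count, and the array sizes are our own illustrative choices, and the arguments that `coiled.Cluster` accepts may differ between versions, so treat the quick start in the dashboard as the reference.

```python
def run_quickstart(n_workers=5):
    """Create a Coiled cluster, run a sample calculation, and shut down."""
    # Imported inside the function so this file can be loaded even before
    # the packages are installed.
    import coiled
    import dask.array as da
    from dask.distributed import Client

    # Assumption: n_workers is accepted by your version of coiled.Cluster.
    cluster = coiled.Cluster(n_workers=n_workers)
    client = Client(cluster)
    try:
        # A sample calculation: the mean of a large random array,
        # split into chunks that the workers process in parallel.
        x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        return float(x.mean().compute())
    finally:
        # Always release the cluster so it stops consuming credits.
        client.close()
        cluster.close()
```

Calling `run_quickstart()` starts a real cluster in your account, so it is worth keeping the worker count small while you are only checking that everything is wired up.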
Develop a meaningful representative example
While the quick start example is a great way to get started with Coiled, it’s likely that you want to run your own Dask code on Coiled and see how it performs at scale. You’re probably familiar with the importance of building a minimal working example when you encounter a bug or unexpected behavior with a particular tool.
Similarly, when working with customers on a proof of concept with Dask and Coiled, we’ve found that it’s useful to develop a meaningful representative example. A small example like this helps you understand how Dask and Coiled can speed up your data engineering pipeline or model training workloads, and it gets you a quick performance win without the overhead of starting with thousands of lines of code.
For example, you might have a huge Python codebase that makes up your entire data science pipeline for a given project. It can seem difficult to think about where to start. If you’re already using Dask in most of that code, then great! You should be able to add a couple of lines of code from our getting started guide and see what kinds of speedups you get from running on a large Dask cluster in Coiled.
If you are not already using Dask in that project or if it’s a work in progress, this is a good chance to profile your code and discover the most demanding parts of your pipeline. Maybe it’s loading large Parquet files from S3, or large filter and groupby operations, or hyperparameter optimization/tuning.
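One way to find the most demanding parts of a pipeline is Python’s built-in `cProfile` module. The sketch below is only an illustration: `load_data`, `transform`, and `pipeline` are hypothetical stand-ins for your own steps, and `profile_pipeline` is a small helper we made up for this post.

```python
import cProfile
import io
import pstats


def load_data():
    # Hypothetical stand-in for reading from your real data source.
    return list(range(200_000))


def transform(rows):
    # Hypothetical stand-in for an expensive filter/groupby step.
    totals = {}
    for value in rows:
        key = value % 10
        totals[key] = totals.get(key, 0) + value * value
    return totals


def pipeline():
    return transform(load_data())


def profile_pipeline(func, top=5):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive steps come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_pipeline(pipeline)
print(report)
```

The functions near the top of the report are usually the best candidates for your first experiment with Dask.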
Whatever your use case or bottleneck might be, you can build a meaningful representative example that loads some data from your most commonly accessed data source, performs some operations related to your use case, and writes the results as you normally would. Then try running this on progressively larger Dask clusters in Coiled (scaling up and down is easy in Coiled!) and observe the scaling behavior. Once this process works for your minimal example, you can approach other parts of your code until your entire project runs to your liking, then move on to other projects and use cases and do the same!
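To make this concrete, here is a minimal sketch of a representative example and a loop that times it on progressively larger clusters. Everything specific in it is an assumption for illustration: the column names `amount` and `customer_id`, the cluster sizes, the output path layout, and the helper names are hypothetical, and your own example should use your real data source and operations.

```python
import time


def run_example(source, destination):
    """Load data, run a filter and groupby, and write the results."""
    import dask.dataframe as dd

    df = dd.read_parquet(source)
    # Hypothetical columns: replace with ones from your own dataset.
    df = df[df["amount"] > 0]
    result = df.groupby("customer_id")["amount"].sum().to_frame()
    result.to_parquet(destination)


def time_call(func, *args):
    """Return the wall-clock seconds taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def time_across_cluster_sizes(source, destination, sizes=(10, 20, 40)):
    """Time run_example on one Coiled cluster scaled to each size."""
    import coiled
    from dask.distributed import Client

    timings = {}
    cluster = coiled.Cluster(n_workers=sizes[0])
    client = Client(cluster)
    try:
        for n in sizes:
            cluster.scale(n)
            # Wait until the requested workers have actually joined.
            client.wait_for_workers(n)
            # Write each run to its own directory so runs do not collide.
            output = f"{destination.rstrip('/')}/workers={n}"
            timings[n] = time_call(run_example, source, output)
    finally:
        client.close()
        cluster.close()
    return timings
```

The returned dictionary maps each cluster size to the elapsed time in seconds, which gives you a first rough picture of how the example scales.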
Understand your parallel performance and scaling behavior
Coiled can generate performance reports using the same functionality that Dask provides and then upload them to Coiled. This makes it easy to share performance reports with other members of your team or with our Dask experts at Coiled without having to email the report file around.
In working with customers, we’ve found that performance reports are a great way to get a detailed measurement of how well your code performs on a distributed Dask cluster and whether it scales as you would expect on clusters with 10 workers, 50 workers, 100 workers, and beyond. Performance reports are also a great way to work with our Dask experts here at Coiled to identify ways to improve the performance of your Dask code and how well it runs on Coiled.
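As a small illustration, the sketch below shows two pieces that often go together. `compute_with_report` uses the `performance_report` context manager from `dask.distributed` to save a local HTML report, and it assumes a Dask `Client` is already connected to a cluster. Coiled’s own upload step is not shown here. `scaling_table` is a hypothetical helper that turns timings into speedup and parallel efficiency relative to the smallest cluster. The timings in the example are made-up numbers, not measurements.

```python
def compute_with_report(collection, filename="dask-report.html"):
    """Compute a Dask collection while recording a performance report."""
    # Requires an active dask.distributed Client connected to a cluster.
    from dask.distributed import performance_report

    with performance_report(filename=filename):
        return collection.compute()


def scaling_table(timings):
    """Return (workers, seconds, speedup, efficiency) rows.

    Speedup and efficiency are measured against the smallest cluster.
    """
    base_workers = min(timings)
    base_seconds = timings[base_workers]
    rows = []
    for workers in sorted(timings):
        seconds = timings[workers]
        speedup = base_seconds / seconds
        # Efficiency of 1.0 means perfect scaling from the baseline.
        efficiency = speedup / (workers / base_workers)
        rows.append((workers, seconds, speedup, efficiency))
    return rows


# Made-up timings (workers -> seconds), for illustration only.
example_timings = {10: 600.0, 50: 150.0, 100: 100.0}
rows = scaling_table(example_timings)
for workers, seconds, speedup, efficiency in rows:
    print(f"{workers:>4} workers  {seconds:>7.1f} s  "
          f"speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

When efficiency drops well below 1.0 as you add workers, the performance report is a good place to look for the cause, such as time spent transferring data between workers.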
Explore other Coiled functionality
Did you know that Coiled can make use of your existing Docker images and make them available to all of the Dask workers in your Coiled cluster? Or that Coiled makes it really easy to work with GPUs? Or that you can configure Coiled to run entirely within your cloud account?
After you’ve worked with Coiled and run some Dask computations, this is a good time to explore some of the next steps in the Coiled documentation. In our experience, customers are always happy to discover many of our best practices and learn about what else they can do with Coiled!
Work together with your colleagues on a team
When you register for a free Coiled account, you’ll have your own personal account and some compute credits to try things out with. Since Coiled is better with friends and colleagues, this is a great point to reach out to us, and we’ll create a Team account for you where you can add other coworkers to your team, set team-level and user-level quotas, and work with shared Dask clusters, software environments, cluster configurations, and performance reports.
Working within a Team account is a great way to split up work and have different people work with different projects or even different components of a single project. In our experience, having more team members evaluating a proof of concept with Coiled is more meaningful since people tend to evaluate certain parts of the system that they are more familiar with, whether that’s the high-level workflow/pipeline orchestration, the low-level task executions, or benchmarking/optimizing for certain metrics such as cost or cluster size.
Configure Coiled to run on your own cloud account
You can configure Coiled to create and manage Dask clusters that run entirely within your own cloud account, such as your own AWS account. This lets your Dask clusters take advantage of the security and data access controls, compliance standards, and promotional credits that you already have in place within your cloud account.
When working with customers, we realize that most organizations have IT and InfoSec requirements and policies for running new services in their cloud accounts. While we are always happy to work with your IT, DevOps, and infrastructure team to review Coiled’s architecture and security model, you probably want to perform a smaller proof of concept before going through a formal security review of Coiled.
In this case, you can use your own AWS subaccount (separate from your production account) or even your personal AWS account and follow the same steps to configure Coiled to run there. This is a great way to review the necessary cloud permissions and observe the cloud resources that Coiled manages for you. It also makes it easier to explain or demonstrate Coiled to your own IT and InfoSec teams as you move through the architecture review and security/compliance checks.
Reach out and let us know how we can help
Whether you’re a team of 1, 10, 100, or more working with Dask and Coiled, we’re always excited to hear from users. Whether it’s getting a second opinion on your parallel performance from our Dask experts, or getting help running a large job on a Coiled cluster, we’re here to help!
In our experience working with customers, the best time to prototype or take Coiled for a test drive is not when you have things running perfectly on your end with Dask, but rather when you’re working with all of the different parts of your project and looking for quick wins and guidance on whether your parallel performance or scaling approach makes sense. Feel free to reach out to us with product feedback, support questions, or just let us know how things are going with Coiled!
Thanks for reading
Thanks for reading this post! We hope it helps you have an awesome start on your journey with Coiled. If you haven’t had the chance to give Coiled Cloud a try yet, you can do so for free.