How NextRoll cut costs by 70% switching from EMR to Coiled
Replacing Spark and Airflow with Dask, Coiled, and Dagster for efficient data pipelines
Introduction: Serving Ads at Scale
NextRoll is a San Francisco-based ad tech company helping approximately 10,000 advertisers create brand awareness through digital advertising. Their platform participates in around 50 billion ad auctions daily, with just 100 milliseconds to make decisions about which ads to show and how much to bid.
On any given day we are participating in around 50 billion auctions on the web. Every time a person visits a website, we have around 100 milliseconds to respond with which ad to show and how much to bid.
Asif Imran
Senior Staff Machine Learning Engineer, NextRoll
Behind this massive-scale operation is NextRoll's Creative team, responsible for programmatically assembling ads and providing product recommendations. They face substantial computational challenges, working with thousands of clients whose product catalogs contain millions of items, while processing billions of user interactions to deliver the right product at the right time.
Imagine you have thousands of clients, each with a product catalog of around a million products. That's a big cardinality. And what is a product? It's not just an image URL, but it has metadata—price, categories, different tags the customers are interested in. And then we have historical logs of user interaction data telling us what customers have viewed and when.
This data-intensive operation requires joining massive datasets, processing them efficiently, and delivering recommendations fast enough to meet the demanding latency requirements of real-time ad bidding.
The Painful Status Quo
Before adopting Coiled, NextRoll relied on a complex stack built around AWS EMR, Spark, and Airflow for their data processing pipeline. This created a significant disconnect between development and production environments.
A typical workflow would consist of writing code on a laptop, refining it, and then tossing it over the wall to productionalize it against a giant dataset. We were working in pure Python locally, but the production environment used PySpark with its own overhead and quirks.
For a small team of just five developers wearing many hats, this cognitive overhead was overwhelming. The most painful aspect was debugging production issues, especially during on-call hours:
The thing that really slowed us down was debugging during pager duty alerts in the middle of the night. The person on call might have done the prototyping in Python but had little experience with the Spark job that was eventually productionized. This naturally led to frustration.
The team realized they needed to simplify their approach and focus on what they were good at: Python data science. They were looking for a solution that would let them develop and deploy without having to master multiple technologies.
Finding a Better Way with Coiled
The NextRoll team was already familiar with Dask as a scaling solution for pandas, but they needed a managed solution for deploying Dask clusters in production. That's when they discovered Coiled.
Security was a paramount concern given the sensitive nature of their customer data. Coiled's "bring your own cloud" approach meant that data would never leave their AWS VPC.
Data privacy was non-negotiable for us—our customer data couldn't leave our AWS VPC. We discovered that Coiled's approach was to run the workload for us while keeping everything in our environment. We fell in love with that workflow.
This security model made organizational adoption straightforward. When they approached their DevOps team with Coiled's Terraform code, the approval process was smooth:
When we told our DevOps engineers what we were about to do, they just asked, 'Is there something we can study?' We gave them our Terraform code and they said, 'This looks great. Nothing is leaking. Everything is being done locally in our AWS data center. Go for it.'
The Coiled Transformation
The team started with a small prototype to evaluate Coiled's capabilities and was immediately impressed by how easy it was to spin up infrastructure:
We were in love with Coiled's way of spinning off infrastructure. It was just Python decorator magic. We'd say, 'Please give us one VM,' and immediately see it appear in our AWS console. When we said, 'Can we 5x this?' again in real-time, five machines would come up.
This speed contrasted sharply with their EMR experience, which typically took 8-10 minutes to provision resources. But perhaps more impressive was Coiled's environment synchronization feature, which automatically replicated their local development environment in the cloud:
Coiled would introspect what's installed in our virtual environment and make it available in the cloud. We were amazed by how little effort it took to make things work at scale.
The transition from pandas to Dask DataFrames felt natural, and when they encountered challenges, Coiled's support team was responsive with solutions and new features.
Building a Complete Data Platform
NextRoll didn't just replace their data processing framework—they transformed their entire data platform. Alongside implementing Dask and Coiled, they also replaced Airflow with Dagster for orchestration, creating a fully Python-native stack.
The combination of Dagster, Coiled, and Dask created a seamless experience that dramatically simplified their workflow. While their previous orchestration system felt like "a hodgepodge of bash scripts with a little bit of Python," the new stack allowed them to write consistent Python code across their entire pipeline.
You are writing native Python code all the way, be it on your Jupyter notebook, in your Python script, or when you put it all over on AWS. Nothing really changes, literally nothing. The only thing that changes is that, by the way, please use more than one machine or use really, really big machines. It forms a really nice triad: an orchestration framework, a framework to scale jobs, and a framework to process large data.
Coiled's flexibility also made it easy to implement best practices through simple configuration:
There's another feature we discovered quickly—Coiled allows us to implement all the best practices we were used to. Our SREs would be livid if we weren't propagating tags. With Coiled, it was just adding another keyword argument: 'Please use these tags and spot instances where it makes sense.'
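To make the "another keyword argument" point concrete, here is a minimal sketch of how tagging and spot usage might be collected into cluster options. The helper `cluster_options` and the tag names `team` and `env` are hypothetical and do not come from NextRoll's code. The keys `n_workers`, `tags`, and `spot_policy` are assumed to match the keyword arguments of `coiled.Cluster`; check them against the current Coiled documentation before relying on them.

```python
def cluster_options(team, env, n_workers, use_spot=True):
    """Build keyword arguments intended for coiled.Cluster(**options).

    Hypothetical helper: the tag keys are illustrative, and the option
    names are assumptions about the Coiled API, not taken from the source.
    """
    options = {
        "n_workers": n_workers,
        # Tags are propagated to the cloud resources so spend can be attributed.
        "tags": {"team": team, "env": env},
    }
    if use_spot:
        # Assumed policy name: prefer spot instances, fall back to on-demand.
        options["spot_policy"] = "spot_with_fallback"
    return options


if __name__ == "__main__":
    options = cluster_options("creative", "prod", n_workers=5)
    print(options)
    # With Coiled installed and configured, the options would be passed as:
    #   import coiled
    #   cluster = coiled.Cluster(**options)
```

Keeping the options in one helper means every job picks up the same tags and spot policy, which is one way a small team might avoid repeating them at each call site.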
Measurable Impacts: Cost Savings and Beyond
The switch to Coiled delivered immediate financial benefits:
Once we switched from AWS EMR to Coiled, we've seen our estimates drop by as much as 70%.
While impressive, this cost reduction wasn't actually the team's primary win. The biggest impact came from improved developer experience and operational stability.
Even when alerts did occur, troubleshooting became much simpler:
Reasoning about failures became much easier. Instead of esoteric Spark JVM dumps that you could barely understand, we were working with familiar Python exceptions. Anyone moderately familiar with Python could identify the problem immediately.
This improved reliability stemmed from increased confidence in deployments. The development-to-production gap had closed significantly:
When deploying to production, we had much more confidence because what was on our laptop didn't look starkly different from what was being deployed. We could stress test by simply cranking up the data size and number of VMs.
But the most transformative impact was on the team's ability to experiment and innovate:
The biggest win was that developer happiness increased dramatically. If my developers are happy, they're prototyping more, experimenting more, and playing with new ideas. When you're doing 100 experiments, inevitably 10 or 15 of them become products.
The reduction in friction made the team more likely to try new approaches rather than being paralyzed by implementation complexity:
In the past, we'd be like, 'I have to write this script and then turn it into a Spark job and set up infrastructure? Never mind.' That's absolutely a killer for innovation.
The Power of the Coiled UI
Beyond the core functionality, NextRoll found significant value in the Coiled dashboard for management and cost control. The UI provides real-time visibility into resource usage and costs, eliminating the guesswork that previously plagued their cloud spending.
The Coiled UI helps us keep tabs on our billing. At any moment, I can see how my teammates are using Coiled, and there's a column that literally says 'Here's the estimate for how much your workload costs.' In the past, we'd tear our hair out trying to estimate costs.
The team also leverages Coiled's integrated logging capabilities to simplify debugging and support:
We pipe our CloudWatch logs to Coiled, which helps us debug new workloads. If we run into an obscure issue, we can quickly call someone from Coiled, and it's easy for them to look up the logs themselves and identify the problem.
Looking Forward
With their new data platform in place, the NextRoll team continues to explore new possibilities with Coiled. They've already expanded beyond Dask DataFrames to leverage Dask Futures for web crawling:
We have a service that crawls all of our advertisers' websites, so it's just ginormous. The previous system was just struggling—it could not make head or tail of how to process this. But with Futures and Coiled, it was just this much code, and I was like, 'Wow!'
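The source does not show the crawler itself, so the following is only a sketch of the futures pattern it describes: submit one task per URL, then collect the results. To stay runnable without a cluster, it uses the standard library's `ThreadPoolExecutor` as a local stand-in. A `dask.distributed.Client` exposes a `submit` method with the same shape, returning futures that also have `.result()`, so the `crawl` function could in principle be handed a Dask client instead. The names `summarize_page` and `crawl` are hypothetical, and `summarize_page` deliberately performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def summarize_page(url):
    """Stand-in for a real crawl step: parses the URL, makes no network call."""
    parts = urlparse(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    return {"url": url, "host": parts.netloc, "depth": len(segments)}


def crawl(executor, urls):
    """Submit one task per URL and return the results in input order.

    `executor` only needs a `submit(fn, *args)` method that returns a
    future with `.result()`, which is the interface shared by
    ThreadPoolExecutor and (by assumption here) a Dask distributed Client.
    """
    futures = [executor.submit(summarize_page, url) for url in urls]
    return [future.result() for future in futures]


if __name__ == "__main__":
    urls = [
        "https://shop.example.com/products/shoes",
        "https://example.org/",
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for summary in crawl(pool, urls):
            print(summary)
```

Because each URL is an independent task, this shape scales out by adding workers rather than by restructuring the code, which fits the "it was just this much code" remark in the quote.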
They're also considering expanding Coiled usage to handle their product feed processing, which currently suffers from inefficient resource utilization:
We're experimenting with using Coiled to run jobs on demand, letting it terminate instances when they're done. That would allow us to scale out exactly as much as needed, when needed, without idle resources.
The most important thing is that developer experience should be delightful. If we are left dealing with infrastructure issues, we can't be prototyping new things or coming up with new ideas. Coiled and Dask allowed us to do just that. Experimentation is money.
Asif Imran
Senior Staff Machine Learning Engineer, NextRoll
By focusing on developer happiness and creating a seamless Python-native experience, NextRoll has transformed their data platform. The result is a more efficient operation with dramatically lower costs, fewer operational headaches, and a team empowered to experiment and innovate more freely.