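AWS computation costs roughly the following today:
- $0.04 per CPU core per hour
- $0.0125 per GiB of memory per hour
On top of that, different services charge a premium:
- Databricks: about 100%
- SageMaker: about 40%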
However, if you pre-commit to a large allocation, you can usually negotiate this premium down and get extra perks such as consulting hours.
So if you rent 100 machines, each with 4 cores and 16 GiB of memory, for an eight-hour workday and run Databricks on them, your costs look like the following:
- To AWS: 100 machines * 8 hours * (4 cores * $0.04 + 16 GiB * $0.0125) = $288
- To Databricks: … 100% markup … = $288
- Total: $576
Over roughly 260 working days, $576 per day comes to about $150k per year, which is roughly the cost of a well-paid data scientist.
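The arithmetic above can be reproduced with a short Python sketch. The rates and the 100% markup are the figures quoted in this post; the function name `daily_cost` and the 260 working days per year are assumptions made for illustration, not anyone's official pricing formula.

```python
# Rates and markup as quoted in this post (rough figures, not a price list).
CORE_HOUR_USD = 0.04       # per CPU core per hour
GIB_HOUR_USD = 0.0125      # per GiB of memory per hour
DATABRICKS_MARKUP = 1.00   # 100% premium on top of the infrastructure bill
WORKDAYS_PER_YEAR = 260    # assumption: 52 weeks * 5 working days


def daily_cost(machines, hours, cores, gib, markup):
    """Return (infrastructure, premium, total) in USD for one day."""
    infra = machines * hours * (cores * CORE_HOUR_USD + gib * GIB_HOUR_USD)
    premium = infra * markup
    return infra, premium, infra + premium


infra, premium, total = daily_cost(
    machines=100, hours=8, cores=4, gib=16, markup=DATABRICKS_MARKUP
)
annual = total * WORKDAYS_PER_YEAR

print(f"AWS: ${infra:,.0f}  premium: ${premium:,.0f}  total: ${total:,.0f} per day")
print(f"Annualized: ${annual:,.0f}")
```

Changing `hours` or `machines` shows how quickly the bill moves, which is the lever the next section is about.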
Reducing Cloud Costs
There are many things you could do to reduce these costs:
- Don’t leave your cluster on all day;
- Use spot/preemptible pricing and allow your worker machines to die if necessary;
- Profile your code or use smarter algorithms to reduce the need for so many machines;
- Use GPUs when they’re more cost effective (and don’t use them otherwise).
And yet leaving the cluster on all day is still common practice in many large companies today. This is because it's often awkward and slow for data scientists to release old resources and request new ones frequently during their analysis. They also usually aren't individually responsible for the costs.
So why haven’t companies made it easier to adaptively scale up and down and also use spot pricing by default?
Pricing Models Misalign Incentives
The standard way to price cloud compute today is a percentage tax on top of infrastructure costs, like the 100% premium from Databricks, or the 40% premium from SageMaker. This premium model incentivizes cloud-based companies to keep you running infrastructure, even when you don’t need it.
This is especially concerning for data science workloads, which are typically bursty: an analyst might use a large cluster for two minutes to load a dataset and produce a plot, and then stare at that plot for twenty minutes while the machines idle. This pattern is common and results in roughly 90% waste, and yet no one involved is incentivized to reduce that waste.
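To make the incentive problem concrete, here is a rough Python sketch of the bursty session described above. It reuses the 100-machine cluster and the 100% premium from the earlier example; the helper name `idle_fraction` is hypothetical, and the numbers are illustrative rather than measured.

```python
def idle_fraction(active_minutes, idle_minutes):
    """Share of a session during which the cluster is paid for but unused."""
    return idle_minutes / (active_minutes + idle_minutes)


# Two minutes of real work, twenty minutes of staring at the plot.
waste = idle_fraction(active_minutes=2, idle_minutes=20)

# Hourly infrastructure rate for 100 machines with 4 cores and 16 GiB each,
# using the rough rates quoted in this post.
infra_per_hour = 100 * (4 * 0.04 + 16 * 0.0125)
markup = 1.00  # 100% percentage premium

bill_per_hour = infra_per_hour * (1 + markup)
wasted_per_hour = bill_per_hour * waste
# Under a percentage premium the vendor is paid on idle time too.
vendor_revenue_from_idle = infra_per_hour * markup * waste

print(f"Idle share of session: {waste:.0%}")
print(f"Hourly bill: ${bill_per_hour:.2f}, of which idle: ${wasted_per_hour:.2f}")
print(f"Vendor premium earned on idle time: ${vendor_revenue_from_idle:.2f} per hour")
```

Under these assumptions most of the vendor's premium is earned while the machines do nothing, so shutting idle machines down would cut the vendor's revenue by the same share.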
There are many alternatives out there, all of which induce pathological behaviors.
- Price per node: charges you based on the number of machines on which you install the software.
Breaks because the whole point of the cloud is that you can burst out to many machines, and you don't want to be charged for that maximum.
- Price per user: charges you based on how many users have access to the system.
Breaks because it disincentivizes growth throughout the organization, and because teams eventually learn to ask one poor individual to submit all of the queries.
- Price per task run or compute unit.
Breaks because it encourages people to pack as much computation as possible into giant tasks, which results in poor performance and breakage.
We don't have a good solution to this problem. We're still working through it ourselves, looking for a model that both helps us earn money and aligns our incentives with our users rather than with the commercial cloud providers.
If you have thoughts we’d love to hear them. Feel free to get in touch.