DataFrames
Cloud data processing with Pandas or Polars. Even DuckDB. Anything but Spark.
Polars or DuckDB
Done before you can ask AI how to configure Spark.
Run Polars or DuckDB queries in the cloud.
- Run on a big cloud VM
- Run independent queries in parallel on many VMs
- Run on a schedule with Prefect or Dagster (see the scheduling sketch below)
import polars as pl
import coiled

# Specify size of machines and cloud region
@coiled.function(
    cpu=128,
    memory="512 GiB",
    region="eu-central-1",
)
def run_query(filename):
    q = (
        pl.scan_parquet(filename)
        .filter(pl.col("balance") > 0)
        .group_by("account")
        .agg(pl.all().sum())
    )
    return q.collect()

# Run queries in parallel on many machines
run_query.map(filenames)
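DuckDB follows the same pattern. Here's a rough sketch with DuckDB in place of Polars; the machine size, column names, and the run_duckdb_query function are illustrative, not a fixed recipe.

import coiled
import duckdb

# Same idea, but DuckDB runs the SQL on each machine
@coiled.function(
    cpu=32,
    memory="128 GiB",
    region="eu-central-1",
)
def run_duckdb_query(filename):
    # DuckDB reads the Parquet file directly and aggregates in SQL
    return duckdb.sql(
        f"""
        SELECT account, SUM(balance) AS balance
        FROM read_parquet('{filename}')
        WHERE balance > 0
        GROUP BY account
        """
    ).df()

# One file per machine, all in parallel
results = list(run_duckdb_query.map(filenames))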
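To run this on a schedule, wrap the call in a Prefect flow (Dagster works much the same way). A minimal sketch, assuming the run_query function above; the flow name, cron string, and bucket path are illustrative.

from prefect import flow

@flow
def nightly_report(filenames: list[str]):
    # Each file is processed on its own cloud VM via run_query above
    return list(run_query.map(filenames))

if __name__ == "__main__":
    # Serve the flow and run it every night at 2am UTC
    nightly_report.serve(
        name="nightly-report",
        cron="0 2 * * *",
        parameters={"filenames": ["s3://my-bucket/accounts/*.parquet"]},
    )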
Parallel Python with Dask
Your favorite Python libraries, at scale.
Churn through tabular data
- Load data from anywhere Pandas can
- Scale out to 100s of TiB
- Easily write custom logic on Pandas partitions (sketched below)
import coiled
import dask.dataframe as dd

cluster = coiled.Cluster(
    region="us-east-2",
    worker_memory="64 GiB",
)
client = cluster.get_client()

# Load Data
df = dd.read_parquet("s3://coiled-data/uber/")

# Query Data
df.base_passenger_fare.sum().compute()
df.driver_pay.sum().compute()
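Custom per-partition logic usually goes through map_partitions, where each partition is an ordinary pandas DataFrame. A rough sketch; the tips column and the derived tip_fraction are illustrative.

import dask.dataframe as dd

def add_tip_fraction(part):
    # `part` is a plain pandas DataFrame, so any pandas code works here
    part = part.copy()
    part["tip_fraction"] = part.tips / part.base_passenger_fare
    return part

df = dd.read_parquet("s3://coiled-data/uber/")
df = df.map_partitions(add_tip_fraction)
df.tip_fraction.mean().compute()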
Easy and familiar API
You already know pandas - this leverages your expertise.
Pandas
import pandas as pd
df = df[df.value >= 0]
joined = df.merge(other, on="id")
joined.groupby("id").value.mean()
Dask DataFrame
import dask.dataframe as dd
df = df[df.value >= 0]
joined = df.merge(other, on="id")
joined.groupby("id").value.mean().compute()
Faster than Spark
... and less painful too!
Dask DataFrame easily beats Apache Spark on standard benchmarks like TPC-H. And your sanity remains intact.
- Twice as fast, on average
- Doesn't require intense configuration
- Easier to debug (unless you love the JVM)

Delightful to use
These people said nice things about us, and we didn't even have to pay them.
"My team has started using Coiled this week. Got us up and running with clusters for ad hoc distributed workloads in no time."
Mike Bell
Data Scientist, Titan
"On my computer this takes days. Now it takes an hour. I had no experience with distributed systems."
Mohamed Akbarally
Data Scientist, With Marmalade
"I've been incredibly impressed with Coiled; it's quite literally the only piece of our entire ETL architecture that I never have to worry about."
Bobby George
Co-founder, Kestrel
"Coiled is natural and fun to use. It's Pythonic."
Lucas Gabriel Balista
Data Science Lead, Online Applications
FAQ
Is Dask DataFrame really just like pandas?
That's the promise, but it's mostly a lie.
Dask DataFrame is built from many pandas DataFrames and uses the same API, so it's very similar. In reality, though, distributed cloud computing gets complicated, and for full performance you'll run into differences.
Fortunately, Dask's dashboard is there to help you through this. And you're smart enough to debug it when needed - we give you the tools to see what's happening.
How much data can I process?
Several terabytes are easy. Hundreds are doable.
On the low end, if your data fits in memory, we recommend using Pandas or Polars. Don't scale if you don't need to. You're too smart to add complexity when it's not needed.
On the high end, Dask scales to 100s of thousands of cores and 100s of TiB of data. Above 10 TiB things get interesting and require tuning, but we're here to help. The hardest data problems need your expertise plus our infrastructure.
Can I run things other than Dask?
Coiled just turns on cloud machines. You can run whatever you like.
We started with Dask code, but the platform ended up being useful for lots of things.
You know which tool is right for your problem - we just help you scale it.
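For example, the same decorator works for any Python function, not just dataframe code. A toy sketch; the simulate function and its body are made up for illustration.

import coiled

@coiled.function(memory="64 GiB", region="us-east-2")
def simulate(seed):
    import random
    random.seed(seed)
    # ...any Python you like: scikit-learn, PyTorch, plain loops...
    return sum(random.random() for _ in range(1_000_000))

results = list(simulate.map(range(100)))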
Do you support Spark?
Actually, we do (albeit a little shamefully).
Again, Coiled can run anything. Here are the docs. We respect that you might have legacy Spark code that's still valuable.
Do you support SQL?
Not officially, but effectively yes.
It's not an official offering (there are many excellent solutions for cloud SQL today), but we do have customers who combine Coiled Batch with Trino with good results.
Get in contact if you'd like to learn more.
Get started
Know Python? Come use the cloud. Your first 10,000 CPU-hours per month are on us.
$ pip install coiled
$ coiled quickstart
Grant cloud access? (Y/n): Y
... Configuring ...
You're ready to go. 🎉