Process Large Datasets with Polars
Query 150GB of data in the cloud without managing infrastructure

Introduction#
Local machines often struggle with large datasets due to memory and network limitations. This example shows how to process a 150 GB dataset in minutes using Coiled Functions and Polars.
The calculation processes 150 GB of data in around 7 minutes, costing approximately $0.50. You can run it right now. You'll need the following packages:
pip install polars s3fs coiled
Full code#
Then run the code below. If you're new to Coiled, this will run for free on our account.
import polars as pl
import coiled

def load_data():
    # Define the S3 path and storage options and lazily scan the data
    s3_path = "s3://coiled-datasets/uber-lyft-tlc/*"
    storage_options = {"aws_region": "us-east-2"}
    return pl.scan_parquet(s3_path, storage_options=storage_options)

def compute_percentage_of_tipped_rides(lazy_df):
    # Compute the percentage of rides with a tip for each license number
    result = (
        lazy_df
        .with_columns((pl.col("tips") > 0).alias("tipped"))
        .group_by("hvfhs_license_num")
        .agg([
            (pl.sum("tipped") / pl.count("tipped")).alias("percentage_tipped")
        ])
    )
    return result.collect()

@coiled.function(
    vm_type="m6i.4xlarge",   # VM with 64 GB RAM
    region="us-east-2",      # AWS region to match the data location
    keepalive="5 minutes",   # keep the VM alive for potential subsequent queries
)
def query_results():
    # Load and process the dataset to compute the results
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)

# Run the query
result = query_results()
print(result)
Now that you've run the code, let's walk through it section by section to understand how it works.
The Problem#
Processing large datasets locally presents several challenges:
- Memory Limits: The 150 GB Uber-Lyft dataset exceeds most laptops' RAM
- Network Bottlenecks: Downloading large datasets is slow
- Data Egress Costs: Moving data from cloud storage to local machines incurs fees
- Infrastructure Management: Setting up powerful compute resources is complex
You could set up cloud infrastructure manually, but that requires configuring cloud instances, managing Python environments, handling data transfer, monitoring resource usage, and cleaning up when done.
Instead, we'll use Coiled Functions to run our Polars queries directly in the cloud where the data lives.
The Solution#
Our approach consists of three main parts:
1. Loading Data#
We use Polars' lazy evaluation to scan Parquet files from S3 without loading everything into memory at once:
def load_data():
    s3_path = "s3://coiled-datasets/uber-lyft-tlc/*"
    storage_options = {"aws_region": "us-east-2"}
    return pl.scan_parquet(s3_path, storage_options=storage_options)
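Because scan_parquet only builds a query plan, nothing is downloaded at this point. As a quick local sanity check (a minimal sketch; explain() is available on LazyFrame in recent Polars releases), you can print the optimized plan before any work is dispatched:

lazy_df = load_data()
# Nothing has been read yet; this only prints the query plan
print(
    lazy_df
    .with_columns((pl.col("tips") > 0).alias("tipped"))
    .explain()
)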
2. Query Processing#
We define a function that computes the percentage of rides with tips for each operator:
def compute_percentage_of_tipped_rides(lazy_df):
    result = (
        lazy_df
        .with_columns((pl.col("tips") > 0).alias("tipped"))
        .group_by("hvfhs_license_num")
        .agg([
            (pl.sum("tipped") / pl.count("tipped")).alias("percentage_tipped")
        ])
    )
    return result.collect()
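You can sanity-check this logic on a handful of rows before pointing it at 150 GB. A minimal sketch with made-up values (the toy frame and its numbers are illustrative, not taken from the real dataset):

import polars as pl

# Tiny stand-in for the real data, with only the two columns the query uses
toy = pl.LazyFrame({
    "hvfhs_license_num": ["HV0003", "HV0003", "HV0005"],
    "tips": [0.0, 2.5, 1.0],
})

print(compute_percentage_of_tipped_rides(toy))
# Expect HV0003 -> 0.5 and HV0005 -> 1.0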
3. Cloud Execution#
We use the @coiled.function decorator to run our query in the cloud:
@coiled.function(
    vm_type="m6i.4xlarge",   # VM with 64 GB RAM
    region="us-east-2",      # AWS region to match the data location
    keepalive="5 minutes",   # keep the VM alive for subsequent queries
)
def query_results():
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)
The decorator handles all infrastructure details:
- Starting a powerful cloud VM
- Running the VM in the same region as the data
- Setting up your Python environment
- Running your code
- Returning just the results
- Cleaning up automatically
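Coiled functions are called like ordinary Python functions, so you can also parameterize them. A hedged sketch (query_results_for and its path argument are hypothetical, not part of the example above; it assumes arguments are shipped to the remote VM as with a normal @coiled.function call):

@coiled.function(
    vm_type="m6i.4xlarge",
    region="us-east-2",
    keepalive="5 minutes",
)
def query_results_for(path):
    # `path` is a hypothetical parameter; any Parquet dataset with the
    # same columns as the Uber-Lyft data would work here
    lazy_df = pl.scan_parquet(path, storage_options={"aws_region": "us-east-2"})
    return compute_percentage_of_tipped_rides(lazy_df)

result = query_results_for("s3://coiled-datasets/uber-lyft-tlc/*")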
Results#
With this approach, we achieved impressive results:
- Processed 150 GB of data in ~7 minutes
- Ran in the same region as the data (no egress costs)
- Used a VM with sufficient RAM (no memory issues)
- Avoided all infrastructure management
The VM started in about 2 minutes, and the actual computation took about 5 minutes. The memory utilization graph showed efficient use of the available RAM.

Our query returned the percentage of rides that received tips for each service:
[
('HV0002', 0.08440046492889111),
('HV0003', 0.1498555901186066),
('HV0004', 0.09294857737045926),
('HV0005', 0.1912300216459857),
]
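Since query_results() returns an ordinary Polars DataFrame (the rows above show its contents), you can keep working with the result locally. A small sketch (sort and to_dicts are standard Polars DataFrame methods; the column name comes from the query above):

# Sort operators by tipping rate, highest first
print(result.sort("percentage_tipped", descending=True))

# Or convert to plain Python structures for downstream use
rows = result.to_dicts()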
Next Steps#
Here are a few ways you could extend this example:
- Try different VM sizes to see how they affect performance. For example, use a larger VM like m6i.8xlarge for even faster processing.
- Process different datasets by changing the S3 path in the load_data function.
- Modify the query to answer different questions about the data, such as analyzing ride patterns by time of day (see the sketch after this list).
- Use warmer VMs by increasing the keepalive parameter to avoid startup time for subsequent queries.
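As an example of the time-of-day idea, here is a sketch only (it assumes the dataset has a pickup_datetime column, which you should verify against the actual schema, and uses pl.len() from recent Polars releases):

def rides_by_hour(lazy_df):
    # Count rides per hour of day; assumes a `pickup_datetime` column exists
    result = (
        lazy_df
        .with_columns(pl.col("pickup_datetime").dt.hour().alias("hour"))
        .group_by("hour")
        .agg(pl.len().alias("num_rides"))
        .sort("hour")
    )
    return result.collect()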
For processing even larger datasets, you could scale up to a more powerful VM:
@coiled.function(
    vm_type="m6i.16xlarge",  # VM with 256 GB RAM
    region="us-east-2",      # Same region as the data
    keepalive="1 hour",      # Keep the VM alive longer
)
def big_query():
    # Process even larger datasets
    ...
Get started
Know Python? Come use the cloud. Your first 10,000 CPU-hours per month are on us.
$ pip install coiled
$ coiled quickstart
Grant cloud access? (Y/n): Y
... Configuring ...
You're ready to go. 🎉