Process NASA Earth Data
Analyze sea surface temperatures without downloading terabytes

Introduction#
NASA's Earth science repositories contain petabytes of climate and environmental data that traditionally require downloading before analysis—a slow and expensive process.
This example shows how to analyze sea surface temperature data directly in the cloud, examining temperature patterns in the Great Lakes region. By running your computation where the data lives, you'll eliminate hours of downloading time and drastically reduce costs.
The analysis processes 500GB of data in just 9 minutes instead of 6+ hours, at a fraction of the cost. You'll need the following packages:
pip install coiled earthaccess xarray numpy matplotlib
Full code#
Run this code to analyze sea surface temperature variations. If you're new to Coiled, this will run for free on our account.
import coiled
import os
import tempfile

import earthaccess
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

# Step 1: Set up Earthdata authentication
# You'll need a free NASA Earthdata account (https://urs.earthdata.nasa.gov/)
earthaccess.login()

# Step 2: Find the dataset files we want to analyze
granules = earthaccess.search_data(
    short_name="MUR-JPL-L4-GLOB-v4.1",  # Sea Surface Temperature dataset
    temporal=("2020-01-01", "2021-12-31"),  # Two years of data
)

# Step 3: Create a function to process each data file
@coiled.function(
    region="us-west-2",  # Run in the same region as data
    environ=earthaccess.auth_environ(),  # Forward Earthdata auth to cloud VMs
    spot_policy="spot_with_fallback",  # Use spot instances when available
    arm=True,  # Use ARM-based instances
    cpu=1,  # Use single-core instances
)
def process(granule):
    """Process a single data granule to extract Great Lakes temperature data"""
    results = []
    with tempfile.TemporaryDirectory() as tmpdir:
        files = earthaccess.download(granule, tmpdir)
        for file in files:
            ds = xr.open_dataset(os.path.join(tmpdir, file))
            # Select Great Lakes region by longitude/latitude
            ds = ds.sel(lon=slice(-93, -76), lat=slice(41, 49))
            # Filter for water temperature (exclude ice-covered areas)
            cond = (ds.sea_ice_fraction < 0.15) | np.isnan(ds.sea_ice_fraction)
            result = ds.analysed_sst.where(cond)
            results.append(result)
    return xr.concat(results, dim="time")

# Step 4: Run processing across all files in parallel
results = process.map(granules)

# Step 5: Combine results and visualize
ds = xr.concat(results, dim="time")

# Calculate temperature standard deviation across time
plt.figure(figsize=(14, 6))
std_temp = ds.std("time")
std_temp.plot(x="lon", y="lat", cmap="viridis")
plt.title("Standard Deviation of Sea Surface Temperature (2020-2021)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.savefig("great_lakes_sst_variation.png")
plt.show()
After running this code, we'll explore what happened and why cloud-based processing is transformative for Earth science data analysis.
The Problem#
NASA's Earth Observing System Data and Information System (EOSDIS) hosts petabytes of Earth science data—including measurements of sea surface temperatures, precipitation, atmospheric composition, and more. These datasets are invaluable for climate research, environmental monitoring, and disaster response.
The challenge? The Multi-scale Ultra-high Resolution (MUR) Sea Surface Temperature dataset we're using is about 500GB for the time period we selected. Downloading this much data would:
- Take 6+ hours on a typical internet connection
- Cost approximately $25 in AWS data egress fees
- Require sufficient local storage space
- Demand substantial memory for processing
Scientists often find themselves spending more time managing data than analyzing it. This problem is only growing as datasets increase in size and resolution.
Cloud-Native Data Processing#
Instead of downloading data to our laptop, we'll bring our computation to the data using Coiled. This cloud-native approach offers massive advantages:
# Define a function to process data files
@coiled.function(
    region="us-west-2",  # Run in the same region as data
    environ=earthaccess.auth_environ(),  # Forward Earthdata auth to cloud VMs
    spot_policy="spot_with_fallback",  # Use spot instances when available
    arm=True,  # Use ARM-based instances
    cpu=1,  # Use single-core instances
)
def process(granule):
    # Function body stays the same as it would locally
    ...
The @coiled.function decorator transforms our local function into one that runs in the cloud. Key cloud configuration details:
- Region: We run in us-west-2 because that's where NASA stores this dataset
- Authentication: We forward our Earthdata credentials to the cloud workers
- Cost optimization: We use spot instances, ARM processors, and right-sized VMs
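Because the decorated function keeps its ordinary Python call signature, a quick way to confirm everything is wired up is to run it on a single file before fanning out. Here's a minimal sketch, assuming granules is the list returned by the search step above:

# Sanity check: call the decorated function on one granule.
# This runs on a single cloud VM and returns an xarray object.
sample = process(granules[0])
print(sample)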
Parallelizing the Workload#
The real power comes when we run our function across many files in parallel:
# Traditional approach - slow serial processing
results = []
for granule in granules:
    result = process(granule)
    results.append(result)
# Cloud approach - fast parallel processing
results = process.map(granules)
With a single line change, we're now processing hundreds of files simultaneously. Coiled automatically:
- Spins up the right number of cloud VMs
- Distributes the workload
- Gathers the results
- Cleans up resources when done
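If you want to see the speedup for yourself, a rough timing sketch is below; the numbers will vary with cluster size, spot availability, and network conditions:

import time

# Wall-clock the parallel run across all granules
start = time.time()
results = list(process.map(granules))
print(f"Processed {len(granules)} granules in {time.time() - start:.0f} seconds")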
Analyzing the Results#
Once processing is complete, we combine the results from all files and calculate the standard deviation of sea surface temperatures over time:
# Combine all processed data
ds = xr.concat(results, dim="time")
# Calculate and plot temperature variation
plt.figure(figsize=(14, 6))
std_temp = ds.std("time")
std_temp.plot(x="lon", y="lat", cmap="viridis")
The resulting visualization reveals fascinating patterns in temperature variability across the Great Lakes:
Areas with higher standard deviation (brighter colors) indicate locations where temperature fluctuates more dramatically throughout the year.
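To put a number on the picture, here's a small follow-on sketch using the std_temp array from the code above. It finds the most variable grid cell; MUR's analysed_sst is stored in kelvin, so the standard deviation is the same number of degrees in Celsius:

# Locate the grid cell with the largest temperature variability
max_std = float(std_temp.max())
hotspot = std_temp.where(std_temp == std_temp.max(), drop=True)
print(
    f"Largest variability: {max_std:.2f} degrees at "
    f"lat={float(hotspot.lat[0]):.2f}, lon={float(hotspot.lon[0]):.2f}"
)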
Results#
Our cloud-native approach delivered impressive performance improvements: roughly 500GB of data processed in about 9 minutes instead of the 6+ hours a download-first workflow would require, at a fraction of the cost.
The dramatic improvements come from:
- Data-proximate computing: Running code where the data is stored
- Massive parallelization: Processing many files simultaneously
- Cost optimization: Using the right instances for the job
Next Steps#
Here are some ways you could extend this example:
- Analyze different NASA Earth datasets like atmospheric composition or precipitation
- Compare sea surface temperatures across different regions or time periods
- Identify correlations between sea surface temperature and other climate variables
- Implement more complex analyses like anomaly detection or trend analysis
- Scale to even larger datasets by adjusting the cluster size (a sketch follows this list)
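For that last item, one hedged sketch of scaling up is to reuse the decorator from earlier with larger and more numerous VMs. The memory and n_workers keyword arguments below are assumptions about what your Coiled version accepts, so check the coiled.function documentation before relying on them; the imports are the same as in the full example above:

@coiled.function(
    region="us-west-2",
    environ=earthaccess.auth_environ(),
    spot_policy="spot_with_fallback",
    arm=True,
    cpu=2,  # larger VMs for each task
    memory="8 GiB",  # assumed kwarg: per-VM memory request
    n_workers=100,  # assumed kwarg: cap on concurrently running VMs
)
def process(granule):
    ...  # same body as before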
You could also try processing different NASA datasets available in the cloud:
# Search for aerosol optical depth data
granules = earthaccess.search_data(
    short_name="MOD04_L2",
    temporal=("2020-01-01", "2020-01-31"),
)
Get started
Know Python? Come use the cloud. Your first 10,000 CPU-hours per month are on us.
$ pip install coiled
$ coiled quickstart
Grant cloud access? (Y/n): Y
... Configuring ...
You're ready to go. 🎉