How popular is Matplotlib?

Churn through 5TB of scientific articles in five minutes.

Introduction#

We search through every scholarly article published on arXiv (a repository of scientific papers) to track the adoption of Matplotlib over time.

The calculation processes 5TB of PDFs in five minutes, costing around $0.90. You can run it right now. You'll need the following packages:

pip install pandas matplotlib s3fs coiled

Full code#

Then run this. If you're new to Coiled, this will run for free on our account.

import coiled
import tarfile
import pandas as pd
import s3fs
import io


# Define function to query PDF directories in S3
@coiled.function(
    region="us-east-1",  # Local to data
    cpu=1,
    arm=True,
)
def extract(filename: str):
    """Scan one tarball of arXiv PDFs for Matplotlib output.

    Returns
    -------
    list of (filename: str, contains_matplotlib: bool) tuples,
    one per PDF in the tarball
    """
    out = []
    with s3.open(filename) as f:
        raw = f.read()  # read the whole tarball into memory
    try:
        with tarfile.TarFile(fileobj=io.BytesIO(raw)) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith(".pdf"):
                    data = tf.extractfile(member).read()
                    # Case-insensitive search for Matplotlib's name
                    out.append((
                        member.name,
                        b"matplotlib" in data.lower()
                    ))
    except tarfile.ReadError:
        pass  # skip objects that cannot be read as tar archives
    return out

# Get a list of all the directories in the S3 bucket
s3 = s3fs.S3FileSystem(requester_pays=True)

directories = s3.ls("s3://arxiv/pdf")

# Run in parallel on each directory
results = extract.map(directories)

# Post-process the results into a pandas DataFrame
lists = list(results)

# One small DataFrame per tarball ("rows" avoids shadowing the built-in list)
dfs = [
    pd.DataFrame(rows, columns=["filename", "has_matplotlib"])
    for rows in lists
]

df = pd.concat(dfs)

def date(filename):
    year = int(filename.split("/")[0][:2])
    month = int(filename.split("/")[0][2:4])
    if year > 80:
        year = 1900 + year
    else:
        year = 2000 + year

    return pd.Timestamp(year=year, month=month, day=1)

df["date"] = df.filename.map(date)

# Plot the results
import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

by_month = df.groupby("date").has_matplotlib.mean()
fig, ax = plt.subplots(layout="constrained")
ax.plot(by_month, "o", color="k", ms=3)

ax.set_xlim(left=datetime.date(2004, 1, 1))
ax.set_ylim(bottom=0)

ax.grid(axis="y")

ax.spines.right.set_visible(False)
ax.spines.top.set_visible(False)

ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))

ax.set_xlabel("date")
ax.set_ylabel("% of all papers")
ax.set_title("Matplotlib usage on arXiv")

# Display the figure when running as a script
plt.show()

Once you've run it, we'll walk through what happened, section by section.

The Problem#

Matplotlib is a popular Python library for creating visualizations, used everywhere from data analysis to scientific computing. We're curious how its adoption among scientific papers has changed over time.

arXiv is a public repository of scientific papers that, as of 2025, contains over 2.4 million papers. The PDFs are conveniently stored in an S3 bucket at s3://arxiv/pdf. Each file is a tarball of related PDFs, and together they total about 5TB of data.

Processing this amount of data locally would be extremely slow. Even with a fast internet connection, downloading 5TB would take days, and processing it on a single machine could take over a week. We need to move the computation to the cloud, close to where the data lives.
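As a rough check on the "days" estimate, here is a back-of-the-envelope calculation. The 100 Mbit/s link speed is an assumption chosen for illustration, not a figure from this example, so substitute your own.

```python
# Back-of-the-envelope download time for the arXiv PDF archive.
DATA_BYTES = 5e12             # ~5 TB, the figure quoted above
LINK_BITS_PER_SECOND = 100e6  # hypothetical 100 Mbit/s connection (assumption)

seconds = DATA_BYTES * 8 / LINK_BITS_PER_SECOND
days = seconds / 86_400       # 86,400 seconds in a day

print(f"{days:.1f} days")     # 4.6 days
```

A link ten times faster would still need about half a day just to download the data, before any processing starts.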

Getting the data#

We'll use the s3fs library to access the data in the S3 bucket.

# Get a list of all the directories in the S3 bucket
import s3fs
s3 = s3fs.S3FileSystem(requester_pays=True)

directories = s3.ls("s3://arxiv/pdf")

Each entry in directories is a tarball of related PDFs, which we can unpack with the tarfile library. We'll get to that in a moment.

Parsing code#

It's easy to see whether a particular PDF contains a Matplotlib image, because Matplotlib writes its name into the header of every image it creates. Lowercasing the PDF's bytes and searching for b"matplotlib" is enough.

So we can write a function, extract, that takes the S3 path of a tarball of PDFs and returns a list of (filename, boolean) pairs, where the boolean says whether that PDF contains a Matplotlib image.

# Define function to query PDF directories in S3
import io
import tarfile


def extract(filename: str):
    """Scan one tarball of arXiv PDFs for Matplotlib output.

    Returns
    -------
    list of (filename: str, contains_matplotlib: bool) tuples,
    one per PDF in the tarball
    """
    out = []
    # s3 is the S3FileSystem created in the previous section
    with s3.open(filename) as f:
        raw = f.read()  # read the whole tarball into memory
    try:
        with tarfile.TarFile(fileobj=io.BytesIO(raw)) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith(".pdf"):
                    data = tf.extractfile(member).read()
                    # Case-insensitive search for Matplotlib's name
                    out.append((
                        member.name,
                        b"matplotlib" in data.lower()
                    ))
    except tarfile.ReadError:
        pass  # skip objects that cannot be read as tar archives
    return out
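To try the scanning logic without touching S3, here is a minimal sketch that builds a small tarball in memory and runs the same byte search over it. The helper names make_tarball and scan_tarball, the file names, and the file contents are all made up for illustration. Real arXiv PDFs are far larger and are not guaranteed to look like these stubs.

```python
import io
import tarfile


def make_tarball(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar archive in memory from {name: content}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tf:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tf.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def scan_tarball(raw: bytes) -> list[tuple[str, bool]]:
    """Return (member name, contains b'matplotlib') for each PDF member."""
    out = []
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        for member in tf.getmembers():
            if member.isfile() and member.name.endswith(".pdf"):
                data = tf.extractfile(member).read()
                out.append((member.name, b"matplotlib" in data.lower()))
    return out


# Made-up file names and contents, for illustration only
fake = make_tarball({
    "0704/0704.0001.pdf": b"%PDF-1.4 /Creator (Matplotlib) ...",
    "0704/0704.0002.pdf": b"%PDF-1.4 /Creator (LaTeX) ...",
    "0704/notes.txt": b"matplotlib mentioned in a non-PDF file",
})
print(scan_tarball(fake))
# [('0704/0704.0001.pdf', True), ('0704/0704.0002.pdf', False)]
```

The text file is skipped because only members ending in ".pdf" are inspected, and the first PDF matches even though "Matplotlib" is capitalized, because the bytes are lowercased before the search.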

Setting Up the Cloud Hardware#

Running this code on one directory takes around a minute (mostly bound by local internet speeds). Running it across each of the thousands of directories would take around a week. We need to scale out to the cloud.
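The arithmetic behind that estimate is short enough to check directly. This sketch uses only the rough figures quoted in this article (about a minute per tarball, about a week in total, five minutes in parallel). It does not use the real number of tarballs in the bucket, which is not stated here.

```python
MINUTES_PER_WEEK = 7 * 24 * 60   # 10,080 minutes
minutes_per_tarball = 1          # "around a minute", from the text
parallel_runtime_minutes = 5     # the five-minute figure from the introduction

# How many one-minute tarballs fit into a week of serial processing
tarballs_in_a_week = MINUTES_PER_WEEK // minutes_per_tarball
# Speedup implied by going from a week to five minutes
speedup = MINUTES_PER_WEEK / parallel_runtime_minutes

print(tarballs_in_a_week)  # 10080
print(round(speedup))      # 2016
```

So "thousands of directories" and "around a week" are consistent with each other, and the parallel run is roughly a 2000x improvement in wall-clock time.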

We'll use Coiled's serverless functions to run our analysis on cloud machines in the AWS us-east-1 region, the same region as the data. This minimizes data transfer and maximizes performance.

# Decorate function to run on cloud hardware

@coiled.function(
    region="us-east-1",  # Co-locate with the data
    cpu=1,               # 1 CPU per function is sufficient
    arm=True,            # ARM processors offer good cost/performance
)
def extract(filename: str):
    # Function implementation as shown above
    ...

By specifying these hardware requirements, we ensure:

Our functions run in the same AWS region as the data, reducing latency
We use cost-effective single-core ARM processors for better value
Each task gets the right amount of resources for the job

When we call extract.map(directories), Coiled automatically scales up hundreds of workers to process all directories in parallel.

Parallelizing#

We could run this function locally in a loop, but that would be slow:

# Run in a loop (would take ~1 week)
results = []
for directory in directories:
    result = extract(directory)
    results.append(result)

Instead, we use Coiled's distributed capabilities to run hundreds of these functions in parallel:

# Run in parallel (takes ~5 minutes)
results = extract.map(directories)

This spins up several hundred machines that process the data simultaneously, reducing the runtime from a week to minutes.
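The loop-versus-map pattern can be tried on a laptop with the standard library. The sketch below is a local analogy only: it uses threads on one machine rather than Coiled, and process is a made-up stand-in for extract that sleeps to mimic waiting on I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process(item: int) -> int:
    """Stand-in for extract(): sleep briefly to mimic waiting on I/O."""
    time.sleep(0.05)
    return item * 2


items = list(range(20))  # stand-in for the list of tarball paths

# Serial loop: each call waits for the previous one to finish
start = time.perf_counter()
serial = [process(item) for item in items]
serial_seconds = time.perf_counter() - start

# Parallel map: all 20 calls wait at the same time
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    parallel = list(pool.map(process, items))  # results come back in input order
parallel_seconds = time.perf_counter() - start

print(serial == parallel)  # True
print(f"serial {serial_seconds:.2f}s, parallel {parallel_seconds:.2f}s")
```

The results are identical, but the parallel version finishes in roughly the time of one call instead of twenty. The same idea, spread across many machines instead of threads, is what turns a week into minutes.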

Post-processing results#

Now that we have a list of results, we can collect them into a single pandas DataFrame and add a date column parsed from each filename.

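# Post-process the results into a pandas DataFrame
import pandas as pd

# Gather the per-tarball result lists from the parallel run
lists = list(results)

# One small DataFrame per tarball ("rows" avoids shadowing the built-in list)
dfs = [
    pd.DataFrame(rows, columns=["filename", "has_matplotlib"])
    for rows in lists
]

df = pd.concat(dfs)


def date(filename):
    """Turn the leading YYMM directory of a filename into a month timestamp."""
    yymm = filename.split("/")[0]
    year = int(yymm[:2])
    month = int(yymm[2:4])
    # Two-digit years above 80 belong to the 1900s, the rest to the 2000s
    if year > 80:
        year = 1900 + year
    else:
        year = 2000 + year

    return pd.Timestamp(year=year, month=month, day=1)


df["date"] = df.filename.map(date)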
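To see the filename-to-date rule in isolation, here is a pandas-free sketch of the same logic using datetime.date. The function name month_from_member_name and the two sample names are hypothetical. They assume, as the code above does, that each name starts with a four-digit YYMM directory.

```python
import datetime


def month_from_member_name(name: str) -> datetime.date:
    """Map a name like '0704/0704.0001.pdf' to the first day of that month."""
    yymm = name.split("/")[0]
    year, month = int(yymm[:2]), int(yymm[2:4])
    # Two-digit years above 80 are read as 19xx, the rest as 20xx
    year += 1900 if year > 80 else 2000
    return datetime.date(year, month, 1)


print(month_from_member_name("0704/0704.0001.pdf"))      # 2007-04-01
print(month_from_member_name("9108/hep-th9108001.pdf"))  # 1991-08-01
```

The cutoff at 80 works because arXiv started in the early 1990s, so no two-digit year can be ambiguous for some time yet.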

Plotting results#

Now we can make a nice plot (using Matplotlib, of course).

# Plot the results
import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# fraction of papers each month that contain Matplotlib output
by_month = df.groupby("date").has_matplotlib.mean()

# create the figure and axes
fig, ax = plt.subplots(layout="constrained")
# plot one black dot per month
ax.plot(by_month, "o", color="k", ms=3)

# override the default auto limits
ax.set_xlim(left=datetime.date(2004, 1, 1))
ax.set_ylim(bottom=0)

# turn on a horizontal grid
ax.grid(axis="y")

# remove the top and right spines
ax.spines.right.set_visible(False)
ax.spines.top.set_visible(False)

# format y-ticks as percentages (the data are fractions between 0 and 1)
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))

# add title and labels
ax.set_xlabel("date")
ax.set_ylabel("% of all papers")
ax.set_title("Matplotlib usage on arXiv")

# display the figure when running as a script
plt.show()

This gives us our final image:

The results show that Matplotlib adoption has grown consistently since 2004, with usage in scientific papers increasing from nearly 0% to around 20% by 2024. This dramatic increase highlights Matplotlib's growing importance in scientific visualization and reproducible research.
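Each point in the plot is the fraction of that month's papers flagged as containing Matplotlib output, which is what df.groupby("date").has_matplotlib.mean() computes. The sketch below reproduces that calculation in plain Python on a handful of made-up rows. The values are illustrative and are not results from the arXiv run.

```python
from collections import defaultdict

# Made-up (month, has_matplotlib) rows, for illustration only
rows = [
    ("2007-04", False),
    ("2007-04", False),
    ("2007-04", False),
    ("2007-04", True),
    ("2024-01", True),
    ("2024-01", False),
]

# month -> [papers with Matplotlib, all papers]
counts = defaultdict(lambda: [0, 0])
for month, has_matplotlib in rows:
    counts[month][0] += has_matplotlib  # True counts as 1, False as 0
    counts[month][1] += 1

by_month = {month: hits / total for month, (hits, total) in counts.items()}
print(by_month)  # {'2007-04': 0.25, '2024-01': 0.5}
```

Taking the mean of a boolean column gives the same numbers, because True and False average as 1 and 0.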

Next Steps#

Here are some ways you could extend this example:

Try analyzing other libraries by searching for different keywords (e.g., "seaborn", "plotly", "ggplot")
Add more complex analysis to categorize papers by field and compare visualization library usage across disciplines
Extract more data from the PDFs, such as references to Python packages or programming languages
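As a starting point for the first suggestion, the byte search generalizes to several keywords at once. This is a minimal sketch: find_keywords is a hypothetical helper, the sample bytes are made up, and whether libraries other than Matplotlib leave a recognizable name in the PDFs they produce is an assumption you would need to verify. Seaborn, for example, draws through Matplotlib, so its figures may only carry Matplotlib's name.

```python
def find_keywords(data: bytes, keywords: list[bytes]) -> dict[str, bool]:
    """Report which keywords occur in a blob of bytes, ignoring case."""
    lowered = data.lower()
    return {kw.decode(): kw.lower() in lowered for kw in keywords}


# Keyword list taken from the suggestions above (assumed, not verified, signatures)
KEYWORDS = [b"matplotlib", b"seaborn", b"plotly", b"ggplot"]

sample = b"%PDF-1.4 ... /Producer (Matplotlib pdf backend) ..."  # made-up bytes
print(find_keywords(sample, KEYWORDS))
# {'matplotlib': True, 'seaborn': False, 'plotly': False, 'ggplot': False}
```

Inside extract, you could return this dictionary per PDF instead of a single boolean and expand it into one DataFrame column per keyword.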
