arXiv is the preeminent repository for scholarly preprint articles, especially in scientific fields like physics, mathematics, and chemistry. It hosts millions of preprints spanning the sciences. It's also publicly accessible, so given enough compute power we can scan the entire archive.
Watermark
Since the early 2010s, Matplotlib has embedded the bytes b"Matplotlib" in every PNG and PDF it produces. These bytes often persist in PDFs that contain Matplotlib plots, including the PDFs stored on arXiv. As a result, it's pretty simple to check whether a PDF contains a Matplotlib image: we just scan every PDF for these bytes; no parsing required.
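As a quick illustration of the byte-scan idea (the bytes below are a made-up stand-in for a real PDF; the producer string is just an example of how Matplotlib's watermark typically appears):

```python
# Stand-in for the raw bytes of a PDF produced with Matplotlib; real PDFs
# typically carry a producer string like "Matplotlib v3.x" somewhere inside
data = b"%PDF-1.4 ... /Producer (Matplotlib v3.5, https://matplotlib.org) ..."

# A case-insensitive scan of the raw bytes -- no PDF parsing needed
print(b"matplotlib" in data.lower())  # True
```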
The data is about 1TB in size. We're going to use Dask for this.
This is a good example of writing plain vanilla Python code to solve a problem, running into issues of scale, and then using Dask to easily jump over those problems.
Get all filenames
Our data is stored in a requester-pays S3 bucket in the us-east-1 region. Each file is a tar file that contains a directory of papers.
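A minimal sketch of listing those tar files with s3fs. The bucket path here is an assumption about the arXiv bulk-data layout, and running it requires AWS credentials since the bucket is requester-pays:

```python
def get_directories():
    """List the tar files of arXiv papers.

    The bucket path is an assumption about the arXiv bulk-data layout;
    the bucket is requester-pays, so AWS credentials are required.
    """
    import s3fs  # third-party: pip install s3fs

    s3 = s3fs.S3FileSystem(requester_pays=True)
    return s3.ls("s3://arxiv/pdf")
```

The resulting list plays the role of the directories variable used in the cells below.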
Mostly we have to muck about with tar files. This wasn't hard. The tarfile library is in the standard library. It's not beautiful, but it's also not hard to use.
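Here's a tiny self-contained round trip showing the tarfile mechanics that the extraction function below relies on (the archive contents are made up):

```python
import io
import tarfile

# Build a small tar archive in memory, standing in for one arXiv directory
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    payload = b"%PDF-1.4 fake paper produced with matplotlib"
    info = tarfile.TarInfo(name="0011/example.pdf")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
buf.seek(0)

# Read it back the same way the extraction function does
with tarfile.TarFile(fileobj=buf) as tf:
    for member in tf.getmembers():
        if member.isfile() and member.name.endswith(".pdf"):
            contents = tf.extractfile(member).read()
            print(member.name, b"matplotlib" in contents.lower())
```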
In [4]:
import tarfile
import io

def extract(filename: str):
    """Extract and process one directory of arXiv data

    Returns
    -------
    filename: str
    contains_matplotlib: boolean
    """
    out = []
    with s3.open(filename) as f:
        raw = f.read()
    with io.BytesIO() as bio:
        bio.write(raw)
        bio.seek(0)
        with tarfile.TarFile(fileobj=bio) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith(".pdf"):
                    data = tf.extractfile(member).read()
                    out.append((
                        member.name,
                        b"matplotlib" in data.lower()
                    ))
    return out
In [5]:
%%time
# See an example of its use
extract(directories[20])[:20]
We see that none of these files included a Matplotlib image. That's not surprising. The filenames start with "0011" which means year 2000, month 11. Matplotlib wasn't even around back then 🙂
Scale function to full dataset
Great, we can get a record of each file and whether or not it used Matplotlib. Each directory takes about a minute to process on my local machine, so processing all 5,000 files serially would take around 5,000 minutes, or roughly 83 hours.
We can accelerate this in two ways:
Process closer to the data by spinning up resources in the same region on the cloud (this also reduces data transfer costs)
Use hundreds of workers in parallel
We can do both easily with Dask (for parallel computing) and Coiled (for setting up Dask infrastructure in the cloud).
Create Dask Cluster
We start a Dask cluster on AWS in the same region where the data is stored.
We mimic the local software environment on the cluster with package_sync=True.
import coiled
from dask.distributed import Client, wait

cluster = coiled.Cluster(
    n_workers=100,       # hundreds of workers in parallel
    region="us-east-1",  # same region as the data
    package_sync=True,   # mirror the local software environment
)
client = Client(cluster)
Map function across every directory
Let's scale up this work across all of the directories in our dataset.
Hopefully it will also be faster because the Dask workers are in the same region as the dataset itself.
In [8]:
%%time
futures = client.map(extract, directories)
wait(futures)
# We had one error in one file. Let's just ignore and move on.
good = [future for future in futures if future.status == "finished"]
lists = client.gather(good)
Now that we're done with the large data problem we can turn off Dask and proceed with pure Pandas. There's no reason to deal with scalable tools if we don't have to.
In [9]:
# Scale down now that we're done
cluster.close()
Enrich Data
Let's enhance our data a bit. The filename of each paper includes the year and month when it was published. After extracting these dates we'll be able to see a time series of Matplotlib adoption.
In [10]:
# Convert to Pandas
import pandas as pd
dfs = [
    pd.DataFrame(records, columns=["filename", "has_matplotlib"])
    for records in lists
]
df = pd.concat(dfs)
df
In [11]:
def date(filename):
    year = int(filename.split("/")[0][:2])
    month = int(filename.split("/")[0][2:4])
    if year > 80:
        year = 1900 + year
    else:
        year = 2000 + year
    return pd.Timestamp(year=year, month=month, day=1)

date("0005/astro-ph0001322.pdf")
Out [11]:
Timestamp('2000-05-01 00:00:00')
Yup. That seems to work. Let's map this function over our dataset.
In [12]:
df["date"] = df.filename.map(date)
df
Now we can just fool around with Pandas and Matplotlib.
In [13]:
df.groupby("date").has_matplotlib.mean().plot(
title="Matplotlib Usage in arXiv",
ylabel="Fraction of papers"
).get_figure().savefig("results.png")
I did the plot above. Then Thomas Caswell (Matplotlib maintainer) came by and, in true form, made something much better 🙂
In [14]:
import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import pandas as pd
# read data
by_month = pd.read_parquet("results.parquet").groupby("date").has_matplotlib.mean()
# get figure
fig, ax = plt.subplots(layout="constrained")
# plot the data
ax.plot(by_month, "o", color="k", ms=3)
# over-ride the default auto limits
ax.set_xlim(left=datetime.date(2004, 1, 1))
ax.set_ylim(bottom=0)
# turn on a horizontal grid
ax.grid(axis="y")
# remove the top and right spines
ax.spines.right.set_visible(False)
ax.spines.top.set_visible(False)
# format y-ticks as percentages
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
# add title and labels
ax.set_xlabel("date")
ax.set_ylabel("% of all papers")
ax.set_title("Matplotlib usage on arXiv")
Out [14]:
Text(0.5, 1.0, 'Matplotlib usage on arXiv')
Yup. Matplotlib is used pretty commonly on arXiv. Go team.
Save results
This data was slightly painful to procure. Let's save the results locally for future analysis. That way other researchers can further analyze the results without having to muck about with parallelism or cloud stuff.
These are available at the github repository https://github.com/mocklin/arxiv-matplotlib if you want to play around with them.
Conclusion
Matplotlib + arXiv
It's incredible to see the steady growth of Matplotlib across arXiv. It's worth noting that this is all papers, even from fields like theoretical mathematics that are unlikely to include computer generated plots. Is this Matplotlib growing in popularity? Is it Python generally?
For future work, we should break this down by subfield. The filenames actually contained the name of the field for a while, like "hep-ex" for "high energy physics, experimental", but it looks like arXiv stopped doing this at some point. My guess is that there is a list mapping filenames to fields somewhere though. The filenames are all in the Pandas dataframe / parquet dataset, so doing this analysis shouldn't require any scalable computing.
Dask + Coiled
Dask and Coiled were built to make it easy to answer large questions.
We started this notebook with some generic Python code. When we wanted to scale up we invoked Dask+Coiled, did some work, and then tore things down, all in about ten minutes. The problem of scale or "big data" didn't get in the way of us analyzing data and making a delightful discovery.
This is exactly why these projects exist.
Future Work
There are many ways that this work could be extended (by you?)
We've made it easy to test all of the PDFs in arXiv. What else could we test for? Mentions of different libraries? The presence of some kind of file? Other plotting libraries?
Look at other pre-print servers like bioRxiv
This is a fun dataset representing the forefront of human science. It's now easy for us to inspect in its raw form. Fun!