Effective Data Storytelling for Larger-Than-Memory Datasets with Streamlit, Dask, and Coiled

Richard Pelgrim, July 14, 2021



tl;dr

Integrating Coiled and Streamlit allows you to create intuitive, interactive web applications that can process large amounts of data effortlessly. This blog post walks you through writing an integrated Coiled/Streamlit script to present 10+ GB of data in an interactive heatmap visualisation. We then expand the script to include: 1) a heavier groupby computation, 2) interactive widgets to scale the Coiled cluster up or down, and 3) an option to shut down the cluster on demand.

* * *

Streamlit makes it easier and easier for data scientists to build lightweight, intuitive web applications without writing any frontend code. You don’t even have to leave the friendly confines of Python; it’s that good! 

That said, working with a front-end solution like Streamlit can become cumbersome when the size of your dataset or computation increases beyond something that completes within a few seconds. It’s likely that you’re using Streamlit in the first place because you want to create a smoother, more intuitive user experience. And my guess is that having your user sit around for minutes (or hours!) while a computation runs doesn’t make it into your definition of “smooth”…

You could, of course, choose to pay for and manage expensive clusters of machines (either locally or on the cloud). But unless your Streamlit app is as popular as Netflix, it’s likely that your cluster will sit idle for long periods of time. That means time and money wasted. Not great, either!

Delegating your heavy compute to a Dask cluster could well be worth considering here. Coiled allows you to spin up on-demand Dask clusters in the cloud without having to worry about any of the DevOps work, like setting up nodes, security, scaling, or even shutting the cluster down. Joining forces in a single web app, Streamlit handles the frontend layout and interactivity of your application while Coiled sorts out the backend infrastructure for demanding computations. It’s a match made in heaven!

This blog post will show you how to build your very own Coiled + Streamlit integration. We’ll start with a basic script (from our own docs) that loads more than 10 GBs of data from the NYC Taxi dataset into an interactive user interface. From there, we’ll expand on that to really get the most out of Dask and Coiled by running an even heavier workload. Finally, we’ll tweak our Streamlit interface to allow the user to scale the cluster up and down using a simple slider and include a button to shut down the cluster, giving the user even more control over their computational power — without having to do any coding at all!

You can download the basic and final, extended Python scripts from this GitHub repo. To code along, you’ll need a Coiled Free Tier account, which you can set up using your GitHub credentials via cloud.coiled.io. Some basic familiarity with Dask and Streamlit is helpful, but not a must. 

You can also check out our video that walks you through using Coiled and Streamlit together by clicking below.

Visualizing a Larger-than-memory Dataset

The example script below uses Coiled and Streamlit to read more than 146 million records (10+ GB) from the NYC Taxi dataset and visualize the locations of taxi pickups and dropoffs. Let’s break down what’s happening in the script into five sections:

First, we import the Python libraries we need to run the script, in this case Coiled, Dask, Streamlit, and Folium.

import coiled
import dask
import dask.dataframe as dd
import folium
import streamlit as st
from dask.distributed import Client
from folium.plugins import HeatMap
from streamlit_folium import folium_static

In the next section, we create the front-end user interface with Streamlit. We start with some descriptive headers and text and then include two drop-down boxes to allow the user to select the kind of data they want to visualize.

# Text in Streamlit
st.header("Coiled and Streamlit")
st.subheader("Analyzing Large Datasets with Coiled and Streamlit")
st.write(
   """
   The computations for this Streamlit app are powered by Coiled, which provides on-demand, hosted Dask clusters in the cloud. Change the options below to view different visualizations of transportation pickups/dropoffs, then let Coiled handle all of the infrastructure and compute.
   """
)

# Interactive widgets in Streamlit
taxi_mode = st.selectbox("Taxi pickup or dropoff?", ("Pickups", "Dropoffs"))
num_passengers = st.slider("Number of passengers", 0, 9, (0, 9))

From there, we write the function that will spin up a Coiled cluster. This is where we specify the number of workers, the name of the cluster so we can reuse it later (this is crucial if you have multiple people viewing your Streamlit app), and the software environment to distribute to our scheduler and workers. See this page in our docs for more on how to set up software environments.

You can view any active and closed clusters, as well as your software environments and cluster configurations on the Coiled Cloud page (provided you’re signed in to your account).

# Start and connect to Coiled cluster
cluster_state = st.empty()

@st.cache(allow_output_mutation=True)
def get_client():
   cluster_state.write("Starting or connecting to Coiled cluster...")
   cluster = coiled.Cluster(
       n_workers=10,
       name="coiled-streamlit",
       software="coiled-examples/streamlit"
   )
   client = Client(cluster)
   return client


client = get_client()
if client.status == "closed":
   # In a long-running Streamlit app, the cluster could have shut down from idleness.
   # If so, clear the Streamlit cache to restart it.
   st.caching.clear_cache()
   client = get_client()
cluster_state.write(f"Coiled cluster is up! ({client.dashboard_link})")

Next, we load in the data from the public Amazon S3 bucket as a Dask DataFrame, specifying the columns we want to include and the blocksize of each partition. Note the call to df.persist() here. This persists the DataFrame on the cluster so that it doesn’t need to be reloaded every time the app refreshes. After this call, the dataset is available for immediate access as long as the cluster is running. 

# Load data (runs on Coiled)
@st.cache(hash_funcs={dd.DataFrame: dask.base.tokenize})
def load_data():
   df = dd.read_csv(
       "s3://nyc-tlc/trip data/yellow_tripdata_2015-*.csv",
       usecols=[
           "passenger_count",
           "pickup_longitude",
           "pickup_latitude",
           "dropoff_longitude",
           "dropoff_latitude",
           "tip_amount",
           "payment_type",
       ],
       storage_options={"anon": True},
       blocksize="16 MiB",
   )
   df = df.dropna()
   df = df.persist()  # persist() returns a new collection; reassign so the data stays on the cluster
   return df


df = load_data()

Finally, we use the input of the Streamlit widgets above to create a subset of the data called map_data and pass that to the Folium map, specifying that we want it displayed as a heatmap rendering.

# Filter data based on inputs (runs on Coiled)
with st.spinner("Calculating map data..."):
   map_data = df[
       (df["passenger_count"] >= num_passengers[0])
       & (df["passenger_count"] <= num_passengers[1])
   ]

   if taxi_mode == "Pickups":
       map_data = map_data.iloc[:, [2, 1]]
   elif taxi_mode == "Dropoffs":
       map_data = map_data.iloc[:, [4, 3]]

   map_data.columns = ["lat", "lon"]
   map_data = map_data.loc[~(map_data == 0).all(axis=1)]
   map_data = map_data.head(500)

# Display map in Streamlit
st.subheader("Map of selected rides")
m = folium.Map([40.76, -73.95], tiles="cartodbpositron", zoom_start=12)
HeatMap(map_data).add_to(folium.FeatureGroup(name="Heat Map").add_to(m))
folium_static(m)
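The `~(map_data == 0).all(axis=1)` filter above drops rows where every coordinate is zero, a common placeholder for missing GPS fixes in this dataset. A small pandas sketch with made-up coordinates shows what it keeps:

```python
import pandas as pd

# Made-up coordinates; the (0, 0) row stands in for a missing GPS fix
map_data = pd.DataFrame(
    {"lat": [40.76, 0.0, 40.71], "lon": [-73.95, 0.0, -74.00]}
)

# Keep rows where not all values are zero
cleaned = map_data.loc[~(map_data == 0).all(axis=1)]
print(len(cleaned))  # 2
```

Note that a row with only one zero coordinate would survive this filter; only rows that are zero across the board are dropped.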

That’s it! Let’s see what this looks like. 

Note that this is a stand-alone Python script that you can run from your terminal using streamlit run <path/to/file>  — and not from a Jupyter Notebook.

So go ahead and run that in your terminal…and in a matter of seconds, your browser should present you with an interactive interface like the one below.

Pretty amazing, right? Especially when you consider that with every refresh of the map, the app is processing over 146 million rows (that’s more than 10 GB) of data in the blink of an eye!

Populating the Dask Dashboard

Let’s now move on to see how Coiled handles even heavier workloads. If you happened to click the URL to the Dask Dashboard, you would’ve seen that the computations to generate the map were completed in just a few tasks. While Dask handles that without missing a beat, it is actually designed for distributed computing – and really comes into its own when there’s a large number of tasks for it to run through. So let’s give it a chance to shine, shall we?

We’ll create a new section in the script that allows the user to set up a groupby computation. We’ll give the user the option to choose which column to group by and which summary statistic to calculate, and include a button to trigger the computation.

# Performing a groupby
st.subheader(
   'Time for some heavier lifting!'
)

st.write(
   '''
   Let's move on to doing some heavier lifting to really see Dask in action. We'll try grouping a column and calculating a summary statistic for the tip amount.\n Select a column to group by below and a summary statistic to calculate:
   '''
)

# Interactive widgets in Streamlit
groupby_column = st.selectbox(
   "Which column do you want to group by?",
   ('passenger_count', 'payment_type')
)

aggregator = st.selectbox(
   "Which summary statistic do you want to calculate?",
   ("Mean", "Sum", "Median")
)

st.subheader(
   f"The {aggregator} tip_amount by {groupby_column} is:"
)

if st.button('Start Computation!'):
   with st.spinner("Performing your groupby aggregation..."):
       if aggregator == "Sum":
           st.write(
               df.groupby(groupby_column).tip_amount.sum().compute()
           )
       elif aggregator == "Mean":
           st.write(
               df.groupby(groupby_column).tip_amount.mean().compute()
           )
       elif aggregator == "Median":
           st.write(
               df.groupby(groupby_column).tip_amount.median().compute()
           )

Saving the Python script and rerunning streamlit run <path/to/file> in your terminal will load the updated version of the Streamlit app, like the one below. 

We can customize our groupby computation with our new drop-down options. Clicking the new button triggers some heavy computation on our Coiled cluster, calculating a summary statistic over 146 million rows in approximately 45 seconds. Impressive stuff!

Scaling and Shutting Down Your Coiled Cluster

But…if we’re being picky, using that entire cluster to generate the maps, which required just a handful of tasks, was a little overkill. And, at the other end of the spectrum, maybe you’re presenting this app to your overworked CEO right before an important board meeting, and the last thing you want is to have them stare at a spinning wheel while the groupby computation runs…for 45 seconds.

If only there was a way to scale our cluster up or down depending on our computation needs…!

With a simple call to the cluster’s scale() method, we can set the number of workers our cluster has. Note that we reconnect to the existing cluster by passing its name to coiled.Cluster() before calling scale() on it. Let’s go ahead and add a new section in our script where we attach that call to an interactive Streamlit slider. This means our user can now adjust their computational power as needed…right here in our web app, without having to write a single line of code! 

# Option to scale cluster up/down
st.subheader(
   "Scaling your cluster up or down"
)

st.write(
   '''
   By default, your Coiled Cluster spins up with 10 workers. You can scale this number up or down using the slider and button below.
   '''
)

num_workers = st.slider(
   "Number of workers",
   5,
   20,
   (10)
)

if st.button("Scale your cluster!"):
   coiled.Cluster(name='coiled-streamlit').scale(num_workers)

Note that while downscaling is instant, scaling a cluster up takes a minute or two. The good thing is that you can continue running your computation while the cluster scales. You can use the Coiled Cloud web interface to see how many workers your cluster currently has.
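You can also check the current worker count programmatically from the connected client. A minimal sketch using a local in-process Dask cluster as a stand-in (the assumption being that the same distributed API applies when the client is connected to a Coiled cluster):

```python
from dask.distributed import Client, LocalCluster

# Local, in-process stand-in for a Coiled cluster
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# scheduler_info() reports one entry per connected worker
n_workers_reported = len(client.scheduler_info()["workers"])
print(n_workers_reported)  # 2

client.close()
cluster.close()
```

This could be surfaced in the Streamlit app itself, e.g. via st.write(), to save the user a trip to the Coiled Cloud dashboard.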

Finally, let’s build in a button that allows the user to shut down the cluster to avoid unnecessary costs. Note that there’s a trade-off here: if you’re doing quick iterations of the Streamlit app, we recommend keeping the cluster running so you don’t have to wait for it to spin up every time you re-run the script. In this case, it’s important to name your cluster so that you can reference it in subsequent runs. If you’re all done for the foreseeable future, however, it’s good practice to shut the cluster down.

# Option to shutdown cluster
st.subheader(
   "Cluster Hygiene"
)

st.write(
   '''
   To avoid incurring unnecessary costs, click the button below to shut down your cluster.
   Note that this means that a new cluster will have to be spun up the next time you run the app.
   '''
)

if st.button('Shutdown Cluster'):
   with st.spinner("Shutting down your cluster..."):
       client.shutdown()

And we’ll just sneak in a pro-tip here: Coiled clusters by default shut down after 20 minutes of inactivity. You can tweak this by passing idle_timeout through the scheduler_options keyword argument to set your own preferred time-out window.

import coiled

cluster = coiled.Cluster(
    scheduler_options={"idle_timeout": "2 hours"}
)

Let’s Recap

We started out by running the Coiled + Streamlit example script from the Coiled Docs. We saw how quickly and effortlessly we were able to create an intuitive, interactive web application that could process 146 million rows of data. Next, we took this a little further and gave the users of our web app the ability to calculate a heavier computation on our Coiled cluster. We then supercharged our computation by building in an option to scale the cluster up (or down) as needed. Finally, we discussed when and how to shut down your cluster to avoid unnecessary costs.

We hope this blog post helps you create the impact you’re striving for in your data science workflows. If you have any questions or suggestions for future material that would be helpful, please do reach out to us on our Community Slack channel or by sending us an email at support@coiled.io. And as a reminder, you can get started with Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, for free today when you click below.

Try Coiled Cloud