Accelerating Data Science with BlazingSQL and Dask
September 6, 2020
Rodrigo and Felipe Aramburu, the brothers that lead BlazingSQL, recently joined us to discuss how they are empowering folks around the world to do GPU-accelerated data science in Python…with SQL!
“There’s an entire community of data analysts out there who need to become programmatically proficient for their long-term career trajectory. And for us, SQL is a great on-ramp for enabling these individuals to become dangerous and powerful relatively quickly.”
In this post, we’ll summarize the key takeaways from the stream. We cover:
- Their background
- SQL, DataFrames, and GPUs together!
- Big deal use cases
- Live coding: BlazingSQL pipelines through to data visualization
- Blazing, multi-GPUs, aaaaaand Dask!
You can find the code for the session at app.blazingsql.com in the intro_notebooks folder.
Introducing the BlazingSQL family!
Rodrigo (CEO) and Felipe (CTO) shared a little bit about themselves at the top of the stream:
- They’ve been into GPUs since “2014-ish” when they were consultants;
- Now they’re building this distributed SQL engine (BlazingSQL) that runs entirely in Python to help people accelerate their PyData analytics stack.
GPU-accelerated SQL in Python? Sounds ridiculous, right?
Felipe responded, “It is very easy to write SQL. Writing Python is wonderful and great but it’s a little bit harder. BlazingSQL opens up GPU-accelerated data science to more people. They can start in SQL and then when they want to do something a little bit fancy, they can move up to Python.”
SQL, DataFrames, and GPUs together!
BlazingSQL is a SQL engine on DataFrames.
So how does BlazingSQL do GPU-accelerated SQL in Python? In Rodrigo’s own words,
“BlazingSQL is a SQL engine on DataFrames. DataFrames are a really easy way to handle and manipulate a table of data inside memory in Python. And BlazingSQL allows you to run a SQL query on that DataFrame.”
The DataFrame that BlazingSQL runs on is cuDF, which is the CUDA DataFrame library inside RAPIDS. It’s a GPU-accelerated DataFrame and looks similar to pandas.
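Because the cuDF API tracks pandas so closely, the same DataFrame code often runs on either library. A minimal illustration, written with pandas so it runs anywhere; on a GPU machine with RAPIDS installed, the import swap noted in the comment is (for a simple snippet like this) the only change:

```python
import pandas as pd  # on a GPU machine: `import cudf as pd` runs the same code on-GPU

# A small toy table of taxi trips per city
df = pd.DataFrame({"city": ["NYC", "LA", "NYC"], "trips": [3, 1, 2]})

# Familiar pandas-style groupby/aggregation; cuDF supports the same call
totals = df.groupby("city")["trips"].sum()
print(totals["NYC"])  # → 5
```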
Rodrigo hopped into a quick example:
“We’re just reading a CSV into memory. So this DataFrame now exists inside GPU memory. We can create a table off of that with a name, and we can run whatever SQL query on it we want. It’s ANSI SQL compliant. Runs entirely within Python. It’s a Python package with low-level C and C++ bindings, which is why it’s so performant.”
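The flow Rodrigo describes can be sketched as follows. This assumes a GPU machine with RAPIDS cuDF and BlazingSQL installed; the file, table, and column names are hypothetical:

```python
import cudf
from blazingsql import BlazingContext

bc = BlazingContext()

# Read a CSV into GPU memory as a cuDF DataFrame (hypothetical file)
gdf = cudf.read_csv("taxi.csv")

# Register the DataFrame as a SQL table, then query it with ANSI SQL
bc.create_table("taxi", gdf)
result = bc.sql(
    "SELECT passenger_count, AVG(fare_amount) AS avg_fare "
    "FROM taxi GROUP BY passenger_count"
)
# `result` is itself a cuDF DataFrame, still in GPU memory
```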
“Now we can extend that where maybe we’re not creating a SQL table off an in-memory GPU DataFrame. Maybe we’re creating it off a Dask-cuDF, so a distributed or partitioned cuDF. Maybe we’re creating it off of a pandas DataFrame, or a Dask DataFrame on pandas. Maybe you want to go all the way down to files, like Apache Parquet files, CSV files, what have you. And maybe those files don’t exist inside the server or computer that you’re running on; maybe they exist in a data lake. So how can we create a SQL engine that runs directly off of files and in-memory DataFrame representations, with really high-performance query execution, in Python?”
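The sources Rodrigo lists all go through the same table-creation call. A sketch, assuming a GPU environment with BlazingSQL installed; every table and variable name below is hypothetical:

```python
from blazingsql import BlazingContext

bc = BlazingContext()

# One create_table call, several kinds of sources:
bc.create_table("t1", gpu_df)                    # an in-memory cuDF DataFrame
bc.create_table("t2", pandas_df)                 # a pandas DataFrame
bc.create_table("t3", dask_cudf_df)              # a distributed Dask-cuDF DataFrame
bc.create_table("t4", "/data/trips/*.parquet")   # Parquet/CSV files on disk or in a data lake
```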
Matt Rocklin hopped in, “One of the common requests we get for Dask is, ‘Hey, do you support SQL? I love that [with Dask] I can read in some data from my S3 bucket, that I can do some custom Python manipulation, but then I want to hand it off to a SQL engine and send that to Tableau or something.’ And my answer has always been, ‘No, there is no good SQL system in Python.’ But now there is—if you have GPUs.”
Big deal use cases
That’s where our engine starts making a lot of sense.
Felipe shared two exciting customer use cases:
- Supercomputing. “A researcher at the Summit supercomputer is accelerating metadata research with us. If you have terabytes of metadata, sometimes you just want to look at it and explore it. So BlazingSQL is compelling for interactive, 3-5 second queries over terabytes of data for exploratory workflows.”
- Giving data scientists ownership of production pipelines. “What if you could take a PyData workflow and have it scale not to a few gigabytes, but to a few terabytes, so it can be put into production? This is what one of our retail customers is using us for. How can I put the ownership of the production pipelines on the data scientists, as opposed to having them hand it over to a data engineering team that then needs to repurpose it in Apache Spark or something, where the connection between the data scientist and the production pipeline is entirely lost? What if you could just work interactively off of those large-scale datasets, producing performant code that can then be deployed into production? That’s where our engine starts making a lot of sense.”
The second bullet is a trend the Coiled team is seeing too, most recently in our live stream and blog post with Alex Egg from Grubhub.
Live coding: BlazingSQL pipelines through to data visualization
Even if I don’t know pandas, all of a sudden Python and [data visualization] is opening up to me because I know SQL.
Rodrigo started, “One of the problems we’re seeing for people wanting to use GPUs is actually gaining access to a GPU. A lot of people might not have one inside their laptop and they might not know how to spin one up in the cloud. So we created a simple free environment that anyone can use at app.blazingsql.com.” You can use it to follow along. They start in the data_visualization.ipynb notebook in the intro_notebooks folder.
- “I’m going to show a BlazingSQL query and how you can pipe it into Datashader, a data viz tool.”
- “If you’re familiar with PySpark, this context is probably pretty familiar.”
- “BlazingSQL is not a data warehouse. It’s not a DBMS. It doesn’t know things when you install it and start it up. We have to create this context which is going to maintain an awareness of what tables are called, where they are located, and what storage plugins or systems that we might be connecting to.”
- “Here is a quick example of how you can register a storage plugin. Using the BlazingContext that I’ve created, I register an S3 bucket. This is a totally public S3 bucket, and I’m going to create a BlazingSQL table on that bucket that references a file inside that S3 bucket.”
- “On the right side of JupyterLab, I have my GPU resources showing what’s going on. When I created the table, we saw a quick little spike. That just means it read in the metadata really fast, and it now understands the schema, the row groups, and what’s associated with the table.”
- “Now, this query here gives me a cuDF DataFrame. And once I have that, I can start running pandas-like functions and methods on top of that DataFrame.”
- “Here is where things start to get interesting. Because cuDF is built on these open-source standards, we can very easily pass things from cuDF to pandas because it’s all built on top of Apache Arrow. I can take this cuDF and run `to_pandas` and now I have a pandas DataFrame. So maybe I want to rely on BlazingSQL and cuDF to do the heavy lifting of the ETL and data manipulation, but I want to be able to plug into something that understands pandas but doesn’t necessarily understand cuDF. If I want to plot with matplotlib, I can do `.plot()` as if it’s a pandas DataFrame (because it literally is one).”
- “Now I’m running SQL queries to get my viz. Even if I don’t know pandas, all of a sudden Python and viz is opening up to me because I know SQL.”
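The steps Rodrigo walks through above can be pulled together into one short pipeline. This is a sketch assuming the app.blazingsql.com GPU environment; the bucket name, file path, and column names are illustrative, not the exact ones from the notebook:

```python
from blazingsql import BlazingContext

bc = BlazingContext()

# Register a public S3 bucket as a storage plugin, then build a table
# directly on a file inside it (bucket and path are illustrative)
bc.s3("bsql_data", bucket_name="some-public-bucket")
bc.create_table("taxi", "s3://bsql_data/taxi/taxi.parquet")

# Run SQL; the result comes back as a cuDF DataFrame
gdf = bc.sql(
    "SELECT passenger_count, COUNT(*) AS n FROM taxi GROUP BY passenger_count"
)

# Hand off to the pandas world (cheap, since both sit on Apache Arrow)
pdf = gdf.to_pandas()
pdf.plot(x="passenger_count", y="n", kind="bar")  # matplotlib, via plain pandas
```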
Blazing, multi-GPUs, aaaaaand Dask!
You pass Dask to BlazingSQL so BlazingSQL knows where the workers are.
Coiled, among other things, provides hosted and scalable Dask clusters. Here’s how Rodrigo leverages Dask:
- “This is how you get BlazingSQL in a multi-GPU (i.e., distributed) context. The most important thing is highlighted at the bottom. When you instantiate your BlazingContext, if you pass it a Dask client and that Dask client is aware of GPU workers across a cluster, then Dask will actually be what launches BlazingSQL workers across that entire cluster.”
- “This is an example using a local CUDA cluster, which finds all GPUs on a server. In this case it’s one, but if there were four, it would have found four workers. And that’s it! This is a multi-GPU and potentially multi-node Dask cluster that you can start interacting with.”
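The multi-GPU setup described above can be sketched like this, assuming a machine or cluster with dask-cuda and BlazingSQL installed; the table name and path are hypothetical:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from blazingsql import BlazingContext

# Find all GPUs on this machine and start one Dask worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Passing the Dask client makes BlazingSQL launch a worker
# alongside each GPU worker in the cluster
bc = BlazingContext(dask_client=client, network_interface="lo")

bc.create_table("taxi", "/data/taxi/*.parquet")  # hypothetical path
result = bc.sql("SELECT COUNT(*) AS n FROM taxi")
```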
Rodrigo: “That’s exactly right. You pass Dask to BlazingSQL so BlazingSQL knows where the workers are.”
Matt: “You will never know the pain of creating software that interacts with every single possible cluster management system.” Note: Matt is a core developer of Dask.
Rodrigo: “Oh we do! That’s why we didn’t do it! Before BlazingSQL, we were building an old-school DBMS called BlazingDB. One time I told Felipe, ‘For this to be effective, I need to be able to deploy a BlazingDB cluster on my own relatively easily.’ And his response was, ‘You will never be able to deploy it. This is only for the hardcore of the hardcore.’ So yeah, we appreciate the work y’all do at Dask.”
BlazingSQL’s vision, in Rodrigo’s words:
“The PyData ecosystem is empowering for data scientists, but there’s an entire community of data analysts out there who are becoming programmatically proficient, or need to, for their long-term career trajectory. And for us, SQL is a great on-ramp for enabling these individuals to become dangerous and powerful relatively quickly, as opposed to having to figure out a whole new paradigm of data manipulation, which can be pretty vexing if you’ve never seen it before.”
We agree. We’re grateful for Rodrigo’s and Felipe’s time.
Do you need to do Data Science at Scale?
If you need to do data science at scale, you too can get started on a Coiled cluster immediately. Coiled also handles security, conda/Docker environments, and team management, so you can get back to doing data science. Get started for free by joining our Beta.