Data scientists are increasingly using Python and its ecosystem of tools for their analysis. Combined with the growing popularity of big data, this brings the challenge of scaling data science workflows. Dask is a library built for this exact purpose: making it easy to scale your Python code and serving as a toolbox for distributed computing!
In this post, we will discuss Python's rise in data science, the need for scalable computing, and how Dask and Coiled address it.
Learn more about Python, Dask, and scalable compute in our comprehensive guide: Scaling Data Science in Python with Dask.
Hugo Bowne-Anderson, Head of Data Science Evangelism and Marketing at Coiled, introducing the new guide, Scaling Data Science in Python with Dask.
Python’s wide adoption can be attributed to a few key factors. It has made data science more accessible to everyone from researchers to analysts to students, and the scientific Python community is growing rapidly, so there is also a growing pool of resources to support new users!
Recent advancements in technology, algorithms, and computational power have led to a “big data revolution”. Data scientists require scalable solutions to work with these large and complex datasets, and they look towards parallel and distributed computing.
Scalable computing can be anything that extends beyond a single thread on a single core. Parallel computing is the process of computing multiple tasks simultaneously to accelerate workflows, and distributed computing is the process of leveraging computing resources of multiple machines. Learn more about these concepts in Scaling Data Science in Python with Dask!
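To make the idea of parallel computing concrete, here is a minimal sketch using Python's standard library: a pool of worker threads runs independent tasks concurrently instead of one after another. The function name `slow_square` and the pool size are illustrative choices, not anything from the guide.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # Stand-in for an expensive computation (I/O, a model fit, etc.)
    return x * x

# Run the tasks concurrently across a pool of worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_square, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Distributed computing takes the same idea one step further, spreading the tasks across the cores of many machines rather than the threads of one.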
Another important concept in scalable computing is cloud computing. Providers like AWS, Azure, and GCP offer vast computing resources that can be accessed from anywhere. Cloud computing involves leveraging these resources to perform big computations.
Dask is a library for parallel and distributed computing in Python. Dask makes it easy to use all the resources on your local machine, set up distributed computing environments, and scale to the cloud. Its familiar API, flexible design, and synergy with the Python ecosystem make Dask the tool of choice for both individuals and institutions.
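As a small taste of that flexible design, here is a sketch using `dask.delayed`, one of Dask's core interfaces (assuming Dask is installed, e.g. `pip install dask`). Ordinary functions become lazy tasks that Dask assembles into a graph and then executes in parallel:

```python
import dask

# dask.delayed turns an ordinary Python function into a lazy task
@dask.delayed
def double(x):
    return 2 * x

# Build a task graph; nothing runs until .compute() is called
tasks = [double(i) for i in range(5)]
total = dask.delayed(sum)(tasks)

result = total.compute()  # executes the independent tasks in parallel
print(result)  # 20
```

The same code runs unchanged on a laptop's threads or on a distributed cluster; only the scheduler behind `.compute()` changes.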
Dask is used by industry leaders like Walmart and Capital One, government agencies like NASA and the UK Met Office, and many others to scale their data science and machine learning pipelines. Dask is also used behind the scenes in many popular tools, including NVIDIA RAPIDS, Apache Airflow, and PyTorch!
Dask is a very powerful library, but scaling Dask computations to the cloud involves many DevOps challenges like networking, security, and environment configuration. Coiled Cloud is a product built by the creators and maintainers of Dask that takes care of these challenges, so that data scientists can focus on their analysis. Coiled lets you scale to the cloud in just one click, while also providing essential features for team and cost management.
If you’re getting started with scalable computing, want to learn more about Dask and Coiled, or if you’re just curious about high-performance computing, check out our complete guide by clicking the link below!