How to Process Millions of Images with Dask and GPUs
March 14, 2022
Image processing is a data-intensive task that requires significant computational resources. Many data science and machine learning teams working with large imagery datasets lose valuable time waiting for their analyses to run. This is time that could be spent interpreting the results and creating value.
Coiled enables data teams to process millions of images in parallel, significantly reducing the time it takes to generate valuable insights for your business.
Automate Image Processing for a 264x Speed-Up
Machine learning and AI are gaining rapid traction across a variety of sectors, from healthcare to geospatial and biosciences. Image processing plays an important role in these fields, and the volume of available imagery from sources like microscopy and satellite observation is growing exponentially. This imagery contains valuable information, but organizations’ infrastructure often can’t keep up with the exploding data volume: they face a scalability problem.
Coiled customer Nanox Vision (formerly Zebra Medical) performs medical image processing for disease discovery. Their AI-powered computer vision software scans medical images and automatically detects undiagnosed medical conditions. At first, the team relied on data analysts to manually identify patterns in the images, but this soon became untenable as the amount of data increased. They needed a more scalable approach.
Zebra Medical transitioned to using Dask on Coiled to implement machine learning models and automate training and validation of these models on large volumes of microscopy images. Read the full case study to learn how Zebra Medical achieved a 264x speed-up of their processing pipeline using Coiled.
Scale Image Processing to the Cloud
Python has become a de facto language of data science and machine learning over the past decade, but concerns have persisted about its scalability. Companies often face an apparent dilemma: they want to keep working in Python, the language their developers and analysts are most comfortable and productive with, but also scale to datasets containing millions of images. Coiled resolves this dilemma by using Dask to scale the Python data science libraries.
Coiled customer Development Seed processes satellite imagery for NASA on the order of a million images in 90 minutes. Working together with Coiled, Development Seed scaled their pipeline to process millions of images using Dask clusters with GPUs in the cloud.
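The core pattern behind this kind of scaling can be sketched with Dask's `delayed` API: build a lazy task graph over the whole batch of images, then compute the tasks in parallel. The sketch below is hypothetical and self-contained — `score_image`, the file names, and the fake score are placeholders for real model inference, and it runs on Dask's default local scheduler; on Coiled you would attach a `dask.distributed.Client` to a cloud cluster instead, without changing the task code.

```python
import dask
from dask import delayed

# Hypothetical per-image scoring function. In a real pipeline this would
# load the file and run a model; here it just derives a fake score so the
# example is self-contained.
@delayed
def score_image(path):
    return {"path": path, "score": hash(path) % 100}

# Placeholder file names standing in for a real image manifest.
paths = [f"image_{i:04d}.png" for i in range(1000)]

# Each score_image(p) call is lazy; dask.compute executes the whole
# task graph in parallel and returns the concrete results.
results = dask.compute(*[score_image(p) for p in paths])
print(len(results))  # prints 1000
```

Because the work is expressed as independent tasks, the same code scales from a laptop to a thousand-worker cluster simply by pointing Dask at a different scheduler.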
GPU clusters require custom configuration based on workloads and use case. Book a meeting to get set up with your own GPU cluster.
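As a rough illustration, provisioning a cluster from Python looks like the following. This is a configuration sketch, not a working recipe: the worker count and instance type are made-up examples, exact parameter names vary across Coiled versions, and a Coiled account with cloud credentials is required, so consult the Coiled documentation for the options that apply to your setup.

```python
import coiled
from dask.distributed import Client

# Sketch only: n_workers and the GPU instance type below are illustrative
# values, not a recommendation for any particular workload.
cluster = coiled.Cluster(
    n_workers=10,
    worker_vm_types=["g4dn.xlarge"],  # example AWS GPU instance type
)

# Connect a Dask client so subsequent Dask computations run on the cluster.
client = Client(cluster)
```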
Build a Python Image Processing Architecture
The plethora of libraries and tools in the Python ecosystem can be overwhelming. Companies spend valuable time experimenting with pipeline architectures, and implementations often vary within a company, making cross-team collaboration difficult and cumbersome. Coiled’s team has decades of combined experience working with and building the foundational tools of the PyData ecosystem, and supports data science teams in designing image processing architectures tailored to their specific use cases and needs.
Concerto Biosciences handles time-sensitive imagery data from scientific laboratories. Their analytics team runs multiple data pipelines each week, and the turnaround time of the processing is crucial for the scientists who need to interpret and use the results. Using Dask, Concerto reduced their pipeline runtime by 75%. The team at Coiled shared their extensive expertise in Python data science tooling to help Concerto build out a complete image processing architecture, including workflow orchestration, cloud computing, and efficient data transfer.
Python Image Processing at Scale
Try Coiled Cloud for free to start scaling your Python image processing pipelines to the cloud: Get Started
Once you’re set up, check out our Dask-PyTorch tutorial for a step-by-step guide on using Dask for scaling PyTorch image-scoring pipelines.