Why You Should Save NumPy Arrays with Zarr

Richard Pelgrim January 5, 2022



This post tells you why and how to use the Zarr format to save a NumPy array. It walks you through the code to read and write a large NumPy array in parallel using Zarr and Dask.

Here’s the code if you want to jump right in. If you have questions about the code, reach out to us on Slack.

Common Ways to Save NumPy Arrays

Three common ways to store a NumPy array are as a .csv file, a .txt file, or a .npy file. Each of these has important shortcomings:

  • CSV and TXT files are human-readable formats that can’t store NumPy arrays with more than 2 dimensions.
  • The native NPY binary file format doesn’t support parallel read/write operations.

Let’s see an example of this in action below.

We’ll start by creating a dummy NumPy array as our input data. We’ll use np.random.rand to generate two arrays populated with random numbers: a small array array_XS with 2 dimensions and a larger array array_L with 3 dimensions.

import numpy as np
array_XS = np.random.rand(3,2)
array_L = np.random.rand(1000, 1000, 100)

You can store a 1D or 2D array as .txt or .csv using the np.savetxt API. However, trying to store the 3-dimensional array_L as a .txt or .csv file causes Python to throw the following error:

# store array in text format
np.savetxt('array_L.txt', array_L, delimiter=" ")

# store array in csv format
np.savetxt('array_L.csv', array_L, delimiter=",")

ValueError: Expected 1D or 2D array, got 3D array instead
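For comparison, saving the small 2-dimensional array_XS works as expected:

# saving the 2-dimensional array works fine
np.savetxt('array_XS.csv', array_XS, delimiter=",")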

Using np.save to Store in NumPy NPY Format

You could store the 3-dimensional array as a .npy file using the np.save API. Keep in mind that this option doesn’t scale to large datasets: a .npy file stores the entire array in a single file, which can’t be read or written in parallel.

NPY is a binary format best suited to arrays that are n-dimensional (with n > 2), don’t need to be human-readable, and fit comfortably in local RAM.

# use np.save to store in NumPy NPY format
np.save('array_L.npy', array_L)
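You can read the file back into memory with np.load:

# load the .npy file back into a NumPy array
array_npy = np.load('array_L.npy')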

Save NumPy Arrays with Zarr

Instead of using np.savetxt or np.save, we recommend saving NumPy arrays with Zarr. Zarr is a binary file format for the storage of chunked, compressed, N-dimensional arrays.

The three most important benefits of Zarr are that it:

1. Has multiple compression options and levels built-in

2. Supports direct reading and writing from multiple backend data stores (zip, S3, etc.)

3. Can read and write data in parallel* in n-dimensional compressed chunks

* Zarr supports concurrent reads and concurrent writes separately but not concurrent reads and writes at the same time.
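To give a flavor of the second point, here’s a minimal sketch that writes the small array to a Zip archive instead of a plain directory, using zarr’s built-in ZipStore (the file name here is just an example):

import zarr

# write the small array into a single .zip archive instead of a directory store
store = zarr.ZipStore('array_XS.zip', mode='w')
zarr.save_array(store, array_XS)
store.close()

# read it back the same way as any other store
store = zarr.ZipStore('array_XS.zip', mode='r')
array_from_zip = zarr.load(store)
store.close()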

Zarr has also been widely adopted across PyData libraries like Dask, TensorStore, and xarray, which means you will see significant performance gains when using this file format together with supported libraries.

Compress NumPy Arrays with Zarr

Let’s see an example of Zarr’s compression options in action. Below, we’ll save the small and large arrays to .zarr and check the resulting file size.

import zarr

# save small NumPy array to zarr
zarr.save('array_XS.zarr', array_XS)

# get the size (in bytes) of the stored .zarr file
! stat -f '%z' array_XS.zarr
>> 128

# save large NumPy array to zarr
zarr.save('array_L.zarr', array_L)

# get the size of the stored .zarr directory
! du -h array_L.zarr

>> 693M	array_L.zarr

Storing array_L as Zarr leads to a significant reduction in file size (roughly 15%), even with just the default out-of-the-box compression settings. Check out the accompanying notebook for more compression options you can tweak to increase performance.
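If you want to go beyond the defaults, one knob to turn is the compressor itself. As a rough sketch (the chunk shape and compression level below are just illustrative choices), you can pass an explicit Blosc/Zstd compressor from the numcodecs package when creating the Zarr array:

from numcodecs import Blosc

# create a Zarr array with an explicit compressor and chunk shape
compressor = Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE)
z = zarr.open(
    'array_L_zstd.zarr',
    mode='w',
    shape=array_L.shape,
    chunks=(500, 500, 50),  # illustrative chunk shape
    dtype=array_L.dtype,
    compressor=compressor,
)
z[:] = array_L

# z.info reports the compression ratio and storage size
print(z.info)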

Loading NumPy Arrays with Zarr

You can load arrays stored as .zarr back into your Python session using zarr.load().

# load in array from zarr
array_zarr = zarr.load('array_L.zarr')

It will be loaded into your Python session as a regular NumPy array.

type(array_zarr)

>> numpy.ndarray

Zarr supports multiple backend data stores. This means you can easily load .zarr files from cloud-based data stores, like Amazon S3, directly into your Python session:

# load small zarr array from S3
array_S = zarr.load("s3://coiled-datasets/synthetic-data/array-random-390KB.zarr")

Read and Write NumPy Arrays in Parallel with Zarr and Dask

If you’re working with data stored in the cloud, chances are that your data is larger than your local machine RAM. In that case, you can use Dask to read and write your large Zarr arrays in parallel.

Below, we’ll first try to load a 370GB .zarr file into our Python session directly:

array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

This fails. Python will throw the following error:

---------------------------------------------------------------------------
MemoryError          Traceback (most recent call last)
<ipython-input-5-7969a01a46fb> in <module>
----> 1 array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/convenience.py in load(store)
    361     store = normalize_store_arg(store)
    362     if contains_array(store, path=None):
--> 363         return Array(store=store, path=None)[...]
    364     elif contains_group(store, path=None):
    365         grp = Group(store=store, path=None)

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
    671 
    672         fields, selection = pop_fields(selection)
--> 673         return self.get_basic_selection(selection, fields=fields)
    674 
    675     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
    797         else:
    798             return self._get_basic_selection_nd(selection=selection, out=out,
--> 799                                                 fields=fields)
    800 
    801     def _get_basic_selection_zd(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_nd(self, selection, out, fields)
    839         indexer = BasicIndexer(selection, self)
    840 
--> 841         return self._get_selection(indexer=indexer, out=out, fields=fields)
    842 
    843     def get_orthogonal_selection(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
   1118         # setup output array
   1119         if out is None:
-> 1120             out = np.empty(out_shape, dtype=out_dtype, order=self._order)
   1121         else:
   1122             check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 373. GiB for an array with shape (10000, 10000, 500) and data type float64

Loading the same 370GB .zarr file into a Dask array works fine:

import dask.array as da

dask_array = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

dask_array

This is because Dask evaluates lazily. The array is not read into memory until you specifically instruct Dask to perform a computation on the dataset.

This means you can perform some computations on this dataset locally. But loading the entire array into local RAM will still fail or cause Dask to spill to disk.

NOTE: Even if your machine technically has the storage resources to spill this dataset to disk, doing so will significantly reduce performance.
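Computations that only touch part of the array, on the other hand, run comfortably on a laptop, because Dask only pulls the chunks it needs from S3. A small sketch (the slice is arbitrary):

# only the chunks overlapping this slice are read from S3 and computed
subset_mean = dask_array[:100, :100, :].mean().compute()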

Scale Python to a Dask Cluster with Coiled

We’ll need to run this on a remote cluster to access additional hardware resources. 

To do this:

1. Spin up a Coiled cluster:

import coiled

cluster = coiled.Cluster(
    name="create-synth-array",
    software="coiled-examples/numpy-zarr",
    n_workers=50,
    worker_cpu=4,
    worker_memory="24GiB",
    backend_options={'spot': 'True'},
)

2. Connect Dask to this cluster:

from distributed import Client
client = Client(cluster)

3. And then run computations over the entire cluster comfortably:

da_1 = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

da_2 = da_1.T

da_2

%%time
da_2.to_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB-T.zarr")

CPU times: user 2.26 s, sys: 233 ms, total: 2.49 s
Wall time: 1min 32s

Our Coiled cluster has 50 Dask workers with 24GiB of RAM each, all running a pre-built software environment containing the necessary Python dependencies. This means we have enough resources to comfortably transpose the array and write it back to S3.

Dask does all of this for us in parallel and without ever loading the array into our local memory. It read, transposed, and wrote the 370GB array back to S3 in under 2 minutes.

Saving NumPy Arrays Summary

Let’s recap: 

  • There are important limitations to storing NumPy arrays as text, CSV, or NPY files.
  • The Zarr file format offers powerful compression options, supports multiple data store backends, and can read/write your NumPy arrays in parallel.
  • Dask allows you to scale Python and use the parallel read/write capabilities of Zarr to their full potential.
  • Connecting Dask to an on-demand Coiled cluster allows for lightning-fast computations over larger-than-memory datasets.

To get started with Coiled, create a free account using your GitHub credentials.
