Why You Should Save NumPy Arrays with Zarr
• January 5, 2022
This post tells you why and how to use the Zarr format to save a NumPy array. It walks you through the code to read and write a large NumPy array in parallel using Zarr and Dask.
Here’s the code if you want to jump right in. If you have questions about the code, reach out to us on Slack.

Common Ways to Save NumPy Arrays
Three common ways to store a NumPy array are as a .csv file, a .txt file, or a .npy file. Each of these has important shortcomings:
- CSV and TXT files are human-readable formats that can't store NumPy arrays with more than 2 dimensions.
- The native NPY binary file format doesn’t support parallel read/write operations.
Let’s see an example of this in action below.
We’ll start by creating dummy NumPy arrays as our input data. We’ll use np.random.rand to generate two arrays populated with random numbers: a small 2-dimensional array array_XS and a larger 3-dimensional array array_L.
import numpy as np

array_XS = np.random.rand(3, 2)
array_L = np.random.rand(1000, 1000, 100)
You can store a 1D or 2D array as .txt or .csv using the np.savetxt API. However, storing the 3-dimensional array_L as a .txt or .csv file will cause Python to throw the following error:
# store array in text format
np.savetxt('array_L.txt', array_L, delimiter=" ")

# store array in csv format
np.savetxt('array_L.csv', array_L, delimiter=",")

ValueError: Expected 1D or 2D array, got 3D array instead
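For contrast, saving the 2-dimensional array_XS works fine (a quick sketch; the filename is just illustrative):

# the 2D array fits the row/column model of CSV
np.savetxt('array_XS.csv', array_XS, delimiter=",")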
Using np.save to Store in NumPy NPY Format
You could store the 3-dimensional array as a .npy file using the np.save API. Keep in mind that this option doesn't scale to large datasets: a .npy file stores the array in a single file, which can't be read or written in parallel.
NPY is a binary format best suited to saving arrays that have more than 2 dimensions, don't need to be human-readable, and fit comfortably in local RAM.
# use np.save to store in NumPy NPY format
np.save('array_L.npy', array_L)
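Reading the file back in is a single call to np.load (the variable name here is just for illustration):

# load the array back from NPY format
array_from_npy = np.load('array_L.npy')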
Save NumPy Arrays with Zarr
Instead of using np.savetxt or np.save, we recommend saving NumPy arrays with Zarr. Zarr is a binary file format for the storage of chunked, compressed, N-dimensional arrays.
The three most important benefits of Zarr are that it:
1. Has multiple compression options and levels built-in
2. Supports direct reading and writing from multiple backend data stores (zip, S3, etc.)
3. Can read and write data in parallel* in n-dimensional compressed chunks
* Zarr supports concurrent reads and concurrent writes separately but not concurrent reads and writes at the same time.
Zarr has also been widely adopted across PyData libraries like Dask, TensorStore, and xarray, which means you will see significant performance gains when using this file format together with supported libraries.
Compress NumPy Arrays with Zarr
Let’s see an example of Zarr’s compression options in action. Below, we’ll save the small and large arrays to .zarr and check the resulting file sizes.
import zarr

# save small NumPy array to zarr
zarr.save('array_XS.zarr', array_XS)

# get the size (in bytes) of the stored .zarr file
! stat -f '%z' array_XS.zarr

>> 128

# save large NumPy array to zarr
zarr.save('array_L.zarr', array_L)

# get the size of the stored .zarr directory
! du -h array_L.zarr

>> 693M    array_L.zarr
Storing array_L as Zarr leads to a significant reduction in file size (~15% for array_L), even with just the default out-of-the-box compression settings. Check out the accompanying notebook for more compression options you can tweak to increase performance.
Loading NumPy Arrays with Zarr
You can load arrays stored as .zarr back into your Python session using zarr.load().
# load in array from zarr
array_zarr = zarr.load('array_L.zarr')
It will be loaded into your Python session as a regular NumPy array.
type(array_zarr)

>> numpy.ndarray
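As a quick sanity check, you can confirm that the round trip preserves the values (a sketch; the variable name is illustrative):

# verify the Zarr round trip preserves the data
array_XS_roundtrip = zarr.load('array_XS.zarr')
assert np.allclose(array_XS, array_XS_roundtrip)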
Zarr supports multiple backend data stores. This means you can easily load .zarr files from cloud-based data stores, like Amazon S3, directly into your Python session:
# load small zarr array from S3
array_S = zarr.load("s3://coiled-datasets/synthetic-data/array-random-390KB.zarr")
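Under the hood, passing an s3:// URL relies on the fsspec/s3fs libraries being installed. An equivalent, more explicit sketch, assuming the bucket allows anonymous access, looks like this:

import s3fs
import zarr

# open the bucket anonymously and wrap the .zarr prefix as a store
s3 = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map("coiled-datasets/synthetic-data/array-random-390KB.zarr", s3=s3)
array_S = zarr.load(store)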
Read and Write NumPy Arrays in Parallel with Zarr and Dask
If you’re working with data stored in the cloud, chances are your data is larger than your local machine’s RAM. In that case, you can use Dask to read and write your large Zarr arrays in parallel.
Below, we’ll work through an example. First, let’s try loading a 370GB .zarr file directly into our Python session:
array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")
This fails. Python will throw the following error:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-5-7969a01a46fb> in <module>
----> 1 array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/convenience.py in load(store)
    361     store = normalize_store_arg(store)
    362     if contains_array(store, path=None):
--> 363         return Array(store=store, path=None)[...]
    364     elif contains_group(store, path=None):
    365         grp = Group(store=store, path=None)

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
    671
    672         fields, selection = pop_fields(selection)
--> 673         return self.get_basic_selection(selection, fields=fields)
    674
    675     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
    797         else:
    798             return self._get_basic_selection_nd(selection=selection, out=out,
--> 799                                                 fields=fields)
    800
    801     def _get_basic_selection_zd(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_nd(self, selection, out, fields)
    839         indexer = BasicIndexer(selection, self)
    840
--> 841         return self._get_selection(indexer=indexer, out=out, fields=fields)
    842
    843     def get_orthogonal_selection(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
   1118         # setup output array
   1119         if out is None:
-> 1120             out = np.empty(out_shape, dtype=out_dtype, order=self._order)
   1121         else:
   1122             check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 373. GiB for an array with shape (10000, 10000, 500) and data type float64
Loading the same 370GB .zarr file into a Dask array works fine:
import dask.array as da

dask_array = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")
dask_array

This is because Dask evaluates lazily. The array is not read into memory until you specifically instruct Dask to perform a computation on the dataset.
This means you can perform some computations on this dataset locally (see the sketch after the note below). But loading the entire array into local RAM will still fail or cause Dask to spill to disk.
NOTE: Even if your machine technically has the storage resources to spill this dataset to disk, doing so will significantly reduce performance.
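For instance, a chunk-wise reduction such as the mean can run locally without ever materializing the full array, since Dask aggregates one chunk at a time. This is only a sketch; it still streams every chunk from S3, so it can take a while:

import dask.array as da

# lazily reference the 370GB Zarr array on S3
dask_array = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

# the reduction is computed chunk by chunk, so the whole array never has to fit in RAM
mean_value = dask_array.mean().compute()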
Scale Python to a Dask Cluster with Coiled
We’ll need to run this on a remote cluster to access additional hardware resources.
To do this:
1. Spin up a Coiled cluster
import coiled

cluster = coiled.Cluster(
    name="create-synth-array",
    software="coiled-examples/numpy-zarr",
    n_workers=50,
    worker_cpu=4,
    worker_memory="24Gib",
    backend_options={'spot': 'True'},
)
2. Connect Dask to this cluster
from distributed import Client

client = Client(cluster)
3. And then run computations over the entire cluster comfortably.
da_1 = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")
da_2 = da_1.T
da_2
%%time
da_2.to_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB-T.zarr")

CPU times: user 2.26 s, sys: 233 ms, total: 2.49 s
Wall time: 1min 32s

Our Coiled cluster has 50 Dask workers with 24 GiB of RAM each, all running a pre-built software environment containing the necessary Python dependencies. This means we have enough resources to comfortably transpose the array and write it back to S3.
Dask does all of this in parallel and without ever loading the array into our local memory. It loaded, transformed, and saved the 370GB array back to S3 in less than 2 minutes.
Saving NumPy Arrays Summary
Let’s recap:
- There are important limitations to storing NumPy arrays as text, CSV, or NPY files.
- The Zarr file format offers powerful compression options, supports multiple data store backends, and can read/write your NumPy arrays in parallel.
- Dask allows you to scale Python and use the parallel read/write capabilities of Zarr to their full potential.
- Connecting Dask to an on-demand Coiled cluster allows for lightning-fast computations over larger-than-memory datasets.
To get started with Coiled, create a free account here using your GitHub credentials. You can learn more about our product by clicking below!