# Why You Should Save NumPy Arrays with Zarr

January 5, 2022

This post explains why and how to use the Zarr format to save a NumPy array. It walks you through the code to read and write a large NumPy array in parallel using Zarr and Dask.

Here’s the code if you want to jump right in. If you have questions about the code, reach out to us on Slack.

## Common Ways to Save NumPy Arrays

Three common ways to store a NumPy array are as a .csv file, a .txt file, or a .npy file. Each of these has important shortcomings:

• CSV and TXT files are human-readable formats that can't contain NumPy arrays with more than 2 dimensions.
• The native NPY binary file format doesn't support parallel read/write operations.

Let’s see an example of this in action below.

We’ll start by creating a dummy NumPy array as our input data. We’ll use `np.random.rand` to generate two arrays populated with random numbers: a small array `array_XS` with 2 dimensions and a larger array `array_L` with 3 dimensions.

```
import numpy as np

array_XS = np.random.rand(3, 2)
array_L = np.random.rand(1000, 1000, 100)
```

You can store a 1-D or 2-D array as .txt or .csv using the `np.savetxt` API. However, storing the 3-dimensional `array_L` this way will cause Python to throw the following error:

```
# store array in text format
np.savetxt('array_L.txt', array_L, delimiter=" ")

# store array in csv format
np.savetxt('array_L.csv', array_L, delimiter=",")

ValueError: Expected 1D or 2D array, got 3D array instead
```
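For contrast, the 2-D `array_XS` is within `np.savetxt`'s limits and round-trips cleanly. A minimal sketch (the file name is my own):

```python
import numpy as np

array_XS = np.random.rand(3, 2)

# 2-D arrays are fine: savetxt writes one line of text per array row
np.savetxt("array_XS.csv", array_XS, delimiter=",")

# read it back to confirm nothing was lost
loaded = np.loadtxt("array_XS.csv", delimiter=",")
print(loaded.shape)  # (3, 2)
```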

### Using np.save to Store in NumPy NPY Format

You could store the 3-dimensional array as a .npy file using the `np.save` API. Keep in mind that this option doesn't scale to large datasets: a .npy file stores the whole array in a single file, which can't be read or written in parallel.

NPY is a binary format best suited to arrays that are n-dimensional (where n > 3), don't need to be human-readable, and fit comfortably in local RAM.

```
# use np.save to store in numpy npy format
np.save('array_L.npy', array_L)
```
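The single-file nature of NPY is easy to see in a round trip. The sketch below uses a smaller stand-in array; `np.load` with `mmap_mode` lets you read slices without pulling the whole file into RAM, but everything still goes through one file, so parallel writes are off the table:

```python
import numpy as np

# smaller stand-in for array_L
arr = np.random.rand(100, 100, 10)
np.save("arr.npy", arr)

# a plain load reads the entire file back into memory
loaded = np.load("arr.npy")

# memory-mapping reads only the slices you touch,
# but all reads and writes still go through this single file
mm = np.load("arr.npy", mmap_mode="r")
first_slab = np.asarray(mm[0])
```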

## Save NumPy Arrays with Zarr

Instead of using `np.savetxt` or `np.save`, we recommend saving NumPy arrays with Zarr. Zarr is a binary file format for the storage of chunked, compressed, N-dimensional arrays.

The three most important benefits of Zarr are that it:

1. Has multiple compression options and levels built-in

2. Supports direct reading and writing from multiple backend data stores (zip, S3, etc.)

3. Can read and write data in parallel* in n-dimensional compressed chunks

* Zarr supports concurrent reads and concurrent writes separately but not concurrent reads and writes at the same time.

Zarr has also been widely adopted across PyData libraries like Dask, TensorStore, and Xarray, which means you will often see significant performance gains when using this file format together with supported libraries.

## Compress NumPy Arrays with Zarr

Let’s see an example of Zarr’s compression options in action. Below, we’ll save the small and large arrays to `.zarr` and check the resulting file size.

```
import zarr

# save small NumPy array to zarr
zarr.save('array_XS.zarr', array_XS)

# get the size (in bytes) of the stored .zarr file
! stat -f '%z' array_XS.zarr
>> 128

# save large NumPy array to zarr
zarr.save('array_L.zarr', array_L)

# get the size of the stored .zarr directory
! du -h array_L.zarr

>> 693M	array_L.zarr
```

Storing `array_L` as Zarr leads to a significant reduction in file size (roughly 15%), even with just the default out-of-the-box compression settings. Check out the accompanying notebook for more compression options you can tweak to increase performance.

You can load arrays stored as `.zarr` back into your Python session using `zarr.load()`.

```
# load in array from zarr
array_zarr = zarr.load('array_L.zarr')
```

It will be loaded into your Python session as a regular NumPy array.

```
type(array_zarr)

>> numpy.ndarray
```

Zarr supports multiple backend data stores. This means you can easily load `.zarr` files from cloud-based data stores, like Amazon S3, directly into your Python session:

```
# load small zarr array from S3 (requires the s3fs library; the bucket path is illustrative)
array_S3 = zarr.load("s3://my-bucket/array_XS.zarr")
```

## Read and Write NumPy Arrays in Parallel with Zarr and Dask

If you’re working with data stored in the cloud, chances are that your data is larger than your local machine RAM. In that case, you can use Dask to read and write your large Zarr arrays in parallel.

Below, we'll attempt to load a 370GB .zarr file into our Python session directly:

`array_XL = zarr.load("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")`

This fails; Python throws the following `MemoryError`:

```
---------------------------------------------------------------------------
MemoryError          Traceback (most recent call last)
<ipython-input-5-7969a01a46fb> in <module>

361     store = normalize_store_arg(store)
362     if contains_array(store, path=None):
--> 363         return Array(store=store, path=None)[...]
364     elif contains_group(store, path=None):
365         grp = Group(store=store, path=None)

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
671
672         fields, selection = pop_fields(selection)
--> 673         return self.get_basic_selection(selection, fields=fields)
674
675     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
797         else:
798             return self._get_basic_selection_nd(selection=selection, out=out,
--> 799                                                 fields=fields)
800
801     def _get_basic_selection_zd(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_nd(self, selection, out, fields)
839         indexer = BasicIndexer(selection, self)
840
--> 841         return self._get_selection(indexer=indexer, out=out, fields=fields)
842
843     def get_orthogonal_selection(self, selection, out=None, fields=None):

~/anaconda3/envs/tensorflow2_p37/lib/python3.7/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
1118         # setup output array
1119         if out is None:
-> 1120             out = np.empty(out_shape, dtype=out_dtype, order=self._order)
1121         else:
1122             check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 373. GiB for an array with shape (10000, 10000, 500) and data type float64
```

Instead, we can create a lazy Dask array that points at the same Zarr store. This call returns almost instantly:

```
import dask.array as da

dask_array = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")
```

This is because Dask evaluates lazily: the array is not read into memory until you specifically instruct Dask to perform a computation on the dataset.

This means you can perform some computations on this dataset locally. But loading the entire array into local RAM will still fail or cause Dask to spill to disk.

NOTE: Even if your machine technically has the storage resources to spill this dataset to disk, doing so will significantly reduce performance.
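Dask's lazy evaluation is easy to demonstrate locally. In the sketch below (the array is a small local stand-in for the S3 dataset), building the task graph costs almost nothing; chunks are only read when `.compute()` runs:

```python
import numpy as np
import dask.array as da

# small local stand-in for the 370GB S3 array
x = da.from_array(np.random.rand(1000, 1000), chunks=(250, 250))

# this only builds a task graph; no chunk has been processed yet
m = x.mean()

# only now are the chunks loaded and reduced, in parallel
result = m.compute()
```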

## Scale Python to a Dask Cluster with Coiled

We’ll need to run this on a remote cluster to access additional hardware resources.

To do this:

1. Spin up a Coiled cluster:

```
import coiled

cluster = coiled.Cluster(
    name="create-synth-array",
    software="coiled-examples/numpy-zarr",
    n_workers=50,
    worker_cpu=4,
    worker_memory="24GiB",
    backend_options={"spot": True},
)
```

2. Connect Dask to this cluster

```
from distributed import Client

client = Client(cluster)
```

3. Then run computations over the entire cluster comfortably:

```
da_1 = da.from_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB.zarr")

da_2 = da_1.T

da_2
```

```
%%time
da_2.to_zarr("s3://coiled-datasets/synthetic-data/array-random-370GB-T.zarr")

CPU times: user 2.26 s, sys: 233 ms, total: 2.49 s
Wall time: 1min 32s
```

Our Coiled cluster has 50 Dask workers with 24 GiB of RAM each, all running a pre-built software environment containing the necessary Python dependencies. This means we have enough resources to comfortably transpose the array and write it back to S3.

Dask does all of this for us in parallel, without ever loading the array into our local memory. It loaded, transformed, and saved a 370GB array back to S3 in less than 2 minutes.

## Saving NumPy Arrays Summary

Let’s recap:

• Storing NumPy arrays as text, CSV, or NPY files comes with important limitations.
• The Zarr file format offers powerful compression options, supports multiple data store backends, and can read/write your NumPy arrays in parallel.
• Dask allows you to scale Python and use the parallel read/write capabilities of Zarr to their full potential.
• Connecting Dask to an on-demand Coiled cluster allows for lightning-fast computations over larger-than-memory datasets.