Zarr (data format)
{{Short description|Storage format for multidimensional arrays}}
{{Infobox file format
| name = Zarr
| icon =
| iconcaption =
| icon_size =
| screenshot =
| screenshot_size =
| caption =
|_noextcode =
| extension = .zarr
|_nomimecode =
| mime =
| type_code =
| uniform_type =
| conforms_to =
| magic =
| max_size =
| developer =
| released =
| latest_release_version = 3
| latest_release_date =
| type = Multidimensional array
| container_for =
| contained_by =
| extended_from =
| extended_to =
| standard =
| free = Yes
| open = Yes
| url = {{URL|https://zarr.dev/}}
}}
Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready": it supports random access by dividing data into subsets referred to as chunks.
{{cite web
| title = Zarr - chunked, compressed, N-dimensional arrays
| url = https://zarr.dev/
| website = zarr.dev
| access-date = 2024-09-12}}
{{cite web
| title = Cloud-Optimized Geospatial Formats Guide: Zarr
| url = https://guide.cloudnativegeo.org/zarr/intro.html
| website = guide.cloudnativegeo.org
| access-date = 2024-09-12}}
Zarr implementations are available for many programming languages, including Python, Java, JavaScript, C++, Rust and Julia.
{{cite web
| title = Zarr Implementations
| url = https://zarr.dev/implementations/
| website = zarr.dev
| access-date = 2025-01-09}}
It has been used by organizations such as Google and Microsoft to publish large datasets.
{{cite web
| title = Google Cloud: ERA5 data
| url = https://cloud.google.com/storage/docs/public-datasets/era5
| website = cloud.google.com
| access-date = 2024-09-12}}
{{cite web
| title = Microsoft Planetary Computer: Reading Zarr Data
| url = https://planetarycomputer.microsoft.com/docs/quickstarts/reading-zarr-data/
| website = planetarycomputer.microsoft.com
| access-date = 2024-09-12}}
The first versions of Zarr were released in 2015 by Alistair Miles.
{{cite web
|url=https://pypi.org/project/zarr/#history
|title=zarr - PyPI
|access-date=2025-02-10
}}
{{cite web
|url=https://alimanfoo.github.io/2016/04/14/to-hdf5-and-beyond.html
|title=To HDF5 and beyond
|date=2016-04-14
|access-date=2025-02-10
|author=Alistair Miles
}}
Zarr is designed to support high-throughput distributed I/O on different storage systems, a common requirement in cloud computing. A Zarr array can be read efficiently by multiple operations in parallel, and likewise written by multiple operations in parallel.
{{cite web
| title = Zarr - Tutorial
| url = https://zarr.readthedocs.io/en/stable/tutorial.html
| website = zarr.readthedocs.io
| access-date = 2024-09-12}}
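The following is a minimal sketch of why chunking lends itself to parallel I/O. It does not use a Zarr library: the dictionary `store`, the key names such as `chunk-0`, and the use of `zlib` as the compressor are illustrative assumptions standing in for a real storage backend and codec. Each worker touches a different chunk, so the workers need no coordination with one another.

```python
import zlib
from array import array
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 1_000  # elements per chunk (hypothetical value)
N_CHUNKS = 8       # number of chunks in the toy one-dimensional array

# Toy key-value store: chunk key -> compressed bytes.
# A real deployment might use a file system or an object store instead.
store = {}


def write_chunk(index):
    """Encode one chunk and store it under its own key."""
    values = array("i", range(index * CHUNK_LEN, (index + 1) * CHUNK_LEN))
    store[f"chunk-{index}"] = zlib.compress(values.tobytes())


def read_chunk(index):
    """Fetch one chunk by key and decode it back into integers."""
    values = array("i")
    values.frombytes(zlib.decompress(store[f"chunk-{index}"]))
    return values


# Parallel writes: every worker writes a distinct chunk key.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(write_chunk, range(N_CHUNKS)))

# Parallel reads: chunks are fetched and decoded independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(read_chunk, range(N_CHUNKS)))
```

Because each chunk is stored and compressed on its own, reading part of the array only requires fetching the chunks that overlap the requested region.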
Format description
The core data structure in Zarr is the multidimensional array. To allow parallel access, each array is stored and accessed as a grid of "chunks". The actual format on disk depends on the compressor and storage plugins selected by the user.
File:Zarr-scipy2019-storage.png
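The arithmetic behind a chunk grid can be sketched as follows. This is an illustration in plain Python rather than the API of any Zarr implementation; the function names and the example shape and chunk sizes are assumptions chosen for the example.

```python
from itertools import product
from math import ceil


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def chunks_for_selection(selection, chunks):
    """Grid coordinates of every chunk touched by a rectangular selection.

    `selection` holds one non-empty, half-open (start, stop) pair per dimension.
    """
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]
    return list(product(*ranges))


shape = (10_000, 10_000)   # hypothetical array shape
chunks = (1_000, 1_000)    # hypothetical chunk shape

grid = chunk_grid_shape(shape, chunks)
# Rows 1500..2499 and columns 0..9 overlap two chunks.
touched = chunks_for_selection(((1500, 2500), (0, 10)), chunks)
print(grid)     # (10, 10)
print(touched)  # [(1, 0), (2, 0)]
```

In this example only two of the hundred chunks need to be read to serve the selection, which is what makes random access to a large array inexpensive.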
Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array.
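A toy model of such a hierarchy is sketched below. It assumes that groups and arrays are described by small JSON documents kept in the same key-value store as the data; the key name `meta.json`, the field names, and the helper functions are hypothetical, and the actual metadata layout differs between versions of the Zarr specification.

```python
import json

# Toy key-value store: key -> bytes. Paths use "/" as the hierarchy separator.
store = {}


def meta_key(path):
    """Key of the (hypothetical) metadata document for a node at `path`."""
    return f"{path}/meta.json".lstrip("/")


def create_group(store, path, attributes=None):
    """Record a group node with optional user-defined key-value attributes."""
    doc = {"node_type": "group", "attributes": attributes or {}}
    store[meta_key(path)] = json.dumps(doc).encode()


def create_array(store, path, shape, chunks, attributes=None):
    """Record an array node; chunk data would be stored under the same path."""
    doc = {
        "node_type": "array",
        "shape": list(shape),
        "chunks": list(chunks),
        "attributes": attributes or {},
    }
    store[meta_key(path)] = json.dumps(doc).encode()


def list_children(store, path):
    """Names of the nodes directly below the group at `path`."""
    prefix = f"{path}/" if path else ""
    names = set()
    for key in store:
        if key.startswith(prefix):
            head, sep, _ = key[len(prefix):].partition("/")
            if sep:  # skip the node's own metadata document
                names.add(head)
    return sorted(names)


create_group(store, "")  # root group
create_group(store, "climate", {"source": "example"})
create_array(store, "climate/temperature", (365, 180, 360), (30, 90, 90),
             {"units": "K"})

attrs = json.loads(store[meta_key("climate/temperature")])["attributes"]
```

Here `list_children(store, "climate")` yields the array name `temperature`, and `attrs` holds the annotation stored alongside that array.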
Applications
For bioimaging such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions.
{{cite journal
| author = Moore, Josh
| title = OME-Zarr: a cloud-optimized bioimaging file format with international community support
| journal = Histochemistry and Cell Biology
| volume = 160
| issue = 3
| publisher = Springer Science and Business Media LLC
| date = 2023
| pages = 223–251
| issn = 1432-119X
| doi = 10.1007/s00418-023-02209-1
| pmid = 37428210
| pmc = 10492740
| hdl = 1721.1/151126
| hdl-access = free
}}
Similarly, Zarr is used to store and publish weather and satellite data, as well as multidimensional energy data,
{{cite web
| title = Lazy loading: Making it easier to access vast datasets of weather & satellite data
| url = https://openclimatefix.org/post/lazy-loading-making-it-easier-to-access-vast-datasets-of-weather-satellite-data
| website = openclimatefix.org
| access-date = 2024-09-12}}
{{cite journal
| last1 = Sansal
| first1 = Altay
| last2 = Kainkaryam
| first2 = Sribharath
| last3 = Lasscock
| first3 = Ben
| last4 = Valenciano
| first4 = Alejandro
| author-link =
| title = MDIO: Open-source format for multidimensional energy data
| journal = The Leading Edge
| volume = 42
| issue = 7
| pages = 465–473
| publisher = Society of Exploration Geophysicists
| date = 2023
| issn = 1938-3789
| doi = 10.1190/tle42070465.1
| bibcode = 2023LeaEd..42..465S
}}
among others.
References
{{reflist}}