-
It's important to distinguish between zarr the format and zarr-python the implementation. We are trying to make indexing faster in zarr-python.
-
I think the tensorstore tutorial is a good place to start. You would probably not want to use the format in that example (n5); instead, you can use tensorstore to read and write zarr v2 and v3 arrays (to write zarr groups you can just use zarr-python).
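For a concrete starting point, here is a minimal sketch of creating, writing, and reading a zarr v3 array with tensorstore on the local filesystem. The path, shape, and chunk layout are placeholders, and I'm assuming the 'zarr3' driver with a 'file' kvstore; check the tensorstore docs for the exact spec options.

import numpy as np
import tensorstore as ts

# create (or open) an uncompressed zarr v3 array on the local filesystem
# (path, shape, and chunking here are illustrative placeholders)
arr = ts.open({
    'driver': 'zarr3',
    'kvstore': {'driver': 'file', 'path': '/tmp/data.zarr/sample0'},
}, create=True, open=True,
   dtype=ts.float32,
   shape=[100, 14, 28, 28, 28],
   chunk_layout=ts.ChunkLayout(chunk_shape=[1, 14, 28, 28, 28]),
).result()

# write one sample, then read it back; .read() resolves to a plain numpy array
arr[0].write(np.random.normal(size=(14, 28, 28, 28)).astype(np.float32)).result()
sample = arr[0].read().result()
print(sample.shape, sample.dtype)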
-
Sooo 5 months later and I'm still here 😓 Since you mentioned you are using cloud storage for your zarr datasets, I wondered which kind of cloud technology you use. Is it as simple as some S3 buckets, or are you using anything that is tuned for latency and throughput? Is there anything you can recommend for reading data over the network as fast as possible?
-
I just tried running the original example on my MacBook using Zarr 3.0.5, with a few changes.
In this example, Zarr was about 4x slower than memmapped numpy. Memmapping is very efficient, so I'm not super surprised by this.
Interestingly, when I ran the exact same code from a Jupyter notebook in VS Code, the timing was almost identical!
I wonder if there is some clue here.
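One way to chase that clue is to profile the read loop directly. A minimal sketch with cProfile, assuming a dataset laid out like the benchmark's (the path and the number of reads are placeholders):

import cProfile
import pstats
import random

import zarr

group = zarr.open_group("/path/to/data.zarr", mode="r")  # placeholder path
key = next(iter(group))                                  # first array in the group
arr = group[key]

def read_many(n=500):
    for _ in range(n):
        _ = arr[random.randint(0, arr.shape[0] - 1)]

cProfile.run("read_many()", "zarr_reads.prof")
pstats.Stats("zarr_reads.prof").sort_stats("cumulative").print_stats(20)

Running this once from a plain script and once from the notebook should show where the extra time is being spent.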
-
I dug into this example a bit more with the goal of understanding the impact of memmap. I wrote a very naive flat binary storage layer which uses regular filesystem calls:

import numpy as np
from pathlib import Path
import time
from tqdm import tqdm
import os
import random
import zarr
import tempfile
import zarr.codecs


def write_numpy_dataset(data_path, raw_data):
    Path(data_path).mkdir(parents=True, exist_ok=True)
    for sample_name, sample_data in tqdm(raw_data.items()):
        mmap = np.memmap(
            f"{data_path}/{sample_name}.npy",
            dtype=sample_data.dtype,
            mode='w+',
            shape=sample_data.shape
        )
        mmap[:] = sample_data[:]
        mmap.flush()


def write_raw_dataset(data_path, raw_data):
    for sample_name, sample_data in tqdm(raw_data.items()):
        Path(f"{data_path}/{sample_name}").mkdir(parents=True, exist_ok=True)
        for n in range(sample_data.shape[0]):
            raw = sample_data[n].tobytes()
            with open(f"{data_path}/{sample_name}/{n:02d}.raw", 'wb') as f:
                f.write(raw)


def write_zarr_dataset(data_path, raw_data):
    zarr_store = f'{data_path}/data.zarr'
    zarr_group = zarr.group(zarr_store, zarr_format=3, overwrite=True)
    for data_key, sample in tqdm(raw_data.items()):
        a = zarr_group.create_array(
            data_key,
            dtype=sample.dtype,
            shape=sample.shape,
            chunks=(1,) + tuple(sample.shape[1:]),
            compressors=None,
        )
        a[:] = sample


def load_samples_numpy(data_path):
    numpy_files = [f"{data_path}/{f}" for f in os.listdir(data_path)]
    while True:
        for numpy_file in numpy_files:
            index = random.randint(0, 99)
            # open a new memory map for every access. This way I will never cache samples in RAM
            volume = np.memmap(
                numpy_file,
                mode='r',
                dtype=np.float32
            ).reshape([100, -1, 28, 28, 28])
            yield volume[index]


def load_samples_raw(data_path):
    sample_names = os.listdir(data_path)
    while True:
        for sample_name in sample_names:
            index = random.randint(0, 99)
            fname = f"{data_path}/{sample_name}/{index:02d}.raw"
            with open(fname, 'rb') as f:
                data = np.frombuffer(f.read(), dtype=np.float32)
            data = data.reshape([-1, 28, 28, 28])
            yield data


def load_samples_zarr(data_path):
    zarr_store = f"{data_path}/data.zarr"
    zarr_group = zarr.open_group(zarr_store, mode="r")
    data_keys = [s for s in zarr_group]
    while True:
        for data_key in data_keys:
            index = random.randint(0, 99)
            yield zarr_group[data_key][index]


def benchmark(dataset_path, load_fn):
    total_size = 0
    start_time = time.time()
    print("\nStarting benchmark")
    print(f"reading from: {dataset_path}")
    sample_sums = []
    for i, sample in tqdm(enumerate(load_fn(dataset_path))):
        if i == 2000:
            break
        total_size += sample.nbytes
        # assert sample.shape == (1, 1, 28, 28, 28), f"Unexpected shape: {sample.shape}"
        sample_sums.append(sample.sum())  # do some dummy calculation to force the data into RAM
    end_time = time.time()
    throughput = (total_size / (end_time - start_time)) / (1024 * 1024)
    print(f"Throughput: {throughput:.2f} MB/s")


def main():
    data_path_numpy = tempfile.TemporaryDirectory()
    data_path_raw = tempfile.TemporaryDirectory()
    data_path_zarr = tempfile.TemporaryDirectory()

    raw_data = {
        'sample0': np.random.normal(size=(100, 14, 28, 28, 28)).astype(np.float32),
        'sample1': np.random.normal(size=(100, 16, 28, 28, 28)).astype(np.float32),
        'sample2': np.random.normal(size=(100, 12, 28, 28, 28)).astype(np.float32),
        'sample3': np.random.normal(size=(100, 11, 28, 28, 28)).astype(np.float32),
        'sample4': np.random.normal(size=(100, 13, 28, 28, 28)).astype(np.float32),
        'sample5': np.random.normal(size=(100, 19, 28, 28, 28)).astype(np.float32),
        'sample6': np.random.normal(size=(100, 20, 28, 28, 28)).astype(np.float32),
        'sample7': np.random.normal(size=(100, 15, 28, 28, 28)).astype(np.float32),
        'sample8': np.random.normal(size=(100, 18, 28, 28, 28)).astype(np.float32),
        'sample9': np.random.normal(size=(100, 17, 28, 28, 28)).astype(np.float32),
    }

    write_numpy_dataset(data_path_numpy.name, raw_data)
    write_raw_dataset(data_path_raw.name, raw_data)
    write_zarr_dataset(data_path_zarr.name, raw_data)

    benchmark(data_path_numpy.name, load_samples_numpy)
    benchmark(data_path_raw.name, load_samples_raw)
    benchmark(data_path_zarr.name, load_samples_zarr)


if __name__ == "__main__":
    main()
I feel that Zarr should be able to at least match the "Numpy raw" performance. I wonder if this is related to the unnecessary memory copies issue described in #2904.
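One way to test that hypothesis is to compare a read through the zarr API with reading the same uncompressed chunk file directly. A sketch, assuming the layout written by write_zarr_dataset above and the default zarr v3 chunk key encoding (chunks stored under c/ with / separators); the path and repeat count are placeholders:

import time
import numpy as np
import zarr

data_path = "/path/to/dataset"  # placeholder: wherever write_zarr_dataset wrote to
group = zarr.open_group(f"{data_path}/data.zarr", mode="r")
key = next(iter(group))
arr = group[key]
index = 0

# read one sample through the zarr API
t0 = time.time()
for _ in range(200):
    _ = arr[index]
api_s = time.time() - t0

# read the same chunk file directly; with compressors=None the chunk is raw little-endian float32
chunk_file = f"{data_path}/data.zarr/{key}/c/{index}/0/0/0/0"
t0 = time.time()
for _ in range(200):
    with open(chunk_file, "rb") as f:
        _ = np.frombuffer(f.read(), dtype=np.float32).reshape(arr.shape[1:])
raw_s = time.time() - t0

print(f"zarr API: {api_s:.3f}s  direct chunk read: {raw_s:.3f}s")

If the direct read of the exact same bytes is much faster than the API read, the overhead sits in the codec/indexing pipeline rather than in the filesystem.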
-
I'm looking for ways to improve the performance of my dataloading pipeline, and I found Zarr. To get an idea about throughput, I started a small benchmark script in Python. To get a baseline, I also ran tests using numpy memory-mapped arrays.
I'm working with 4D arrays which are quite large. One of my requirements is that I need to access them as a key-value store. From each value, I read samples at random indices along the first axis.
I created some dummy arrays to test throughput.
Here is my complete benchmarking code that compares Zarr to accessing raw Numpy arrays on disk:
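In outline, the comparison looks like the sketch below (shapes, paths, and counts are illustrative, not the full script): each value is written once as a memmapped file and once as an uncompressed zarr array chunked by one sample along the first axis, and then single samples are read back at random indices.

import time
import tempfile
import numpy as np
import zarr

tmp = tempfile.TemporaryDirectory()
data = np.random.normal(size=(100, 16, 28, 28, 28)).astype(np.float32)  # 100 samples of one 4D value

# baseline: a single memmapped file on disk
mm_w = np.memmap(f"{tmp.name}/sample.npy", dtype=np.float32, mode="w+", shape=data.shape)
mm_w[:] = data
mm_w.flush()

# zarr: uncompressed, one sample per chunk
group = zarr.group(f"{tmp.name}/data.zarr", zarr_format=3, overwrite=True)
z = group.create_array("sample", dtype=data.dtype, shape=data.shape,
                       chunks=(1,) + data.shape[1:], compressors=None)
z[:] = data

def throughput(read_one, n=500):
    start = time.time()
    for _ in range(n):
        read_one(np.random.randint(0, data.shape[0]))
    return n * data[0].nbytes / (time.time() - start) / 2**20  # MB/s

mm_r = np.memmap(f"{tmp.name}/sample.npy", dtype=np.float32, mode="r").reshape(data.shape)
print(f"numpy memmap: {throughput(lambda i: mm_r[i].sum()):.1f} MB/s")
print(f"zarr:         {throughput(lambda i: z[i].sum()):.1f} MB/s")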
It turns out that accessing Numpy arrays outperforms Zarr by a factor of ~6-7.
My maximum disk speed is 500 MB/s, and I reach roughly 400 MB/s using numpy. With Zarr I see a throughput of ~50-60 MB/s.
This difference is so big that I feel like I must be missing something. I tried different chunk sizes and disabled compression completely. Still, Zarr never reaches a throughput that comes even close to numpy's memory-mapped arrays.
Does anyone have a hint on what I'm missing? Is Zarr generally slow for my use case of accessing large 4D arrays?
Appreciate any help