Clean up parallel code implementation using zarrs. #167

@dblodgett-usgs

Description

Delegate parallel I/O to zarrs via extendr

Problem

pizzarr's parallel I/O is a cross-platform mess. get_parallel_settings() dispatches across six closures to accommodate the matrix of future / parallel::cluster / mclapply × progress bar × OS. R6 objects can't serialize across PSOCK workers, so chunk_getitem is split into part1/part2 — an artifact of R's parallelism model, not the problem itself. crul::HttpClient fails inside workers, so HttpStore carries an AsyncVaried workaround plus a fallback that guesses whether HTTP handle errors came from a worker even when the parallel option didn't propagate (#128). And there's no S3 support at all.

Proposed solution

Delegate chunk I/O, codec execution, and store abstraction to the zarrs Rust crate via extendr. zarrs handles parallel chunk fetch and decode through its own rayon thread pool — parallelism becomes invisible to R. The R6 class hierarchy, indexers, slicing logic, and NestedArray stay in R; only the chunk loop and store I/O move to Rust.

The R-native single-threaded path (DirectoryStore, MemoryStore, plain HttpStore via lapply) is retained permanently as the dependency-free baseline. zarrs sits beside it as the performance and capability tier. This is not a fallback — it is the deliberate split between the simple path and the fast path.

Build and distribution

zarrs 0.23.x requires Rust edition 2024 / rust-version = "1.91". As of April 2026, CRAN's macOS build machines ship rustc 1.84.1 — too old. Rather than pin an older zarrs, we target the latest stable zarrs and defer the CRAN Rust build until CRAN catches up.

Two tiers:

  • CRAN tier — Pure R, no Rust compilation. The CRAN tarball has no src/ directory. A tools/cran-build.sh script produces the submission tarball by stripping Rust artifacts from a repo copy. The package builds and works identically to v0.1.x — all I/O uses the R-native code path.
  • r-universe tier — Full zarrs via pre-built binaries: filesystem, HTTP, S3, GCS, gzip, blosc, zstd, sharding. No Rust toolchain needed for end users. r-universe builds from the full repo against the latest stable rustc.

End users install from CRAN for the baseline or from r-universe for parallel I/O and cloud storage. No environment variables, no source-build gymnastics — two channels, two audiences.

When CRAN's macOS toolchain reaches rustc ≥ 1.91, we can vendor the zarrs dependency tree and submit a Rust-enabled CRAN build. The architecture is designed for this — it just doesn't ship yet.

Sequencing

Phases 1–4 introduce zarrs alongside the existing R code: extendr scaffolding, read/write paths, r-universe CI, remote store support. The R package passes R CMD check both with Rust (full repo) and without (CRAN tarball from cran-build.sh).

Phase 5 strips the R parallel infrastructure — get_parallel_settings(), part1/part2, the HttpStore workaround, and the pbapply / future / parallel Suggests. Ships to CRAN as a pure-R update. Users never lose parallel capability; they get a better version of it via r-universe.

Design spec: TODO.md. Rust coding conventions: RUST-STYLE.md.

Use cases by tier

  1. Local read with parallel decompression — r-universe (zarrs); CRAN (R-native sequential)
  2. HTTPS read — CRAN: R-native sequential; r-universe: zarrs parallel + pooled
  3. S3/GCS read+write — r-universe only (object_store)
  4. Local write — r-universe (zarrs); CRAN (R-native)
  5. zarr-python codec compatibility — full on r-universe; gzip+blosc on CRAN (R-native codecs)

Implementation

Phase 1 — scaffolding and metadata:

  • extendr setup, Cargo.toml targeting zarrs 0.23.x, Makevars, configure script
  • tools/cran-build.sh for CRAN tarball production
  • .onLoad availability probe (is_zarrs_available())
  • Store$get_store_identifier() for dispatch
  • Rust functions: zarrs_compiled_features, zarrs_runtime_info, zarrs_set_codec_concurrent_target, zarrs_open_array_metadata, zarrs_node_exists, zarrs_close_store

Phase 2 — read path:

  • zarrs_retrieve_subset with two-step dtype dispatch (retrieve as stored type, widen in Rust)
  • selection_to_ranges() bridge from pizzarr indexers to zarrs ArraySubset ranges
  • Dispatch from ZarrArray$get_item; benchmark against R-native path

Phase 3 — write path:

  • zarrs_store_subset, zarrs_create_array with codec presets (gzip/blosc/zstd)
  • Round-trip read-write tests on local stores

Phase 4 — r-universe and remote stores:

  • r-universe CI for --features full,s3,gcs
  • HTTP reads via zarrs_http and object_store; S3 reads against public bucket
  • Publish r-universe binaries

Phase 5 — simplify R-native:

  • Strip get_parallel_settings(), part1/part2, parallel Suggests, HttpStore workaround
  • Collapse R-native chunk loop to lapply
  • Updated vignettes, install guide, migration notes
  • Ship to CRAN as pure-R update

Tradeoffs

Two code paths permanently, by design. Two distribution tiers (CRAN = pure R, r-universe = zarrs binaries). No Rust on CRAN until macOS toolchain catches up. zarrs is under active development and the extendr bridge needs to stay thin to absorb upstream changes.
