TsFile C++ batch read/write optimization #823

Open
ColinLeeo wants to merge 1 commit into develop from colin_all_opts

ColinLeeo commented May 26, 2026

Summary

Brings together batch decode infrastructure, multi-value aligned read, parallel page decode, columnar tablet write, and SIMD micro-optimizations from the long-lived final branch into a single review-ready change.

This PR is a code snapshot, not a replay of final's commit history: that history was a long sequence of WIP commits unfit for review, so the work is deliberately squashed into a single commit.

Supersedes #749, #754, #774.

Changes by area

Read path

  • Batch decode infrastructure. Decoder base class gains read_batch_int32/int64/float/double and skip_* APIs; PLAIN, TS2DIFF, and Gorilla decoders implement them. TS2DIFF exposes block-level peeking so time filters can skip whole blocks without decoding. Gorilla adds a raw-pointer GorillaBitReader that bypasses ByteStream overhead in the hot loop.
  • TsBlock-level batch read. ChunkReader / AlignedChunkReader add *_DECODE_TV_BATCH methods that decode time + value into a TsBlock in one pass and apply batch time filters before append.
  • Multi-value aligned read. AlignedChunkReader supports one time chunk + N value chunks decoded in a single pass, sharing decoded timestamps and the filter mask. SingleDeviceTsBlockReader auto-detects same-device measurements via VectorMeasurementColumnContext.
  • Parallel page decode (opt-in). When ENABLE_THREADS is set a DecodeThreadPool + BlockingQueue can decompress and predecode pages in parallel. Page-plan classification (SKIP / FULL_PASS / BOUNDARY) lets a scatter-free memcpy fast path fire when every row passes and no column has nulls.
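
The page-plan classification above can be illustrated with a minimal sketch. It assumes a closed time range [lo, hi] and a page whose minimum and maximum timestamps are known from its statistics. The names `PagePlan` and `classify_page` are hypothetical and are not taken from the PR, whose actual types and signatures may differ.

```cpp
#include <cstdint>

// Hypothetical names; the PR's actual types and signatures may differ.
enum class PagePlan { SKIP, FULL_PASS, BOUNDARY };

// Classify a page against a closed time range [lo, hi] using only the
// page's min/max timestamps.
inline PagePlan classify_page(int64_t page_min, int64_t page_max,
                              int64_t lo, int64_t hi) {
    if (page_max < lo || page_min > hi) {
        return PagePlan::SKIP;       // no row can match: skip decoding
    }
    if (page_min >= lo && page_max <= hi) {
        return PagePlan::FULL_PASS;  // every row matches: no per-row filter
    }
    return PagePlan::BOUNDARY;       // partial overlap: filter row by row
}
```

Under this reading, only a FULL_PASS page with no nulls in any column would qualify for the memcpy fast path; BOUNDARY pages still need a per-row mask.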

Write path

  • Batch write into pages. ValuePageWriter::write_batch / write_string_batch take timestamp + value + nullness arrays directly, removing the per-value append loop.
  • Columnar tablet. Tablet exposes set_timestamps, set_column_values, set_column_string_repeated, reset for bulk reuse, and switches StringColumn to an Arrow-compatible offset + buffer layout.
  • Batched bit-pack on TS2DIFF flush. TS2DIFFEncoder::flush packs all deltas with a single pack_bits_msb + write_buf instead of per-value write_bits, falling back to the scalar path for the rare bit_width > 56 case.
  • Statistics. Int64Statistic::update_batch adds NEON-accelerated min/max/sum.
  • Parallel write. g_config_value_.parallel_write_enabled_ controls whether TsFileWriter::write_table dispatches per-column ChunkWriter tasks to a thread pool. On by default.
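
To make the batched bit-pack idea concrete, here is a rough sketch of MSB-first packing through a 64-bit accumulator. `pack_bits_msb_sketch` is a hypothetical helper, not the PR's `pack_bits_msb`, whose signature is not shown here. The sketch also suggests one plausible reason for a `bit_width > 56` cutoff: up to 7 pending bits plus one value must fit in 64 bits. Whether the PR uses the same reasoning is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the PR's pack_bits_msb.
// Packs n values of bit_width bits each, most significant bit first.
// Requires 1 <= bit_width <= 56 so that up to 7 pending bits plus one
// value always fit in the 64-bit accumulator.
// Returns the number of bytes written to out.
inline size_t pack_bits_msb_sketch(const uint64_t* values, size_t n,
                                   int bit_width, uint8_t* out) {
    const uint64_t mask = (uint64_t(1) << bit_width) - 1;
    uint64_t acc = 0;    // pending bits live in the low `pending` bits
    int pending = 0;     // 0..7 between values
    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = (acc << bit_width) | (values[i] & mask);
        pending += bit_width;
        while (pending >= 8) {
            pending -= 8;
            // The cast keeps exactly the next 8 valid bits.
            out[written++] = static_cast<uint8_t>(acc >> pending);
        }
    }
    if (pending > 0) {
        // Left-align the trailing bits and zero-pad the final byte.
        out[written++] = static_cast<uint8_t>(acc << (8 - pending));
    }
    return written;
}
```

For example, packing the values 1, 2, 3 at 4 bits each produces the two bytes 0x12 and 0x30.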

Encoding / SIMD

  • TS2DIFF batch decode adds AVX2 helpers via SIMDe (already on develop since #755, "introduce simde as third-party dependency") for both i32 and i64; scalar fallback unchanged.
  • PLAIN byte-swap path uses ARM NEON (vrev64q_u8 / vrev32q_u8) when available, with __builtin_bswap as fallback.
  • cpp/CMakeLists.txt adds ENABLE_SIMD and turns on -O3 -march=native -flto in Release.
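
As a point of reference for the PLAIN byte-swap path, the following is a portable scalar sketch of batch-decoding 64-bit integers. It assumes values are stored big-endian, which is an assumption here and not stated in the PR. The function names are hypothetical, and the code stands in for neither the NEON path nor the `__builtin_bswap` fallback.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load one big-endian 64-bit value; independent of host byte order.
inline uint64_t load_be64(const uint8_t* p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        v = (v << 8) | p[i];
    }
    return v;
}

// Hypothetical batch decode: n big-endian int64 values from src into dst.
inline void read_batch_int64_be_sketch(const uint8_t* src, size_t n,
                                       int64_t* dst) {
    for (size_t i = 0; i < n; ++i) {
        const uint64_t u = load_be64(src + 8 * i);
        std::memcpy(&dst[i], &u, sizeof(u));  // bit-exact reinterpretation
    }
}
```

A SIMD version would produce the same results by reversing the bytes of several lanes at once.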

Allocator / ByteStream

  • ByteStream caches page_mask_ (= page_size − 1) so the hot path uses a bitmask instead of modulo; wrap_from rounds buffer sizes up to a power of two so the mask remains correct. total_size_ widened to uint64_t to support files > 4 GB.
  • UncompressedCompressor now copies its output instead of aliasing caller buffers, letting callers free the input safely.
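
The mask trick relies on the page size being a power of two, in which case `pos & (page_size - 1)` equals `pos % page_size`. A minimal sketch follows; the function names are hypothetical and are not ByteStream members.

```cpp
#include <cstdint>

// Round v up to the next power of two. Valid for 1 <= v <= 2^31.
inline uint32_t round_up_pow2(uint32_t v) {
    if (v <= 1) return 1;
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

// Offset of an absolute position inside its page.
// page_mask must be page_size - 1 with page_size a power of two.
inline uint32_t offset_in_page(uint64_t pos, uint32_t page_mask) {
    return static_cast<uint32_t>(pos & page_mask);
}
```

If the page size were not a power of two, the mask would silently yield wrong offsets, which is why rounding up in `wrap_from` matters.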

C wrapper / Arrow

  • Trimmed unused metadata-export surface (TsFileStatisticBase, TimeseriesMetadata, DeviceTimeseriesMetadataEntry, tag-filter handles) out of the public C API. Internal tag filtering is unaffected.
  • arrow_c.cc simplified: per-row offset handling for sliced variable-length arrays in place of the InvertArrowBitmap copy.
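
For context on sliced variable-length arrays: in the Arrow columnar format, a slice keeps the original buffers and records a logical `offset`, so row `i` of the slice spans `offsets[offset + i]` to `offsets[offset + i + 1]` in the data buffer. The sketch below shows that lookup with a hypothetical view struct. It is not the code in arrow_c.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical read-only view of an Arrow-style UTF-8 column slice.
struct StringColumnView {
    const int32_t* offsets;  // value offsets buffer (unsliced)
    const char* data;        // contiguous character data
    int64_t offset;          // logical start row of the slice
    int64_t length;          // number of rows in the slice
};

// Return row i of the slice (0 <= i < col.length).
inline std::string get_row(const StringColumnView& col, int64_t i) {
    const int64_t row = col.offset + i;
    const int32_t begin = col.offsets[row];
    const int32_t end = col.offsets[row + 1];
    return std::string(col.data + begin, static_cast<size_t>(end - begin));
}
```

Reading through the slice offset per row avoids materializing a shifted copy of the buffers.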

Correctness fixes uncovered while validating this PR

These are independent latent bugs that surfaced (via ASan with TestNullInTable4 / WriteDataWithEmptyField) while getting this change to green. Both fixes are small and defensive, with no behavior change on the happy path:

  • save_first_page_data double-free (all three ChunkWriters). get_cur_page_data() returns a shallow copy of PageData; save_first_page_data was caching that copy into first_page_data_ while the source PageWriter::cur_page_data_ kept the same compressed_buf_ / uncompressed_buf_ pointers. On destroy both holders called mem_free on the same allocation. Fixed by adding PageWriter::release_cur_page_data() (nulls source pointers without freeing) and calling it from each save_first_page_data after the copy.
  • SnappyCompressor / LZ4Compressor after_compress freed the wrong pointer. Both implementations did mem_free(compressed_buf_) (the cached member) instead of mem_free(compressed_buf) (the parameter the caller is releasing). When the same compressor is reused across pages, the member can lag behind the caller-known buffer; the wrong free would either nuke a still-live allocation or crash on mem_free(nullptr). Fixed to free the parameter and only null the member when it still matches.
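
The double-free and its fix can be modeled in a few lines. This is a simplified sketch: the `*Sketch` types are hypothetical stand-ins, and `std::malloc` / `std::free` stand in for the project's allocator. It shows only the ownership hand-off, where the source nulls its pointer after the shallow copy so that exactly one holder frees the buffer.

```cpp
#include <cstdlib>

// Simplified stand-in for PageData: a raw owning pointer.
struct PageDataSketch {
    void* compressed_buf;
    PageDataSketch() : compressed_buf(nullptr) {}
};

struct PageWriterSketch {
    PageDataSketch cur;
    ~PageWriterSketch() { std::free(cur.compressed_buf); }
    // Shallow copy: the returned object shares the same pointer.
    PageDataSketch get_cur_page_data() const { return cur; }
    // Give up ownership without freeing.
    void release_cur_page_data() { cur.compressed_buf = nullptr; }
};

struct ChunkWriterSketch {
    PageDataSketch first;
    ~ChunkWriterSketch() { std::free(first.compressed_buf); }
    void save_first_page_data(PageWriterSketch& pw) {
        first = pw.get_cur_page_data();
        // Without this call both destructors would free the same buffer.
        pw.release_cur_page_data();
    }
};
```

After `save_first_page_data`, the chunk writer is the sole owner and the page writer's destructor frees nothing.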

Tests / benchmarks

  • New tsfile_reader_table_batch_test.cc covers the TsBlock batch read path.
  • gorilla_codec_test.cc adds Int32BatchDecode / Int64BatchDecode / FloatBatchDecode tests.
  • examples/cpp_examples/bench_read.cpp + .h and examples/read_perf_compare/ for benchmarking.
  • Removed cwrapper_metadata_test.cc (covered the removed C metadata API) and common/path.cc (Path member bodies inlined into path.h).

Compatibility notes

  • All new C++ methods are additions — no existing C++ API was removed.
  • C wrapper headers lose the metadata export / tag filter symbols listed above. Downstream callers (notably the Python wrapper) should get a sanity check before merge.
  • cpp/third_party/ is intentionally left at develop's state so the recent MSVC compatibility fixes (WITH_STATIC_CRT OFF, CMP0054 NEW, CMAKE_POLICY_VERSION_MINIMUM=3.5, _MSC_VER guards) are preserved.

Verification

  • cmake configure + make -j on macOS arm64 (AppleClang, C++11) builds cleanly: libtsfile.2.2.1.dev.dylib and TsFile_Test both link, zero errors, only unused-lambda-capture warnings in pre-existing tests.
  • TsFile_Test full run on macOS arm64 (Release): 496/496 tests pass (~91 s).
  • A few tests show pre-existing ordering effects from shared test-fixture state (e.g. TsFileTableReaderTest.TestNullInTable3 fails when run alongside other TestNullInTable* cases but passes in the full suite). These are not regressions from this PR (they exist on both final and develop), but they are worth filing as a follow-up so isolated ctest -R runs are deterministic.

Test plan

  • Run TsFile_Test and confirm the existing suites still pass
  • Run new batch-read / batch-decode tests
  • Verify Python binding still loads and queries this libtsfile
  • Run the included bench_read against develop baseline; spot-check the throughput claims from #754 ("Read Opt.")
  • Cross-platform sanity (Linux + MSVC) once macOS review feedback is incorporated

🤖 Generated with Claude Code

Brings together batch decode infrastructure, multi-value aligned read,
parallel page decode, columnar tablet write, and SIMD micro-optimizations
from the long-lived `final` branch into a single review-ready change.

This change is a code snapshot, not a replay of `final` commit history --
the upstream history was a long sequence of WIP commits that wasn't
fit for review. Supersedes #749, #754, #774.

Read path
- Decoder base gains batch APIs (read_batch_int32/int64/float/double,
  skip_*); PLAIN, TS2DIFF, Gorilla decoders implement them. TS2DIFF
  has block-level peeking so time filters can skip blocks without
  decoding. Gorilla adds a raw-pointer GorillaBitReader that bypasses
  ByteStream overhead.
- ChunkReader / AlignedChunkReader add *_DECODE_TV_BATCH methods that
  decode time + value into a TsBlock in one pass, applying batch time
  filters before append.
- AlignedChunkReader supports a multi-value mode: one time chunk + N
  value chunks decoded in a single pass, sharing the decoded timestamps
  and filter mask. SingleDeviceTsBlockReader auto-detects same-device
  measurements via VectorMeasurementColumnContext.
- Optional page-level parallel decompression via a DecodeThreadPool +
  BlockingQueue when ENABLE_THREADS is set. Page-plan classification
  (SKIP / FULL_PASS / BOUNDARY) lets a scatter-free memcpy fast path
  fire when every row passes and no column has nulls.

Write path
- ValuePageWriter gains write_batch / write_string_batch that take
  timestamp+value+nullness arrays directly, removing the per-value
  append loop. Tablet exposes set_timestamps / set_column_values /
  set_column_string_repeated / reset for bulk reuse and switches
  StringColumn to an Arrow-compatible offset+buffer layout.
- TS2DIFFEncoder::flush now packs all deltas with a single
  pack_bits_msb + write_buf instead of per-value write_bits, falling
  back to the scalar path for the rare bit_width > 56 case.
- Int64Statistic::update_batch (NEON-accelerated min/max/sum).

Encoding / SIMD
- TS2DIFF batch decode adds AVX2 helpers via SIMDe (already on develop)
  for both i32 and i64; scalar fallback unchanged.
- PLAIN byte-swap path uses ARM NEON (vrev64q_u8 / vrev32q_u8) when
  available, falling back to __builtin_bswap.
- CMakeLists adds ENABLE_SIMD and turns on -O3 -march=native -flto in
  Release builds.

Allocator / ByteStream
- ByteStream caches page_mask_ (= page_size - 1) so the hot path uses
  a bitmask instead of modulo; wrap_from rounds buffer sizes up to a
  power of two so the mask remains correct. total_size_ widened to
  uint64_t to support files > 4GB.
- UncompressedCompressor now copies its output instead of aliasing
  caller buffers, letting callers free input safely.

C wrapper / Arrow
- Trimmed unused metadata-export surface (TsFileStatisticBase,
  TimeseriesMetadata, DeviceTimeseriesMetadataEntry, tag-filter handles)
  out of the public C API. Internal tag filtering is unaffected.
- arrow_c.cc simplified: per-row offset handling for sliced
  variable-length arrays in place of the InvertArrowBitmap copy.

Tests / benchmarks
- New tsfile_reader_table_batch_test.cc covers the TsBlock batch read
  path. gorilla_codec_test.cc adds Int32/Int64/Float batch decode
  tests. examples/cpp_examples adds bench_read.cpp/.h and an
  examples/read_perf_compare/ target.
- Removed cwrapper_metadata_test.cc and common/path.cc (Path bodies
  inlined into path.h; the C metadata API they covered is gone).

Compatibility
- All new C++ methods are additions; no existing C++ API was removed.
- C wrapper headers lost the metadata export / tag filter symbols
  listed above -- downstream callers (Python wrapper in particular)
  will want a sanity check before merge.
- cpp/third_party/ intentionally left at develop's state so the
  recent MSVC compatibility fixes (WITH_STATIC_CRT OFF, CMP0054 NEW,
  CMAKE_POLICY_VERSION_MINIMUM=3.5, _MSC_VER guards) are preserved.

Verification
- cmake configure + make -j on macOS arm64 (AppleClang, C++11) builds
  cleanly: libtsfile.2.2.1.dev.dylib and TsFile_Test both link, zero
  errors, only unused-lambda-capture warnings in pre-existing tests.
- Full TsFile_Test run and downstream Python binding load are left as
  pre-merge checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>