GH-3226: Improve performance of InternalParquetRecordReader (1%) by jerolba · Pull Request #3227 · apache/parquet-java

jerolba · 2025-05-25T17:25:24Z

Avoid LongStream reading files and use an ad-hoc Long Iterator

Rationale for this change

Profiling the load of a Parquet file with Java Mission Control, I noticed that the LongStream used in InternalParquetRecordReader consumes a noticeable amount of time.

This LongStream can be replaced with a simpler Long Iterator that iterates from 0 to pages.getRowCount().

To measure the overhead, I created a test project that replaces the InternalParquetRecordReader class with a version that uses a Long Iterator (the same change as proposed in this PR).

The execution time is sensitive to the context of the JVM, but running the benchmark multiple times shows that LongStream is slower than LongIterator, between 1% and 4% depending on the run.
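For readers who want to look at the iteration cost in isolation, here is a minimal timing sketch. It is not the benchmark project mentioned above: it does not read a Parquet file, and the class name `RangeIterationTiming`, its helper methods, and the range size are all hypothetical choices made for illustration. A hand-rolled loop like this is also sensitive to JIT warm-up, so a harness such as JMH would give more trustworthy numbers.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;
import java.util.function.LongFunction;
import java.util.stream.LongStream;

/** Hypothetical harness: times draining a 0..n range from two iterator sources. */
final class RangeIterationTiming {

  /** Consumes the iterator and returns the sum, so the loop has an observable result. */
  static long drain(PrimitiveIterator.OfLong it) {
    long sum = 0;
    while (it.hasNext()) {
      sum += it.nextLong();
    }
    return sum;
  }

  /** Hand-written counterpart of LongStream.range(0, n).iterator(). */
  static PrimitiveIterator.OfLong counting(long n) {
    return new PrimitiveIterator.OfLong() {
      private long current = 0;

      @Override
      public boolean hasNext() {
        return current < n;
      }

      @Override
      public long nextLong() {
        if (current >= n) {
          throw new NoSuchElementException();
        }
        return current++;
      }
    };
  }

  /** Drains a fresh iterator per round and returns the elapsed milliseconds of each round. */
  static long[] time(LongFunction<PrimitiveIterator.OfLong> source, long n, int rounds) {
    long[] millis = new long[rounds];
    for (int i = 0; i < rounds; i++) {
      long start = System.nanoTime();
      long sum = drain(source.apply(n));
      millis[i] = (System.nanoTime() - start) / 1_000_000;
      // Sanity check: the sum of 0..n-1 is n * (n - 1) / 2.
      if (sum != n * (n - 1) / 2) {
        throw new IllegalStateException("unexpected sum " + sum);
      }
    }
    return millis;
  }

  public static void main(String[] args) {
    long n = 20_000_000L; // arbitrary size chosen for illustration
    System.out.println("stream   "
        + Arrays.toString(time(size -> LongStream.range(0, size).iterator(), n, 10)));
    System.out.println("iterator "
        + Arrays.toString(time(RangeIterationTiming::counting, n, 10)));
  }
}
```

Because the record reader does much more work per row than this loop, any difference seen here is likely to be far larger in relative terms than the 1% to 4% observed end to end.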

What changes are included in this PR?

A new LongIterator that implements PrimitiveIterator.OfLong and replaces a LongStream.range(0, pages.getRowCount()).iterator()
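As a rough illustration of the idea (not the exact class in the PR), an iterator of this kind could look like the sketch below. The name `RangeLongIterator` and its constructor are assumptions made for this example; only `PrimitiveIterator.OfLong` and its `hasNext()`/`nextLong()` methods come from the JDK.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

/** Hypothetical sketch: iterates 0, 1, ..., size - 1 without building a Stream pipeline. */
final class RangeLongIterator implements PrimitiveIterator.OfLong {
  private final long size;
  private long current = 0;

  RangeLongIterator(long size) {
    this.size = size;
  }

  @Override
  public boolean hasNext() {
    return current < size;
  }

  @Override
  public long nextLong() {
    if (current >= size) {
      throw new NoSuchElementException();
    }
    return current++;
  }
}
```

With a class like this, a call such as `LongStream.range(0, pages.getRowCount()).iterator()` could be swapped for `new RangeLongIterator(pages.getRowCount())`, since both expose the same `PrimitiveIterator.OfLong` interface to the caller.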

Are these changes tested?

Not directly, but they are covered by existing tests.

Are there any user-facing changes?

No

Closes #3226

pan3793 · 2025-06-03T07:08:54Z

I ran your benchmark project locally, but actually got the opposite result:

$ java -version
openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Zulu21.34+19-CA (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Zulu21.34+19-CA (build 21.0.3+9-LTS, mixed mode, sharing)

CPU: Intel i5-9500 (6) @ 4.400GHz
Ubuntu 24.04 Linux Kernel version 6.12.10

> Task :long-iterator:run
Using Long Iterator
Fetching file
Reading TripNarrow from parquet file...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
20405666 1733
20405666 1854
20405666 1852
20405666 1838
20405666 1846
20405666 1846
20405666 1847
20405666 1842
20405666 1843
20405666 1836
Total time 18337

> Task :long-stream:run
Using Long Stream
Reading TripNarrow from parquet file...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
20405666 1687
20405666 1783
20405666 1778
20405666 1790
20405666 1782
20405666 1803
20405666 1810
20405666 1778
20405666 1835
20405666 1816
Total time 17862

jerolba · 2025-06-03T07:55:47Z

Yes, as I said in the PR, it's sensitive to the context. I was unable to obtain deterministic results. Even running the same test twice consecutively produced different execution times. After multiple executions, the long-iterator version was faster more often than the long-stream version.
I decided to create the PR because from a theoretical perspective, Stream code is more computationally expensive than Iterator code, even when accounting for inlining and other JIT optimizations.

pan3793 · 2025-06-03T08:50:47Z

I ran the benchmark using different JDKs. Your assertion looks mostly true on older JDK versions (I tested 8 and 17), but on newer versions (I tested 21), long-stream is slightly faster than long-iterator. So I believe that newer JDKs have some optimization for the case you mentioned; if so, we should keep the code as-is for future-proofing.

Commit f6fd1d8: GH-3226: Avoid LongStream usage reading files and use an ad-hoc Long Iterator

jerolba wants to merge 1 commit into apache:master from jerolba:GH-3226_improve_internal_parquet_record_reader_performance
