GH-3226: Improve performance of InternalParquetRecordReader (1%)#3227
GH-3226: Improve performance of InternalParquetRecordReader (1%)#3227jerolba wants to merge 1 commit intoapache:masterfrom
Conversation
and use an ad-hoc Long Iterator
|
I ran your benchmark project locally, but actually got an opposite conclusion |
|
Yes, as I said in he PR, it's sensitive to the context. I was unable to obtain deterministic results. Even running the same test twice consecutively produced different execution times. After multiple executions, the |
|
I ran the benchmark using different JDKs, it looks like your assertion is almost true in the lower versions of JDK (I tested 8, 17), but in new versions(I tested 21), long-stream is a little bit faster than long-iterator. So I believe that new JDKs have some optimization for cases you mentioned, if so, we should keep the code as-is for future-proofing. |
Avoid LongStream reading files and use an ad-hoc Long Iterator
Rationale for this change
Profiling the load of a Parquet file with Java Mission Control, I've noticed that InternalParquetRecordReader LongStream consumes relevant amount of time:
This LongStream can be replaced with a simpler Long Iterator that iterates from 0 to pages.getRowCount().
To measure the overhead I've created a test project that overwrites
InternalParquetRecordReaderclass using a Long Iterator (the same change than proposed in the PR)The execution time is sensitive to the context of the JVM, but running the benchmark multiple times shows that LongStream is slower than LongIterator, between 1% and 4% depending on the run.
What changes are included in this PR?
A new
LongIteratorthat implementsPrimitiveIterator.OfLongand replaces aLongStream.range(0, pages.getRowCount()).iterator()Are these changes tested?
Not directly, but it's covered by existing tests
Are there any user-facing changes?
No
Closes #3226