Faster true count using AVX2 and AVX512 instructions #6931
robert3005 merged 6 commits into develop
44dacba to 1acfa00
Merging this PR will degrade performance by 50.39%
Results on Zen 5 and on Zen 3 (no AVX-512) [plots not preserved] made me realise there's a discontinuity where the length threshold is reached, due to feature detection. Need to figure out a better structure.
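To make the shape of the problem concrete, here is a minimal sketch of a length-gated dispatch, not the PR's actual code. The threshold `SIMD_MIN_BYTES`, the function names, and the nibble-lookup AVX2 kernel are all assumptions for illustration. Inputs below the threshold never touch feature detection, so the first input at or above it pays the one-off detection cost, which would show up as a step at the threshold.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical cut-over point; the PR's real threshold is not shown in this thread.
const SIMD_MIN_BYTES: usize = 256;

/// Portable fallback: count set bits one byte at a time.
pub fn popcount_scalar(bytes: &[u8]) -> usize {
    bytes.iter().map(|b| b.count_ones() as usize).sum()
}

/// AVX2 kernel using the well-known nibble-lookup technique (an assumption,
/// not necessarily what the PR implements).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn popcount_avx2(bytes: &[u8]) -> usize {
    // Popcount of every 4-bit value, repeated in both 128-bit lanes.
    let lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    );
    let low_mask = _mm256_set1_epi8(0x0f);
    let zero = _mm256_setzero_si256();
    let mut total = _mm256_setzero_si256();
    let mut chunks = bytes.chunks_exact(32);
    for chunk in &mut chunks {
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        let lo = _mm256_and_si256(v, low_mask);
        let hi = _mm256_and_si256(_mm256_srli_epi16::<4>(v), low_mask);
        let per_byte = _mm256_add_epi8(
            _mm256_shuffle_epi8(lut, lo),
            _mm256_shuffle_epi8(lut, hi),
        );
        // Fold each group of 8 byte counters into a u64 lane so nothing overflows.
        total = _mm256_add_epi64(total, _mm256_sad_epu8(per_byte, zero));
    }
    let mut lanes = [0u64; 4];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, total);
    lanes.iter().sum::<u64>() as usize + popcount_scalar(chunks.remainder())
}

/// Length-gated dispatch: short inputs skip feature detection entirely.
pub fn popcount(bytes: &[u8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if bytes.len() >= SIMD_MIN_BYTES && is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { popcount_avx2(bytes) };
        }
    }
    popcount_scalar(bytes)
}
```

Calling `popcount(&buffer)` with a buffer shorter than `SIMD_MIN_BYTES` always takes the scalar path; longer buffers take the AVX2 path only when the CPU reports support.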
ba177b2 to f9e3725
Before we merge this, we need to figure out the slowdown due to feature detection.
Benchmarks: TPC-DS SF=1 on NVME
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (1.081x ➖, 0↑ 45↓)
datafusion / vortex-compact (0.931x ➖, 36↑ 5↓)
datafusion / parquet (1.065x ➖, 0↑ 20↓)
duckdb / vortex-file-compressed (0.967x ➖, 19↑ 5↓)
duckdb / vortex-compact (0.987x ➖, 4↑ 5↓)
duckdb / parquet (0.935x ➖, 29↑ 1↓)
duckdb / duckdb (0.943x ➖, 15↑ 0↓)
Benchmarks: PolarSignals Profiling
Summary
datafusion / vortex-file-compressed (1.012x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on NVME
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (0.898x ✅, 11↑ 0↓)
datafusion / vortex-compact (0.902x ➖, 13↑ 0↓)
datafusion / parquet (0.924x ➖, 5↑ 0↓)
datafusion / arrow (0.871x ✅, 17↑ 0↓)
duckdb / vortex-file-compressed (0.909x ➖, 6↑ 0↓)
duckdb / vortex-compact (0.924x ➖, 3↑ 0↓)
duckdb / parquet (0.956x ➖, 0↑ 0↓)
duckdb / duckdb (0.958x ➖, 1↑ 0↓)
Benchmarks: Clickbench on NVME
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (1.005x ➖, 1↑ 0↓)
datafusion / parquet (1.001x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.998x ➖, 0↑ 0↓)
duckdb / parquet (0.998x ➖, 0↑ 0↓)
duckdb / duckdb (1.040x ➖, 0↑ 5↓)
Benchmarks: TPC-H SF=1 on NVME
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (1.009x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.002x ➖, 0↑ 0↓)
datafusion / parquet (0.994x ➖, 1↑ 1↓)
datafusion / arrow (0.997x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.001x ➖, 0↑ 0↓)
duckdb / parquet (0.972x ➖, 5↑ 2↓)
duckdb / duckdb (1.003x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=1 on S3
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (1.168x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.096x ➖, 0↑ 5↓)
datafusion / parquet (1.021x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.977x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.962x ➖, 0↑ 0↓)
duckdb / parquet (0.977x ➖, 0↑ 0↓)
Benchmarks: FineWeb NVMe
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (0.958x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.987x ➖, 0↑ 0↓)
datafusion / parquet (0.920x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.975x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.953x ➖, 1↑ 0↓)
duckdb / parquet (0.966x ➖, 0↑ 0↓)
Benchmarks: FineWeb S3
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (0.989x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.035x ➖, 0↑ 0↓)
datafusion / parquet (1.050x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.035x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.968x ➖, 0↑ 0↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on S3
Summary
Verdict: No clear signal
Statistical Summary
datafusion / vortex-file-compressed (1.066x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.057x ➖, 1↑ 2↓)
datafusion / parquet (0.978x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.002x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.034x ➖, 0↑ 0↓)
duckdb / parquet (1.053x ➖, 0↑ 1↓)
Benchmarks: Statistical and Population Genetics
Summary
Verdict: No clear signal
Statistical Summary
duckdb / vortex-file-compressed (0.952x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.978x ➖, 0↑ 0↓)
duckdb / parquet (0.970x ➖, 0↑ 0↓)
f9e3725 to 970af97
Benchmarks: Random Access
Summary
unknown / unknown (0.901x ➖, 13↑ 0↓)
Benchmarks: Compression
Summary
unknown / unknown (0.960x ➖, 21↑ 0↓)
Signed-off-by: Robert Kruszewski <github@robertk.io>
970af97 to d3c062d
This is a sampling problem: the one slow run that triggers feature detection trips up the benchmark runner and causes it to take too few samples (100 instead of 51200), so the measurements are not comparable. I have added a call to x86 feature detection up front to remove that one slow run from the sampled runs, and we now get 51200 samples.
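The fix described above can be sketched as follows, assuming a hand-rolled sampling loop rather than the project's real benchmark harness. The names `warm_up_feature_detection` and `sample` are hypothetical; the only real API used is std's `is_x86_feature_detected!`, which caches its result after the first query, so calling it once before timing keeps the slow first call out of the samples.

```rust
use std::time::{Duration, Instant};

/// Query CPU features once so the one-off detection cost is paid before timing.
/// Returns (avx2, avx512f) purely for illustration.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn warm_up_feature_detection() -> (bool, bool) {
    (
        is_x86_feature_detected!("avx2"),
        is_x86_feature_detected!("avx512f"),
    )
}

/// On other architectures there is nothing to warm up.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn warm_up_feature_detection() -> (bool, bool) {
    (false, false)
}

/// Hypothetical sampling loop: warm up first, then time each call to `f`.
pub fn sample<F: FnMut()>(mut f: F, samples: usize) -> Vec<Duration> {
    warm_up_feature_detection();
    (0..samples)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}
```

With the warm-up in place, every timed call sees the cached detection result, so no single outlier should skew how many samples a runner decides to take.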
Add faster true count using AVX2 and AVX512 intrinsics.
True count happens a lot in our codebase, so it would definitely benefit from optimisations.
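For readers unfamiliar with the operation, a true count is the number of set bits in a packed boolean buffer. The sketch below is a plain scalar baseline under assumed conventions (LSB-first bit order, a byte slice plus a bit length); the function name and signature are illustrative and not taken from the codebase.

```rust
/// Count the set bits among the first `len` bits of `bytes`
/// (bits are assumed to be packed LSB-first within each byte).
pub fn true_count_scalar(bytes: &[u8], len: usize) -> usize {
    assert!(len <= bytes.len() * 8, "bit length exceeds buffer");
    let full_bytes = len / 8;
    let mut count = 0usize;

    // Process whole 64-bit words first; `count_ones` maps to a popcnt
    // instruction when the target supports it.
    let mut chunks = bytes[..full_bytes].chunks_exact(8);
    for chunk in &mut chunks {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        count += u64::from_le_bytes(word).count_ones() as usize;
    }

    // Then the leftover whole bytes.
    for &byte in chunks.remainder() {
        count += byte.count_ones() as usize;
    }

    // Finally the trailing partial byte, masking off bits past `len`.
    let tail_bits = len % 8;
    if tail_bits != 0 {
        let mask = (1u8 << tail_bits) - 1;
        count += (bytes[full_bytes] & mask).count_ones() as usize;
    }
    count
}
```

A SIMD version replaces the word loop with wider loads, but the handling of the trailing partial byte stays the same.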