
Yet another labels-to-Child lookup optimization #460


Closed · 3 commits

Conversation

@njhill njhill commented Feb 23, 2019

This is an optimization of the SimpleCollector.labels(...) lookups with a similar goal to #445 and #459.

It has some things in common with those PRs (including overridden fixed-args versions) but aims to provide the best of all worlds: zero garbage and higher throughput for all label counts, without any reliance on thread reuse.

To achieve this, ConcurrentHashMap is abandoned in favour of a custom copy-on-write linear-probe hashtable.
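The "overridden fixed-args versions" point can be illustrated with a minimal sketch (hypothetical names, not the PR's actual code): a varargs call allocates a `String[]` at every call site, while a fixed-arity overload is resolved at compile time and allocates nothing on the hot path.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration of fixed-args overloads avoiding the
 * varargs array allocation on lookups.
 */
final class FixedArgsLookup {
    private final String[] storedKey = {"GET", "/api"};

    // Varargs version: the compiler builds a new String[] for each call.
    boolean matches(String... values) {
        return Arrays.equals(storedKey, values);
    }

    // Two-arg overload: preferred by the compiler for two-argument calls,
    // so no array is created on the hot path.
    boolean matches(String v1, String v2) {
        return storedKey[0].equals(v1) && storedKey[1].equals(v2);
    }
}
```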

Benchmark results:

Before:

Benchmark     Mode  Cnt         Score         Error  Units
baseline     thrpt   20  84731357.558 ±  535745.023  ops/s
oneLabel     thrpt   20  36415789.294 ±  441116.974  ops/s
twoLabels    thrpt   20  33301282.259 ±  313669.132  ops/s
threeLabels  thrpt   20  24560630.904 ± 2247040.286  ops/s
fourLabels   thrpt   20  24424456.896 ±  288989.596  ops/s
fiveLabels   thrpt   20  18356036.944 ±  949244.712  ops/s

After:

Benchmark     Mode  Cnt         Score         Error  Units
baseline     thrpt   20  84866162.495 ±  823753.503  ops/s
oneLabel     thrpt   20  84554174.645 ±  804735.949  ops/s
twoLabels    thrpt   20  85004332.529 ±  689559.035  ops/s
threeLabels  thrpt   20  73395533.440 ± 3022384.940  ops/s
fourLabels   thrpt   20  68736143.734 ± 1872048.923  ops/s
fiveLabels   thrpt   20  53482207.003 ±  488751.990  ops/s

This benchmark, like the prior ones, only tests with a single sequence of labels for each count. It would be good to extend it to cover cases where the map is populated with a larger number of children.

@njhill force-pushed the label-speedup branch 3 times, most recently from 54e2e97 to 8ab5802, on February 23, 2019
@brian-brazil
Contributor

Thanks for the PR. I think a custom hash table implementation is going a bit far, I'd prefer not to have to maintain that. Could improvements be made to the JVM's implementations that everyone would benefit from?

@njhill
Author

njhill commented Feb 24, 2019

@brian-brazil conceptually the implementation is really quite simple; it just happens to be tailored to the specific requirements here, including not having to create a new object purely for hash/equals comparison, and not caring about modification performance relative to lookups. The Java SDK includes various data structure implementations (such as ConcurrentHashMap), but these are aimed at a broad variety of use cases and workloads, both in terms of which classes are included and how those classes are implemented.

For this reason there's a wealth of third-party utility libraries which fill in gaps and cater to more specific use cases, for example the excellent Eclipse Collections and JCTools libraries.

Impl-wise, what's done here is essentially a combination of the JDK's IdentityHashMap and CopyOnWriteArrayList. I wrote it quite quickly and can push an update to structure it in a more readable way. It's also possible to de-duplicate much of the similar logic in the various labels(...) methods, but it's not easy to do so without sacrificing some performance, so I'm still looking at that.

Given some of the reasons metrics are collected in the first place, I expect it's commonly done in performance sensitive contexts. Hence imho a first-order goal of a client library such as this should be minimal runtime overhead. This change eliminates all garbage on individual events and improves throughput by 2-3x.

Maintenance-wise I really don't feel it would be much of an issue given the simplicity of the data structure and the fact that this project is open source with many eyes on it.

@franz1981

@brian-brazil I can close my #445 in favour of this one from @njhill 👍
I think his approach is much better!
And about

For this reason there's a wealth of third-party utility libraries which fill in gaps and cater to more specific use cases, for example the excellent Eclipse Collections and JCTools libraries.

I vote for JCTools, but I'm just biased towards it :)

@brian-brazil
Contributor

For this reason there's a wealth of third-party utility libraries which fill in gaps and cater to more specific use cases, for example the excellent Eclipse Collections and JCTools libraries.

Dependencies are purposefully avoided in this library, to avoid issues for projects that include it.

Maintenance-wise I really don't feel it would be much of an issue given the simplicity of the data structure and the fact that this project is open source with many eyes on it.

That doesn't mean that the implementation is correct. I prefer the other proposed approaches.

@njhill
Author

njhill commented Feb 26, 2019

Thanks @brian-brazil @franz1981

Dependencies are purposefully avoided in this library, to avoid issues for projects that include it.

This makes a lot of sense, although if there were one that would be really useful, you could still consider shading? But I wasn't actually suggesting any of those would necessarily be applicable here, just giving examples of how there are many specialized implementations for different use cases that would never be added to the JDK itself.

That doesn't mean that the implementation is correct.

Could you elaborate on what you mean by correct here? I did actually find a small bug and have pushed another commit which fixes it, includes some other refinements, and adds a bunch of comments explaining that "big" method. Given your comments, I was still thinking to restructure it a bit further so that the "map" impl is better encapsulated/separated from the existing logic.

I prefer the other proposed approaches.

I'd be interested to understand the reasoning behind this. This PR isn't much different to those in terms of amount of additional code. If you don't care about performance then why would you make any change at all? If you do care about performance then why would you not go with this approach that outperforms the others considerably (and eliminates all allocations on the hot path)?

FWIW, here are new benchmark results for the latest update. I think the aforementioned hash "fix" may have actually improved things even more (note I ran on a machine with less noise this time, so the baselines are higher too).

Before:

Benchmark     Mode  Cnt         Score        Error  Units
baseline     thrpt   20  91140429.290 ± 608950.114  ops/s
oneLabel     thrpt   20  39409899.908 ± 836615.183  ops/s
twoLabels    thrpt   20  35696861.007 ± 480466.417  ops/s
threeLabels  thrpt   20  28408385.256 ± 461281.284  ops/s
fourLabels   thrpt   20  26435543.820 ± 193723.507  ops/s
fiveLabels   thrpt   20  24995175.241 ± 996512.887  ops/s

After:

Benchmark     Mode  Cnt         Score        Error  Units
baseline     thrpt   20  91230169.066 ± 519996.338  ops/s
oneLabel     thrpt   20  90727727.142 ± 765178.475  ops/s
twoLabels    thrpt   20  90294849.112 ± 939603.445  ops/s
threeLabels  thrpt   20  81616654.625 ± 895278.274  ops/s
fourLabels   thrpt   20  75852544.377 ± 972598.185  ops/s
fiveLabels   thrpt   20  58665529.189 ± 443240.792  ops/s

@njhill
Author

njhill commented Feb 26, 2019

For reference I ran the same benchmark on the other proposed impls:

#445:

Benchmark     Mode  Cnt         Score        Error  Units
baseline     thrpt   20  90736100.887 ± 297716.947  ops/s
oneLabel     thrpt   20  40977375.778 ± 836712.718  ops/s
twoLabels    thrpt   20  28164696.062 ± 511641.294  ops/s
threeLabels  thrpt   20  24182863.357 ± 175046.611  ops/s
fourLabels   thrpt   20  20844796.684 ± 400396.972  ops/s
fiveLabels   thrpt   20  19438768.621 ± 248130.659  ops/s

#459:

Benchmark     Mode  Cnt         Score         Error  Units
baseline     thrpt   20  90504144.191 ± 1150495.470  ops/s
oneLabel     thrpt   20  88479586.198 ±  520336.150  ops/s
twoLabels    thrpt   20  65928763.142 ±  464774.439  ops/s
threeLabels  thrpt   20  62229818.280 ±  662305.912  ops/s
fourLabels   thrpt   20  55868210.897 ±  524420.109  ops/s
fiveLabels   thrpt   20  22860553.903 ± 1128285.139  ops/s

@njhill
Author

njhill commented Feb 26, 2019

And for completeness, here are the allocations from the GC profile:

Current:

Benchmark                                      Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                  thrpt    5        48.000            B/op
twoLabels:·gc.alloc.rate.norm                 thrpt    5        48.000            B/op
threeLabels:·gc.alloc.rate.norm               thrpt    5        88.000            B/op
fourLabels:·gc.alloc.rate.norm                thrpt    5        88.000            B/op
fiveLabels:·gc.alloc.rate.norm                thrpt    5        96.000            B/op

This PR:

Benchmark                                     Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                 thrpt    5        ≈ 10⁻⁵            B/op
twoLabels:·gc.alloc.rate.norm                thrpt    5        ≈ 10⁻⁶            B/op
threeLabels:·gc.alloc.rate.norm              thrpt    5        ≈ 10⁻⁵            B/op
fourLabels:·gc.alloc.rate.norm               thrpt    5        ≈ 10⁻⁵            B/op
fiveLabels:·gc.alloc.rate.norm               thrpt    5        40.000            B/op

#445:

Benchmark                                      Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                  thrpt    5        ≈ 10⁻⁵            B/op
twoLabels:·gc.alloc.rate.norm                 thrpt    5        24.002            B/op
threeLabels:·gc.alloc.rate.norm               thrpt    5        32.002            B/op
fourLabels:·gc.alloc.rate.norm                thrpt    5        32.003            B/op
fiveLabels:·gc.alloc.rate.norm                thrpt    5        40.003            B/op

#459:

Benchmark                                      Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                  thrpt    5        16.000            B/op
twoLabels:·gc.alloc.rate.norm                 thrpt    5        24.000            B/op
threeLabels:·gc.alloc.rate.norm               thrpt    5        24.000            B/op
fourLabels:·gc.alloc.rate.norm                thrpt    5        32.000            B/op
fiveLabels:·gc.alloc.rate.norm                thrpt    5        64.000            B/op

@brian-brazil
Contributor

I'd be interested to understand the reasoning behind this.

They don't involve implementing a bespoke data structure, so should be more maintainable and more likely to be correct.

@njhill
Author

njhill commented Feb 26, 2019

@brian-brazil you sure drive a hard bargain :)

As data structures go, it's about as simple as it gets. My code may be obfuscating this a bit, but we are just putting the labels/child pairs into an array instead of a map. Lookups are just a straight loop over it; the hashcode of the labels is used to decide where to start looking from. The copy-on-write aspect just means that any time we are going to modify the array, we copy it first and replace the whole thing with the modified copy. So again, in terms of maintenance overhead, I don't see how this would be much different from the other "specialized" logic being considered.
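That description can be sketched roughly as follows (hypothetical names, simplified concurrency, and no resizing — real code would grow the array when it fills; this is not the PR's actual code):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of a copy-on-write linear-probe table: label/child pairs live in a
 * flat array, probing starts at the hash of the label values, and any
 * modification clones the array before publishing it.
 */
final class CowTable<V> {
    // Alternating slots: even index = String[] label values (the key), odd = child.
    private final AtomicReference<Object[]> table = new AtomicReference<>(new Object[16]);

    @SuppressWarnings("unchecked")
    V get(String... labels) {
        Object[] t = table.get();
        int mask = (t.length >> 1) - 1;
        for (int i = Arrays.hashCode(labels) & mask; ; i = (i + 1) & mask) {
            Object key = t[i << 1];
            if (key == null) return null;                // empty slot: not present
            if (Arrays.equals((String[]) key, labels)) { // compare without allocating a key object
                return (V) t[(i << 1) + 1];
            }
        }
    }

    synchronized void put(String[] labels, V child) {
        Object[] t = table.get().clone();                // copy...
        int mask = (t.length >> 1) - 1;
        for (int i = Arrays.hashCode(labels) & mask; ; i = (i + 1) & mask) {
            Object key = t[i << 1];
            if (key == null || Arrays.equals((String[]) key, labels)) {
                t[i << 1] = labels;
                t[(i << 1) + 1] = child;
                break;
            }
        }
        table.set(t);                                    // ...then publish the modified copy
    }
}
```

Readers see a consistent snapshot without locking; only writers pay the copy cost, which suits a workload where children are created rarely but looked up constantly.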

A few more questions (and thank you for continuing to humour me!):

  • Would it help if I restructured the code some more to make the logic more explicit/verbose?
  • Would it help to abstract the map interface so that either a CHM or the copy-on-write version could be used without affecting the rest of the existing logic? Then the "low overhead" version could be switched on via a system property, for example, while not being used by default; or, if you did not want the code in there at all, the system property could allow specifying a custom implementation from a separate library.
  • Would it be worth soliciting others in the user/dev community to see if there's a strong appetite for this kind of improvement (maybe on the mailing list)?
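The second option above could look roughly like this (all names, including the `prometheus.childMapImpl` property, are invented for illustration):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical factory for a pluggable child map: the default implementation
 * is used unless a system property points at another one.
 */
final class ChildMaps {
    @SuppressWarnings("unchecked")
    static <V> ConcurrentMap<List<String>, V> create() {
        String impl = System.getProperty("prometheus.childMapImpl");
        if (impl == null) {
            return new ConcurrentHashMap<>();  // default: current behaviour
        }
        try {
            // e.g. -Dprometheus.childMapImpl=com.example.CowChildMap
            return (ConcurrentMap<List<String>, V>)
                    Class.forName(impl).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load child map impl: " + impl, e);
        }
    }
}
```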

brian-brazil added a commit that referenced this pull request Jun 19, 2019
This is an optimization of the SimpleCollector.labels(...) lookups with a similar goal to prometheus#445 and prometheus#459.

Signed-off-by: nickhill <[email protected]>
@brian-brazil
Contributor

Closing in favour of #486

njhill added a commit to njhill/client_java that referenced this pull request Nov 22, 2019
prometheus#460 proposed an optimized zero-GC version of the child lookup logic
which was deemed too specialized for inclusion in the library.

Subjectively less complex alternatives were also proposed which provide
some but not as much performance/garbage improvement, and rely for
example on some per-thread overhead.

This PR aims to add a minimally-invasive mechanism to allow users to
plug in an implementation of their choice, so that performance sensitive
consumers can opt for minimal overhead without the core library having
to include the corresponding code.

Signed-off-by: nickhill <[email protected]>