
Yet another labels-to-Child lookup optimization #460


Closed · 3 commits

Conversation

@njhill njhill commented Feb 23, 2019

This is an optimization of the SimpleCollector.labels(...) lookups with a similar goal to #445 and #459.

It has some things in common with those PRs (including overridden fixed-args versions) but aims to provide the best of all worlds: zero garbage and higher throughput for all label counts, without any reliance on thread reuse.

To achieve this, ConcurrentHashMap is abandoned in favour of a custom copy-on-write linear-probe hashtable.
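The "overridden fixed-args versions" point can be illustrated with a minimal sketch (hypothetical names, not the PR's actual code): a varargs call allocates a `String[]` at every call site, while a fixed-arity overload is resolved at compile time and allocates nothing on the hot path.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration of fixed-args overloads avoiding the
 * varargs array allocation on lookups.
 */
final class FixedArgsLookup {
    private final String[] storedKey = {"GET", "/api"};

    // Varargs version: the compiler builds a new String[] for each call.
    boolean matches(String... values) {
        return Arrays.equals(storedKey, values);
    }

    // Two-arg overload: preferred by the compiler for two-argument calls,
    // so no array is created on the hot path.
    boolean matches(String v1, String v2) {
        return storedKey[0].equals(v1) && storedKey[1].equals(v2);
    }
}
```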

Benchmark results:

Before:

Benchmark     Mode  Cnt         Score         Error  Units
baseline     thrpt   20  84731357.558 ±  535745.023  ops/s
oneLabel     thrpt   20  36415789.294 ±  441116.974  ops/s
twoLabels    thrpt   20  33301282.259 ±  313669.132  ops/s
threeLabels  thrpt   20  24560630.904 ± 2247040.286  ops/s
fourLabels   thrpt   20  24424456.896 ±  288989.596  ops/s
fiveLabels   thrpt   20  18356036.944 ±  949244.712  ops/s

After:

Benchmark     Mode  Cnt         Score         Error  Units
baseline     thrpt   20  84866162.495 ±  823753.503  ops/s
oneLabel     thrpt   20  84554174.645 ±  804735.949  ops/s
twoLabels    thrpt   20  85004332.529 ±  689559.035  ops/s
threeLabels  thrpt   20  73395533.440 ± 3022384.940  ops/s
fourLabels   thrpt   20  68736143.734 ± 1872048.923  ops/s
fiveLabels   thrpt   20  53482207.003 ±  488751.990  ops/s

This benchmark, like the prior ones, only tests with a single sequence of labels for each count. It would be good to extend it to cover cases where the map is populated with a larger number of children.

@njhill force-pushed the label-speedup branch 3 times, most recently from 54e2e97 to 8ab5802, on February 23, 2019
@brian-brazil
Contributor

Thanks for the PR. I think a custom hash table implementation is going a bit far, I'd prefer not to have to maintain that. Could improvements be made to the JVM's implementations that everyone would benefit from?

@njhill
Author

njhill commented Feb 24, 2019

@brian-brazil conceptually the implementation is really quite simple; it just happens to be tailored to the specific requirements here, including not having to create a new object purely for hash/equals comparison, and not caring about modification performance relative to lookups. The Java SDK includes various data structure implementations (such as ConcurrentHashMap), but these are aimed at a broad variety of use cases and workloads, both in terms of which classes are included and how those classes are implemented.

For this reason there's a wealth of third-party utility libraries which fill in gaps and cater to more specific use cases, for example the excellent Eclipse Collections and JCTools libraries.

Impl-wise, what's done here is essentially a combination of the JDK's IdentityHashMap and CopyOnWriteArrayList. I wrote it quite quickly and can push an update to structure it in a more readable way. It's also possible to de-duplicate much of the similar logic in the various labels(...) methods, but it's not easy to do so without sacrificing some performance, so I'm still looking at that.

Given some of the reasons metrics are collected in the first place, I expect it's commonly done in performance sensitive contexts. Hence imho a first-order goal of a client library such as this should be minimal runtime overhead. This change eliminates all garbage on individual events and improves throughput by 2-3x.

Maintenance-wise I really don't feel it would be much of an issue given the simplicity of the data structure and the fact that this project is open source with many eyes on it.

@franz1981

@brian-brazil I can close my #445 in favour of this one from @njhill 👍
I think his approach is much better!
And about

For this reason there's a wealth of third-party utility libraries which fill in gaps and cater to more specific use cases, for example the excellent Eclipse Collections and JCTools libraries.

I vote for JCTools, but I'm just biased towards it :)

@brian-brazil
Contributor

For this reason there's a wealth of third-party utility libraries which fill in gaps and cater to more specific use cases, for example the excellent Eclipse Collections and JCTools libraries.

Dependencies are purposefully avoided in this library, to avoid issues for projects that include it.

Maintenance-wise I really don't feel it would be much of an issue given the simplicity of the data structure and the fact that this project is open source with many eyes on it.

That doesn't mean that the implementation is correct. I prefer the other proposed approaches.

@njhill
Author

njhill commented Feb 26, 2019

Thanks @brian-brazil @franz1981

Dependencies are purposefully avoided in this library, to avoid issues for projects that include it.

This makes a lot of sense, although if there were one that would be really useful, you could still consider shading? But I wasn't actually suggesting any of those would necessarily be applicable here, just giving examples of how there are many specialized implementations for different use cases that would never be added to the JDK itself.

That doesn't mean that the implementation is correct.

Could you elaborate on what you mean by correct here? I did actually find a small bug and have pushed another commit which fixes it, includes some other refinements, and adds a bunch of comments explaining that "big" method. Given your comments, I was still thinking to restructure it a bit further so that the "map" impl is better encapsulated/separated from the existing logic.

I prefer the other proposed approaches.

I'd be interested to understand the reasoning behind this. This PR isn't much different to those in terms of amount of additional code. If you don't care about performance then why would you make any change at all? If you do care about performance then why would you not go with this approach that outperforms the others considerably (and eliminates all allocations on the hot path)?

FWIW, here are new benchmark results for the latest update. I think the aforementioned hash "fix" may have actually improved things even more (note I ran on a machine with less noise this time, so the baselines are higher too).

Before:

Benchmark     Mode  Cnt         Score        Error  Units
baseline     thrpt   20  91140429.290 ± 608950.114  ops/s
oneLabel     thrpt   20  39409899.908 ± 836615.183  ops/s
twoLabels    thrpt   20  35696861.007 ± 480466.417  ops/s
threeLabels  thrpt   20  28408385.256 ± 461281.284  ops/s
fourLabels   thrpt   20  26435543.820 ± 193723.507  ops/s
fiveLabels   thrpt   20  24995175.241 ± 996512.887  ops/s

After:

Benchmark     Mode  Cnt         Score        Error  Units
baseline     thrpt   20  91230169.066 ± 519996.338  ops/s
oneLabel     thrpt   20  90727727.142 ± 765178.475  ops/s
twoLabels    thrpt   20  90294849.112 ± 939603.445  ops/s
threeLabels  thrpt   20  81616654.625 ± 895278.274  ops/s
fourLabels   thrpt   20  75852544.377 ± 972598.185  ops/s
fiveLabels   thrpt   20  58665529.189 ± 443240.792  ops/s

@njhill
Author

njhill commented Feb 26, 2019

For reference I ran the same benchmark on the other proposed impls:

#445:

Benchmark     Mode  Cnt         Score        Error  Units
baseline     thrpt   20  90736100.887 ± 297716.947  ops/s
oneLabel     thrpt   20  40977375.778 ± 836712.718  ops/s
twoLabels    thrpt   20  28164696.062 ± 511641.294  ops/s
threeLabels  thrpt   20  24182863.357 ± 175046.611  ops/s
fourLabels   thrpt   20  20844796.684 ± 400396.972  ops/s
fiveLabels   thrpt   20  19438768.621 ± 248130.659  ops/s

#459:

Benchmark     Mode  Cnt         Score         Error  Units
baseline     thrpt   20  90504144.191 ± 1150495.470  ops/s
oneLabel     thrpt   20  88479586.198 ±  520336.150  ops/s
twoLabels    thrpt   20  65928763.142 ±  464774.439  ops/s
threeLabels  thrpt   20  62229818.280 ±  662305.912  ops/s
fourLabels   thrpt   20  55868210.897 ±  524420.109  ops/s
fiveLabels   thrpt   20  22860553.903 ± 1128285.139  ops/s

@njhill
Author

njhill commented Feb 26, 2019

And for completeness, here are the allocations from the GC profile:

Current:

Benchmark                                      Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                  thrpt    5        48.000            B/op
twoLabels:·gc.alloc.rate.norm                 thrpt    5        48.000            B/op
threeLabels:·gc.alloc.rate.norm               thrpt    5        88.000            B/op
fourLabels:·gc.alloc.rate.norm                thrpt    5        88.000            B/op
fiveLabels:·gc.alloc.rate.norm                thrpt    5        96.000            B/op

This PR:

Benchmark                                     Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                 thrpt    5        ≈ 10⁻⁵            B/op
twoLabels:·gc.alloc.rate.norm                thrpt    5        ≈ 10⁻⁶            B/op
threeLabels:·gc.alloc.rate.norm              thrpt    5        ≈ 10⁻⁵            B/op
fourLabels:·gc.alloc.rate.norm               thrpt    5        ≈ 10⁻⁵            B/op
fiveLabels:·gc.alloc.rate.norm               thrpt    5        40.000            B/op

#445:

Benchmark                                      Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                  thrpt    5        ≈ 10⁻⁵            B/op
twoLabels:·gc.alloc.rate.norm                 thrpt    5        24.002            B/op
threeLabels:·gc.alloc.rate.norm               thrpt    5        32.002            B/op
fourLabels:·gc.alloc.rate.norm                thrpt    5        32.003            B/op
fiveLabels:·gc.alloc.rate.norm                thrpt    5        40.003            B/op

#459:

Benchmark                                      Mode  Cnt         Score   Error   Units
oneLabel:·gc.alloc.rate.norm                  thrpt    5        16.000            B/op
twoLabels:·gc.alloc.rate.norm                 thrpt    5        24.000            B/op
threeLabels:·gc.alloc.rate.norm               thrpt    5        24.000            B/op
fourLabels:·gc.alloc.rate.norm                thrpt    5        32.000            B/op
fiveLabels:·gc.alloc.rate.norm                thrpt    5        64.000            B/op

@brian-brazil
Contributor

I'd be interested to understand the reasoning behind this.

They don't involve implementing a bespoke data structure, so should be more maintainable and more likely to be correct.

@njhill
Author

njhill commented Feb 26, 2019

@brian-brazil you sure drive a hard bargain :)

As data structures go, it's about as simple as it gets. My code may be obfuscating this a bit, but we are just putting the labels/child pairs into an array instead of a map. Lookups are just a straight loop over it; the hashcode of the labels is used to decide where to start looking from. The copy-on-write aspect just means that any time we are going to modify the array, we copy it first and replace the whole thing with the modified copy. So again, in terms of maintenance overhead, I don't see how this would be much different from the other "specialized" logic being considered.
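That description can be sketched roughly as follows (hypothetical names, simplified concurrency, and no resizing — real code would grow the array when it fills; this is not the PR's actual code):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of a copy-on-write linear-probe table: label/child pairs live in a
 * flat array, probing starts at the hash of the label values, and any
 * modification clones the array before publishing it.
 */
final class CowTable<V> {
    // Alternating slots: even index = String[] label values (the key), odd = child.
    private final AtomicReference<Object[]> table = new AtomicReference<>(new Object[16]);

    @SuppressWarnings("unchecked")
    V get(String... labels) {
        Object[] t = table.get();
        int mask = (t.length >> 1) - 1;
        for (int i = Arrays.hashCode(labels) & mask; ; i = (i + 1) & mask) {
            Object key = t[i << 1];
            if (key == null) return null;                // empty slot: not present
            if (Arrays.equals((String[]) key, labels)) { // compare without allocating a key object
                return (V) t[(i << 1) + 1];
            }
        }
    }

    synchronized void put(String[] labels, V child) {
        Object[] t = table.get().clone();                // copy...
        int mask = (t.length >> 1) - 1;
        for (int i = Arrays.hashCode(labels) & mask; ; i = (i + 1) & mask) {
            Object key = t[i << 1];
            if (key == null || Arrays.equals((String[]) key, labels)) {
                t[i << 1] = labels;
                t[(i << 1) + 1] = child;
                break;
            }
        }
        table.set(t);                                    // ...then publish the modified copy
    }
}
```

Readers see a consistent snapshot without locking; only writers pay the copy cost, which suits a workload where children are created rarely but looked up constantly.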

A few more questions (and thank you for continuing to humour me!):

  • Would it help if I restructured the code some more to make the logic more explicit/verbose?
  • Would it help to abstract the map interface so that either a CHM or the copy-on-write version could be used without affecting the rest of the existing logic? Then the "low overhead" version could be switched on via a system property, for example, while not being used by default; or, if you did not want the code in there at all, the system property could allow specifying a custom implementation from a separate library.
  • Would it be worth soliciting others in the user/dev community to see if there's a strong appetite for this kind of improvement (maybe on the mailing list)?
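The second option above could look roughly like this (all names, including the `prometheus.childMapImpl` property, are invented for illustration):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical factory for a pluggable child map: the default implementation
 * is used unless a system property points at another one.
 */
final class ChildMaps {
    @SuppressWarnings("unchecked")
    static <V> ConcurrentMap<List<String>, V> create() {
        String impl = System.getProperty("prometheus.childMapImpl");
        if (impl == null) {
            return new ConcurrentHashMap<>();  // default: current behaviour
        }
        try {
            // e.g. -Dprometheus.childMapImpl=com.example.CowChildMap
            return (ConcurrentMap<List<String>, V>)
                    Class.forName(impl).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load child map impl: " + impl, e);
        }
    }
}
```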

brian-brazil added a commit that referenced this pull request Jun 19, 2019
This is an optimization of the SimpleCollector.labels(...) lookups with a similar goal to prometheus#445 and prometheus#459.

Signed-off-by: nickhill <[email protected]>
@brian-brazil
Contributor

Closing in favour of #486

njhill added a commit to njhill/client_java that referenced this pull request Nov 22, 2019
prometheus#460 proposed an optimized zero-GC version of the child lookup logic
which was deemed too specialized for inclusion in the library.

Subjectively less complex alternatives were also proposed which provide
some but not as much performance/garbage improvement, and rely for
example on some per-thread overhead.

This PR aims to add a minimally-invasive mechanism to allow users to
plug in an implementation of their choice, so that performance sensitive
consumers can opt for minimal overhead without the core library having
to include the corresponding code.

Signed-off-by: nickhill <[email protected]>