-
Notifications
You must be signed in to change notification settings - Fork 3.9k
aggmetric: acquire lock in label config evaluation in SQLMetric #147486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this. I gave some drive-by comments since I saw the PR pop up, but recognize it is a draft, so take them or leave them.
Reviewable status:
complete! 0 of 0 LGTMs obtained
-- commits
line 9 at r1:
Perhaps it is because I don't understand the code well, but this doesn't clearly convey to me why the past code as inadequate. I'm assuming it is because the getOrAddChild calls somehow depended on the label config which could change underneath us.
pkg/util/metric/aggmetric/agg_metric.go
line 247 at r1 (raw file):
// getOrAddChild returns the child metric for the given label values. If the child // doesn't exist, it creates a new one and adds it to the collection. func (sm *SQLMetric) getOrAddChild(f createChildMetricFunc, labelValues ...string) ChildMetric {
I would consider changing the name of this to getOrAddChildMuLocked and/or adding a comment saying it expects sm.mu to be held.
pkg/util/metric/aggmetric/agg_metric.go
line 273 at r1 (raw file):
var childMetric ChildMetric switch sm.labelConfig.Load() {
Looking at the existing code now, I'm not sure why this needs to be an atomic (is it for testing?). From what I can see the production code will always be holding sm.mu when mutating this value anyway, so perhaps we should put labelConfig in the mu struct and simplify the code?
71a73cd
to
63f161d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)
Previously, stevendanna (Steven Danna) wrote…
Perhaps it is because I don't understand the code well, but this doesn't clearly convey to me why the past code as inadequate. I'm assuming it is because the getOrAddChild calls somehow depended on the label config which could change underneath us.
Thanks for pointing it out! I have updated commit message with timeline.
pkg/util/metric/aggmetric/agg_metric.go
line 247 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I would consider changing the name of this to getOrAddChildMuLocked and/or adding a comment saying it expects sm.mu to be held.
updated the comment.
pkg/util/metric/aggmetric/agg_metric.go
line 273 at r1 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Looking at the existing code now, I'm not sure why this needs to be an atomic (is it for testing?). From what I can see the production code will always be holding sm.mu when mutating this value anyway, so perhaps we should put labelConfig in the mu struct and simplify the code?
In earlier iterations of feature development, we had thought of reducing scope of the lock and updates on labelConfig should be atomic. However, it is not adding much of a value to keep it as atomic and outside the mu struct. I have updated it now to keep it as part of mu struct.
63f161d
to
d0f80f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick turnaround. This PR will need a test that fails without the change and passes afterwards.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, @kyle-a-wong, and @stevendanna)
d0f80f0
to
a416cd7
Compare
a416cd7
to
309ece7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, @kyle-a-wong, and @stevendanna)
-- commits
line 8 at r4:
This paragraph could use another edit. I think it's important to clarify where the race condition is, more specifically. What are the two pieces of data that are not in sync?
The issue is that the code expects labelConfig
on the SQLMetric
to always match up with the length of the labelValuesSlice
within each child. This is not guaranteed because we don't lock the metric object with its map during modification of labelConfig
. This manifests in the implementation of Each
because the code makes assumptions about the length of labelValuesSlice
on every child based on the value of labelConfig
on the parent.
I think there's also a bug that could manifest with just a single setting being set. There would be no panic, but you'd mistakenly export a db
label as an app
or vice versa.
pkg/util/metric/aggmetric/agg_metric_test.go
line 304 at r4 (raw file):
}) r.AddMetric(d) d.mu.labelConfig = uint64(metric.LabelConfigAppAndDB)
just take a look at the test code to make sure you don't need to lock anything. I know here there's only one goroutine accessing things.
309ece7
to
131730e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, @dhartunian, @kyle-a-wong, and @stevendanna)
pkg/util/metric/aggmetric/agg_metric_test.go
line 304 at r4 (raw file):
Previously, dhartunian (David Hartunian) wrote…
just take a look at the test code to make sure you don't need to lock anything. I know here there's only one goroutine accessing things.
I have cross checked tests. Scope of registry and metric declaration is in single goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @arjunmahishi, @dhartunian, and @kyle-a-wong)
pkg/util/metric/aggmetric/agg_metric.go
line 178 at r5 (raw file):
type SQLMetric struct { mu struct { labelConfig uint64
You can make this type a metric.LabelConfig
and remove the casting.
Previously, We are evaluating label config and then accordingly passing label values to `getOrAddChild` method in SQLMetric. `getOrAddChild` method acquires lock and then get/add child. This is inadequate because we are evaluating label config before invoking `getOrAddChild`. This resulted in below issue get/add child is happening inside the lock: P0,T0: initialise metrics with labelConfig as `LabelConfigApp`. P0,T1: increment SQL counter with 1 is invoked. P0,T2: `getChildByLabelConfig` method evaluates labelConfig as `LabelConfigApp`. P0,T3: invokes `getOrAddChild` method with just app as parameter. P1,T4: `ReinitialiseChildMetrics` acquires the lock, clears existing child metrics, updates labelConfig as `LabelConfigAppAndDB` and release lock. P0,T5: `getOrAddChild` acquires the lock, inserts the new child (c1) with app and release lock. P2,T6: scrape metrics invokes `Each` method inside the lock. It expects 2 label values for the child as latest labelConfig is LabelConfigAppAndDB and tries to fetch 2 label values app and db. However, child c1 has single value (app) which will throw an error. It is happening because methods on SQLMetric expects length of `labelValuesSlice` of `ChildMetric` should match according to `labelConfig` value. This contract is broken in `getOrAddChild` as we don't lock the metric object with its children map during modification of `labelConfig`. This is reflected in `Each`'s implementation, where the code assumes that every child's `labelValuesSlice` has a length consistent with the parent's `labelConfig`. To address this, this patch makes sure that we are evaluating LabelConfig and get/add child metric inside the same lock. Epic: None Fixes: cockroachdb#147475 Release note (bug fix): Concurrent invocation of child metric updates and metric reinitialisation will not result in error during scrape.
131730e
to
4517659
Compare
TFTR! |
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #147475: branch-release-25.2.1-rc. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Previously, We are evaluating label config and then accordingly passing label
values to
getOrAddChild
method in SQLMetric.getOrAddChild
method acquireslock and then get/add child. This is inadequate because we are evaluating
label config before invoking
getOrAddChild
. This resulted in below issueget/add child is happening inside the lock:
P0,T0: initialise metrics with labelConfig as
LabelConfigApp
.P0,T1: increment SQL counter with 1 is invoked.
P0,T2:
getChildByLabelConfig
method evaluates labelConfig asLabelConfigApp
.P0,T3: invokes
getOrAddChild
method with just app as parameter.P1,T4:
ReinitialiseChildMetrics
acquires the lock, clears existing childmetrics, updates labelConfig as
LabelConfigAppAndDB
and releaselock.
P0,T5:
getOrAddChild
acquires the lock, inserts the new child (c1) with appand release lock.
P2,T6: scrape metrics invokes
Each
method inside the lock. It expects 2label values for the child as latest labelConfig is LabelConfigAppAndDB
and tries to fetch 2 label values app and db. However, child c1 has
single value (app) which will throw an error.
It is happening because methods on SQLMetric expects length of
labelValuesSlice
ofChildMetric
should match according tolabelConfig
value. This contract is broken in
getOrAddChild
as we don't lock the metricobject with its children map during modification of
labelConfig
. This isreflected in
Each
's implementation, where the code assumes that every child'slabelValuesSlice
has a length consistent with the parent'slabelConfig
.To address this, this patch makes sure that we are evaluating LabelConfig and
get/add child metric inside the same lock.
Epic: None
Fixes: #147475
Release note (bug fix): Concurrent invocation of child metric updates and
metric reinitialisation will not result in error during scrape.