Skip to content

aggmetric: acquire lock in label config evaluation in SQLMetric #147486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 30, 2025

Conversation

aa-joshi
Copy link
Contributor

@aa-joshi aa-joshi commented May 29, 2025

Previously, We are evaluating label config and then accordingly passing label
values to getOrAddChild method in SQLMetric. getOrAddChild method acquires
lock and then get/add child. This is inadequate because we are evaluating
label config before invoking getOrAddChild. This resulted in below issue
get/add child is happening inside the lock:

P0,T0: initialise metrics with labelConfig as LabelConfigApp.
P0,T1: increment SQL counter with 1 is invoked.
P0,T2: getChildByLabelConfig method evaluates labelConfig as
LabelConfigApp.
P0,T3: invokes getOrAddChild method with just app as parameter.
P1,T4: ReinitialiseChildMetrics acquires the lock, clears existing child
metrics, updates labelConfig as LabelConfigAppAndDB and release
lock.
P0,T5: getOrAddChild acquires the lock, inserts the new child (c1) with app
and release lock.
P2,T6: scrape metrics invokes Each method inside the lock. It expects 2
label values for the child as latest labelConfig is LabelConfigAppAndDB
and tries to fetch 2 label values app and db. However, child c1 has
single value (app) which will throw an error.

It is happening because methods on SQLMetric expects length of
labelValuesSlice of ChildMetric should match according to labelConfig
value. This contract is broken in getOrAddChild as we don't lock the metric
object with its children map during modification of labelConfig. This is
reflected in Each's implementation, where the code assumes that every child's
labelValuesSlice has a length consistent with the parent's labelConfig.

To address this, this patch makes sure that we are evaluating LabelConfig and
get/add child metric inside the same lock.

Epic: None
Fixes: #147475
Release note (bug fix): Concurrent invocation of child metric updates and
metric reinitialisation will not result in error during scrape.

Copy link

blathers-crl bot commented May 29, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this. I gave some drive-by comments since I saw the PR pop up, but recognize it is a draft, so take them or leave them.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


-- commits line 9 at r1:
Perhaps it is because I don't understand the code well, but this doesn't clearly convey to me why the past code as inadequate. I'm assuming it is because the getOrAddChild calls somehow depended on the label config which could change underneath us.


pkg/util/metric/aggmetric/agg_metric.go line 247 at r1 (raw file):

// getOrAddChild returns the child metric for the given label values. If the child
// doesn't exist, it creates a new one and adds it to the collection.
func (sm *SQLMetric) getOrAddChild(f createChildMetricFunc, labelValues ...string) ChildMetric {

I would consider changing the name of this to getOrAddChildMuLocked and/or adding a comment saying it expects sm.mu to be held.


pkg/util/metric/aggmetric/agg_metric.go line 273 at r1 (raw file):

	var childMetric ChildMetric
	switch sm.labelConfig.Load() {

Looking at the existing code now, I'm not sure why this needs to be an atomic (is it for testing?). From what I can see the production code will always be holding sm.mu when mutating this value anyway, so perhaps we should put labelConfig in the mu struct and simplify the code?

@aa-joshi aa-joshi force-pushed the reinitialise_bug branch from 71a73cd to 63f161d Compare May 29, 2025 12:12
@aa-joshi aa-joshi requested a review from stevendanna May 29, 2025 12:21
Copy link
Contributor Author

@aa-joshi aa-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


-- commits line 9 at r1:

Previously, stevendanna (Steven Danna) wrote…

Perhaps it is because I don't understand the code well, but this doesn't clearly convey to me why the past code as inadequate. I'm assuming it is because the getOrAddChild calls somehow depended on the label config which could change underneath us.

Thanks for pointing it out! I have updated commit message with timeline.


pkg/util/metric/aggmetric/agg_metric.go line 247 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I would consider changing the name of this to getOrAddChildMuLocked and/or adding a comment saying it expects sm.mu to be held.

updated the comment.


pkg/util/metric/aggmetric/agg_metric.go line 273 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Looking at the existing code now, I'm not sure why this needs to be an atomic (is it for testing?). From what I can see the production code will always be holding sm.mu when mutating this value anyway, so perhaps we should put labelConfig in the mu struct and simplify the code?

In earlier iterations of feature development, we had thought of reducing scope of the lock and updates on labelConfig should be atomic. However, it is not adding much of a value to keep it as atomic and outside the mu struct. I have updated it now to keep it as part of mu struct.

@aa-joshi aa-joshi force-pushed the reinitialise_bug branch from 63f161d to d0f80f0 Compare May 29, 2025 12:27
@aa-joshi aa-joshi added branch-release-25.2 backport-25.2.1-rc Monday, 06/02: release-25.2.1-rc will be frozen labels May 29, 2025
@aa-joshi aa-joshi marked this pull request as ready for review May 29, 2025 12:38
@aa-joshi aa-joshi requested review from a team as code owners May 29, 2025 12:38
@aa-joshi aa-joshi requested review from kyle-a-wong, arjunmahishi, Abhinav1299 and dhartunian and removed request for a team May 29, 2025 12:38
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround. This PR will need a test that fails without the change and passes afterwards.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, @kyle-a-wong, and @stevendanna)

@aa-joshi aa-joshi force-pushed the reinitialise_bug branch from d0f80f0 to a416cd7 Compare May 29, 2025 13:47
@aa-joshi aa-joshi self-assigned this May 29, 2025
@aa-joshi aa-joshi force-pushed the reinitialise_bug branch from a416cd7 to 309ece7 Compare May 29, 2025 14:35
@aa-joshi aa-joshi requested a review from dhartunian May 29, 2025 14:35
@aa-joshi aa-joshi added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 29, 2025
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, @kyle-a-wong, and @stevendanna)


-- commits line 8 at r4:
This paragraph could use another edit. I think it's important to clarify where the race condition is, more specifically. What are the two pieces of data that are not in sync?

The issue is that the code expects labelConfig on the SQLMetric to always match up with the length of the labelValuesSlice within each child. This is not guaranteed because we don't lock the metric object with its map during modification of labelConfig. This manifests in the implementation of Each because the code makes assumptions about the length of labelValuesSlice on every child based on the value of labelConfig on the parent.

I think there's also a bug that could manifest with just a single setting being set. There would be no panic, but you'd mistakenly export a db label as an app or vice versa.


pkg/util/metric/aggmetric/agg_metric_test.go line 304 at r4 (raw file):

	})
	r.AddMetric(d)
	d.mu.labelConfig = uint64(metric.LabelConfigAppAndDB)

just take a look at the test code to make sure you don't need to lock anything. I know here there's only one goroutine accessing things.

@aa-joshi aa-joshi force-pushed the reinitialise_bug branch from 309ece7 to 131730e Compare May 30, 2025 04:39
@aa-joshi aa-joshi requested a review from dhartunian May 30, 2025 04:43
Copy link
Contributor Author

@aa-joshi aa-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Abhinav1299, @arjunmahishi, @dhartunian, @kyle-a-wong, and @stevendanna)


pkg/util/metric/aggmetric/agg_metric_test.go line 304 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

just take a look at the test code to make sure you don't need to lock anything. I know here there's only one goroutine accessing things.

I have cross checked tests. Scope of registry and metric declaration is in single goroutine.

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @arjunmahishi, @dhartunian, and @kyle-a-wong)


pkg/util/metric/aggmetric/agg_metric.go line 178 at r5 (raw file):

type SQLMetric struct {
	mu struct {
		labelConfig uint64

You can make this type a metric.LabelConfig and remove the casting.

Previously, We are evaluating label config and then accordingly passing label
values to `getOrAddChild` method in SQLMetric. `getOrAddChild` method acquires
lock and then get/add child. This is inadequate because we are evaluating
label config before invoking `getOrAddChild`. This resulted in below issue
get/add child is happening inside the lock:

 P0,T0: initialise metrics with labelConfig as `LabelConfigApp`.
 P0,T1: increment SQL counter with 1 is invoked.
 P0,T2: `getChildByLabelConfig` method evaluates labelConfig as
	`LabelConfigApp`.
 P0,T3: invokes `getOrAddChild` method with just app as parameter.
 P1,T4: `ReinitialiseChildMetrics` acquires the lock, clears existing child
	 metrics, updates labelConfig as `LabelConfigAppAndDB` and release
	 lock.
 P0,T5: `getOrAddChild` acquires the lock, inserts the new child (c1) with app
	 and release lock.
 P2,T6: scrape metrics invokes `Each` method inside the lock. It expects 2
	label values for the child as latest labelConfig is LabelConfigAppAndDB
	and tries to fetch 2 label values app and db. However, child c1 has
	single value (app) which will throw an error.

It is happening because methods on SQLMetric expects length of
`labelValuesSlice` of `ChildMetric` should match according to `labelConfig`
value. This contract is broken in `getOrAddChild` as we don't lock the metric
object with its children map during modification of `labelConfig`. This is
reflected in `Each`'s implementation, where the code assumes that every child's
`labelValuesSlice` has a length consistent with the parent's `labelConfig`.

To address this, this patch makes sure that we are evaluating LabelConfig and
get/add child metric inside the same lock.

Epic: None
Fixes: cockroachdb#147475
Release note (bug fix): Concurrent invocation of child metric updates and
metric reinitialisation will not result in error during scrape.
@aa-joshi aa-joshi force-pushed the reinitialise_bug branch from 131730e to 4517659 Compare May 30, 2025 07:15
@aa-joshi
Copy link
Contributor Author

TFTR!
bors r+

@craig
Copy link
Contributor

craig bot commented May 30, 2025

@craig craig bot merged commit 21d5eb0 into cockroachdb:master May 30, 2025
22 checks passed
Copy link

blathers-crl bot commented May 30, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #147475: branch-release-25.2.1-rc.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-25.2.1-rc Monday, 06/02: release-25.2.1-rc will be frozen branch-release-25.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. target-release-25.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Observability: Panic: index out of range when sql.metrics.application_name.enabled and sql.metrics.database_name.enabled are modified concurrently
4 participants