-
Notifications
You must be signed in to change notification settings - Fork 9.1k
YARN-11838: YARN ConcurrentModificationException When Refreshing Node Attributes #7828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
@shameersss1 Thanks for your contribution. I haven't sat down and looked at the larger code yet, but a couple of questions:
|
💔 -1 overall
This message was automatically generated. |
Thanks @sjlee for the review. Please find the answers inline
The method ReadLock ensures that
Since it is raise condition - Replication through unit test is difficult without inducing artificial sleeps in the core code flow. Ye, the unit test passes even without this change as well. The purpose of this unit test is more of protective measure.
As per my analysis |
The unit test failure seems flaky and not reurun it passed locally,
|
@slfan1989 @TaoYang526 @zeekling could you please review ? |
I see that you're copying the key set while holding the read lock to avoid the issue. I do think it is one correct way to address the issue. That's a valid fix. My only point would be that guarding the logging call might be a cheaper and still correct fix, as it avoids copying. I don't think a keySet() call would cause iteration so that is still safe without the read lock. Let me know what you think. |
The lock is required for creating copy (since it will iterate). The only advantage i see with copying is that the log statement will be consistent with what we process. If we don't copy and some other threads might modify host.attribute after we execute the LOG statement - Will this lead to inconsistent logging and processing. On side note - I don't anticipate a host using a large number of attributes in which case the copy might become expensive @sjlee Any thoughts on this ? |
// other threads might access host.attributes | ||
readLock.lock(); | ||
try { | ||
newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Judging from the stack, the problem occurred when the log was printed, and the wrong line was modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that.
- The LOG statement which prints
newNodeToAttributesMap
tries to iteratehost.attribute
host.attribute
gets modified by some other thread - leading to concurrent modification exception.
There are two ways to solve this
- As you said to readLock before LOG statement so that
host.attribute
does not get modified during LOG statement - Create a defensive copy of
host.attribute
(under read lock because the modification can happen at that time as well).
The rationale behind using option 2 to avoid logging inconsistency- Assume that we readLock before LOG statement. Once the LOG statement is executed, some other thread modifies the host.attribute
this will lead to we logging something and processing something else.
Creating a defensive copy make sure that we don't change value. i.e what is LOGed gets processed as well.
@shameersss1 Thanks for fixing this issue, LGTM. |
@slfan1989 - Gentle reminder for review |
@shameersss1 Should you be adding unit tests to cover the following two edge cases? Or those are already covered?
|
The locking is inconsistent with the other methods in the class which uses try{}finally{} block to release the lock, hence i don't see any concerns here. |
@slfan1989 - Could you please review the same ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses a ConcurrentModificationException that occurs when refreshing node attributes in YARN. The issue arises when a LOG statement iterates over host.attributes while another thread modifies the same collection, causing a race condition.
Key changes:
- Implements read locking and defensive copying in
refreshNodeAttributesToScheduler
method - Creates a defensive copy of
host.attributes.keySet()
under read lock protection - Adds comprehensive unit tests to verify the concurrency fix
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
NodeAttributesManagerImpl.java | Fixes race condition by adding read lock and defensive copy when accessing host.attributes |
TestRefreshNodeAttributesConcurrency.java | Adds new test class with concurrent modification tests to verify the fix |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
// other threads might access host.attributes | ||
readLock.lock(); | ||
try { | ||
newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new HashSet on every call could impact performance. Consider using Collections.unmodifiableSet() if the caller doesn't need to modify the returned set, or implement a more efficient copying mechanism for frequently called methods.
newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet())); | |
newNodeToAttributesMap.put(hostName, Collections.unmodifiableSet(new HashSet<>(host.attributes.keySet()))); |
Copilot uses AI. Check for mistakes.
attributesManager = new NodeAttributesManagerImpl(); | ||
conf.setClass(YarnConfiguration.FS_NODE_ATTRIBUTE_STORE_IMPL_CLASS, | ||
FileSystemNodeAttributeStore.class, NodeAttributeStore.class); | ||
conf = NodeAttributeTestUtils.getRandomDirConf(conf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NodeAttributeTestUtils class is referenced but not imported. This will cause a compilation error.
Copilot uses AI. Check for mistakes.
@shameersss1 Thank you for your contribution! I will review this part of the code as soon as possible. |
@slfan1989 - Gentle reminder for review |
@shameersss1 It makes sense from my perspective, but can we ensure that Yetus compiles successfully? Sorry for the late reply. In the future, if there's a code review, I will be more mindful of the timing. |
c48dd49
to
441d7df
Compare
💔 -1 overall
This message was automatically generated. |
441d7df
to
f161d24
Compare
💔 -1 overall
This message was automatically generated. |
Description of PR
Refer YARN-11838 for more details.
The issue is that.
There are two ways to solve this
The rationale behind using option 2 to avoid logging inconsistency- Assume that we readLock before LOG statement. Once the LOG statement is executed, some other thread modifies the host.attribute this will lead to we logging something and processing something else.
Creating a defensive copy make sure that we don't change value. i.e what is LOGed gets processed as well.
How was this patch tested?
Added unit test
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?