Contributor

shameersss1 commented Jul 24, 2025

Description of PR

Refer to YARN-11838 for more details.

The issue is as follows:

  1. The LOG statement that prints newNodeToAttributesMap iterates over host.attributes.
  2. Another thread modifies host.attributes during that iteration, leading to a ConcurrentModificationException.

There are two ways to solve this:

  1. Acquire the read lock before the LOG statement so that host.attributes cannot be modified while it is being logged.
  2. Create a defensive copy of host.attributes under the read lock (the lock is needed because a modification can happen during the copy as well).

The rationale for choosing option 2 is to avoid inconsistency between what is logged and what is processed. Assume that we take the read lock only around the LOG statement. Once the LOG statement has executed and the lock is released, another thread can modify host.attributes, so we would log one thing and process another.

Creating a defensive copy ensures that the value does not change underneath us, i.e. what is logged is also what gets processed.
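
As a rough illustration of option 2, here is a minimal sketch of copying the key set while holding the read lock. The class HostAttributesSketch and its methods are hypothetical stand-ins invented for this example; they are not the actual NodeAttributesManagerImpl code, and the attribute values are simplified to strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for per-host attribute state; not the real Hadoop class.
class HostAttributesSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // A plain HashMap is only safe here because every access holds the lock.
  private final Map<String, String> attributes = new HashMap<>();

  // Writers take the write lock, which excludes all readers and other writers.
  void addAttribute(String name, String value) {
    lock.writeLock().lock();
    try {
      attributes.put(name, value);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Option 2: copy the keys while holding the read lock, then release it.
  Set<String> snapshotAttributeNames() {
    lock.readLock().lock();
    try {
      return new HashSet<>(attributes.keySet());
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    HostAttributesSketch host = new HostAttributesSketch();
    host.addAttribute("os", "linux");
    Set<String> snapshot = host.snapshotAttributeNames();
    host.addAttribute("gpu", "true"); // a later write does not touch the snapshot
    System.out.println(snapshot);     // prints [os]
  }
}
```

Because the snapshot is detached from the map, it can be logged and then processed without holding any lock, and both steps see the same contents.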

How was this patch tested?

Added unit test

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@sjlee
Contributor

sjlee commented Jul 24, 2025

@shameersss1 Thanks for your contribution. I haven't sat down and looked at the larger code yet, but a couple of questions:

  • Why are we using the read lock for a mutation operation? Shouldn't we be using the write lock? The read lock will still permit concurrent operation and is not the right thing to use here, no?
  • Regarding the unit test, I wonder how it is passing even with the read lock? Maybe the concurrency is not enough to reproduce the problem? It would be great if you could reproduce the problem with the old code first and prove that the new code fixes it.
  • Have you done a full analysis of all the reads and writes to this hashmap so that all read access is protected by the read lock and all write access by the write lock? That is the correct thing to do here.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 21m 57s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 46m 9s trunk passed
+1 💚 compile 1m 6s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 54s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 56s trunk passed
+1 💚 mvnsite 1m 0s trunk passed
+1 💚 javadoc 0m 57s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 50s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 58s trunk passed
+1 💚 shadedclient 41m 16s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 49s the patch passed
+1 💚 compile 0m 57s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 57s the patch passed
+1 💚 compile 0m 47s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 47s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 43s the patch passed
+1 💚 mvnsite 0m 50s the patch passed
+1 💚 javadoc 0m 47s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 42s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 56s the patch passed
+1 💚 shadedclient 41m 56s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 119m 51s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch failed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
286m 58s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/1/artifact/out/Dockerfile
GITHUB PR #7828
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 3f403c42a993 5.15.0-139-generic #149-Ubuntu SMP Fri Apr 11 22:06:13 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c48dd49
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/1/testReport/
Max. process+thread count 912 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@shameersss1
Contributor Author

Thanks @sjlee for the review. Please find the answers inline

* Why are we using the read lock for a mutation operation? Shouldn't we be using the write lock? The read lock will still permit concurrent operation and is not the right thing to use here, no?

The method refreshNodeAttributesToScheduler does not do any writing. It only reads the variable host.attributes, which can potentially be modified by some other thread, leading to a ConcurrentModificationException. We are also creating a defensive copy so that further access is safe.

The read lock ensures that refreshNodeAttributesToScheduler can still be executed by multiple threads at once (since there is no writing), while the critical block newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet())); is protected from concurrent writers.
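
To make the read-lock semantics concrete, here is a small sketch using java.util.concurrent.locks.ReentrantReadWriteLock. The class and method names are hypothetical and exist only for this example. While one thread holds the read lock, a second thread can also take the read lock, but an attempt to take the write lock does not succeed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; shows what another thread can acquire
// while the calling thread holds the read lock.
class ReadLockSemanticsSketch {

  // Returns {secondReaderAcquired, writerAcquired}.
  static boolean[] probeWhileReadLocked() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    ExecutorService other = Executors.newSingleThreadExecutor();
    lock.readLock().lock();
    try {
      // A second reader on another thread: tryLock() does not block.
      boolean reader = other.submit(() -> {
        boolean ok = lock.readLock().tryLock();
        if (ok) {
          lock.readLock().unlock();
        }
        return ok;
      }).get();
      // A writer on another thread: refused while any read lock is held.
      boolean writer = other.submit(() -> {
        boolean ok = lock.writeLock().tryLock();
        if (ok) {
          lock.writeLock().unlock();
        }
        return ok;
      }).get();
      return new boolean[] {reader, writer};
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      lock.readLock().unlock();
      other.shutdown();
    }
  }

  public static void main(String[] args) {
    boolean[] result = probeWhileReadLocked();
    System.out.println("second reader acquired: " + result[0]); // true
    System.out.println("writer acquired: " + result[1]);        // false
  }
}
```

This is why a read lock is enough for a section that only reads: it keeps writers out as long as every writer takes the write lock.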

* Regarding the unit test, I wonder how it is passing even with the read lock? Maybe the concurrency is not enough to reproduce the problem? It would be great if you could reproduce the problem with the old code first and prove that the new code fixes it.

Since it is a race condition, reproducing it in a unit test is difficult without inducing artificial sleeps in the core code flow. Yes, the unit test passes even without this change. The purpose of this unit test is more of a protective measure.
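
For reference, the underlying failure mode can be shown deterministically in a single thread, without sleeps. The sketch below is not a reproduction of the race in the resource manager; it only illustrates, with a hypothetical class, how a HashMap key iterator fails fast when the map is structurally modified while an iteration is in progress.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical demo class; single-threaded illustration of the fail-fast check.
class FailFastSketch {

  // Returns true if continuing the iteration after a put() throws CME.
  static boolean mutationDuringIterationThrows() {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("os", "linux");
    attributes.put("arch", "x86_64");

    Iterator<String> it = attributes.keySet().iterator();
    it.next();                      // iteration is now in progress
    attributes.put("gpu", "true");  // stands in for a write from another thread
    try {
      it.next();                    // the iterator notices the modification here
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(mutationDuringIterationThrows()); // prints true
  }
}
```

In the real code the put() happens on a different thread, so whether the exception appears depends on timing, which is what makes the original bug hard to trigger from a test.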

* Have you done a full analysis of all the reads and writes to this hashmap so that all read access is protected by the read lock and all write access by the write lock? That is the correct thing to do here.

As per my analysis, host.attributes is accessed during node attribute addition, removal, and replacement. Every access except this one is protected by the read/write lock.

@shameersss1
Contributor Author

The unit test failure seems flaky; on rerun it passed locally:

[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 285.3 s -- in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:13 min
[INFO] Finished at: 2025-07-25T15:25:01+05:30

@shameersss1
Contributor Author

@slfan1989 @TaoYang526 @zeekling Could you please review?

@sjlee
Contributor

sjlee commented Jul 26, 2025

I see that you're copying the key set while holding the read lock to avoid the issue. I do think it is one correct way to address the issue. That's a valid fix.

My only point would be that guarding the logging call might be a cheaper and still correct fix, as it avoids copying. I don't think a keySet() call would cause iteration so that is still safe without the read lock. Let me know what you think.
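
The distinction between calling keySet() and copying it can be sketched as follows. The class name is hypothetical and used only for illustration. HashMap.keySet() returns a live view without iterating, whereas new HashSet<>(view) walks the view once and produces an independent set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; contrasts the live key view with a copy of it.
class KeySetViewSketch {

  // Returns {viewSize, copySize} after a put() that happens later.
  static int[] sizesAfterLaterPut() {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("os", "linux");

    Set<String> view = attributes.keySet();   // no iteration, just a view of the map
    Set<String> copy = new HashSet<>(view);   // iterates the view once

    attributes.put("gpu", "true");            // later modification of the map
    return new int[] {view.size(), copy.size()};
  }

  public static void main(String[] args) {
    int[] sizes = sizesAfterLaterPut();
    System.out.println("view: " + sizes[0] + ", copy: " + sizes[1]); // view: 2, copy: 1
  }
}
```

So storing the view is cheap and needs no lock at that point, but anything that later iterates it (such as a log statement) still reads the underlying map.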

@shameersss1
Contributor Author

shameersss1 commented Jul 26, 2025

I see that you're copying the key set while holding the read lock to avoid the issue. I do think it is one correct way to address the issue. That's a valid fix.

My only point would be that guarding the logging call might be a cheaper and still correct fix, as it avoids copying. I don't think a keySet() call would cause iteration so that is still safe without the read lock. Let me know what you think.

The lock is required for creating the copy (since copying iterates over the key set). The only advantage I see with copying is that the log statement will be consistent with what we process. If we don't copy, another thread might modify host.attributes after we execute the LOG statement, which would lead to inconsistency between what is logged and what is processed.

On a side note, I don't anticipate a host having so many attributes that the copy becomes expensive.

@sjlee Any thoughts on this?

// other threads might access host.attributes
readLock.lock();
try {
newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet()));
Contributor

Judging from the stack, the problem occurred when the log was printed, and the wrong line was modified.

Contributor Author

The issue is as follows:

  1. The LOG statement that prints newNodeToAttributesMap iterates over host.attributes.
  2. Another thread modifies host.attributes during that iteration, leading to a ConcurrentModificationException.

There are two ways to solve this:

  1. As you said, acquire the read lock before the LOG statement so that host.attributes cannot be modified while it is being logged.
  2. Create a defensive copy of host.attributes under the read lock (the lock is needed because a modification can happen during the copy as well).

The rationale for choosing option 2 is to avoid inconsistency between what is logged and what is processed. Assume that we take the read lock only around the LOG statement. Once the LOG statement has executed and the lock is released, another thread can modify host.attributes, so we would log one thing and process another.

Creating a defensive copy ensures that the value does not change underneath us, i.e. what is logged is also what gets processed.

@shameersss1 shameersss1 requested a review from zeekling July 28, 2025 08:52
@TaoYang526
Contributor

@shameersss1 Thanks for fixing this issue, LGTM.

@shameersss1
Contributor Author

@slfan1989 - Gentle reminder for review

@violetnspct

@shameersss1 Should you be adding unit tests to cover the following two edge cases? Or are those already covered?

  1. Lock acquisition failure. Important because lock acquisition could fail in high contention.
  2. Exception during locked section. Important to verify lock release in error conditions

@shameersss1
Contributor Author

@shameersss1 Should you be adding unit tests to cover the following two edge cases? Or those are already covered?

1. Lock acquisition failure. Important because lock acquisition could fail in high contention.

2. Exception during locked section. Important to verify lock release in error conditions

The locking is consistent with the other methods in the class, which use a try {} finally {} block to release the lock, hence I don't see any concerns here.
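
A minimal sketch of the try/finally pattern referred to above, with a hypothetical class that throws inside the locked section. Note that ReadLock.lock() waits until the lock is available rather than returning a failure, so the case worth checking is that the lock is released when the guarded code throws.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; verifies that finally releases the read lock on error.
class LockReleaseSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Acquires the read lock, then fails inside the guarded section.
  void readAndFail() {
    lock.readLock().lock();
    try {
      throw new IllegalStateException("simulated failure inside the locked section");
    } finally {
      lock.readLock().unlock(); // runs even though the body threw
    }
  }

  // Number of read locks currently held on this lock.
  int readLockCount() {
    return lock.getReadLockCount();
  }

  public static void main(String[] args) {
    LockReleaseSketch sketch = new LockReleaseSketch();
    try {
      sketch.readAndFail();
    } catch (IllegalStateException expected) {
      // the exception propagates to the caller as usual
    }
    System.out.println(sketch.readLockCount()); // prints 0
  }
}
```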

@shameersss1
Contributor Author

@slfan1989 - Could you please review the same ?

@slfan1989 slfan1989 requested review from slfan1989 and Copilot August 14, 2025 01:25

@Copilot Copilot AI left a comment

Pull Request Overview

This PR addresses a ConcurrentModificationException that occurs when refreshing node attributes in YARN. The issue arises when a LOG statement iterates over host.attributes while another thread modifies the same collection, causing a race condition.

Key changes:

  • Implements read locking and defensive copying in refreshNodeAttributesToScheduler method
  • Creates a defensive copy of host.attributes.keySet() under read lock protection
  • Adds comprehensive unit tests to verify the concurrency fix

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
NodeAttributesManagerImpl.java Fixes race condition by adding read lock and defensive copy when accessing host.attributes
TestRefreshNodeAttributesConcurrency.java Adds new test class with concurrent modification tests to verify the fix

// other threads might access host.attributes
readLock.lock();
try {
newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet()));

Copilot AI Aug 14, 2025

Creating a new HashSet on every call could impact performance. Consider using Collections.unmodifiableSet() if the caller doesn't need to modify the returned set, or implement a more efficient copying mechanism for frequently called methods.

Suggested change
- newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet()));
+ newNodeToAttributesMap.put(hostName, Collections.unmodifiableSet(new HashSet<>(host.attributes.keySet())));
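
For context on this suggestion, the sketch below (hypothetical class, illustration only) shows what Collections.unmodifiableSet does and does not provide. The wrapper rejects writes made through it, but it is still a view: if it wraps the live key set instead of a copy, later changes to the map remain visible through it, so it does not replace the defensive copy.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; an unmodifiable wrapper is read-only, not a snapshot.
class UnmodifiableWrapperSketch {

  // True if a wrapper around the live key view sees a later put().
  static boolean wrapperTracksBackingMap() {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("os", "linux");
    Set<String> wrapped = Collections.unmodifiableSet(attributes.keySet());
    attributes.put("gpu", "true");
    return wrapped.contains("gpu");
  }

  // True if writing through the wrapper is rejected.
  static boolean wrapperRejectsWrites() {
    Set<String> wrapped = Collections.unmodifiableSet(new HashSet<String>());
    try {
      wrapped.add("x");
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(wrapperTracksBackingMap()); // prints true
    System.out.println(wrapperRejectsWrites());    // prints true
  }
}
```

In the suggested change the HashSet copy is still created, so the wrapper adds read-only protection for callers rather than saving the cost of the copy.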

attributesManager = new NodeAttributesManagerImpl();
conf.setClass(YarnConfiguration.FS_NODE_ATTRIBUTE_STORE_IMPL_CLASS,
FileSystemNodeAttributeStore.class, NodeAttributeStore.class);
conf = NodeAttributeTestUtils.getRandomDirConf(conf);

Copilot AI Aug 14, 2025

The NodeAttributeTestUtils class is referenced but not imported. This will cause a compilation error.

@slfan1989
Contributor

@slfan1989 - Could you please review the same ?

@shameersss1 Thank you for your contribution! I will review this part of the code as soon as possible.

@shameersss1
Contributor Author

@slfan1989 - Gentle reminder for review

@slfan1989
Contributor

@slfan1989 - Gentle reminder for review

@shameersss1 It makes sense from my perspective, but can we ensure that Yetus compiles successfully?

Sorry for the late reply. In the future, if there's a code review, I will be more mindful of the timing.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 32s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+0 🆗 mvndep 9m 1s Maven dependency ordering for branch
+1 💚 mvninstall 32m 14s trunk passed
+1 💚 compile 15m 51s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 13m 42s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 4m 19s trunk passed
+1 💚 mvnsite 2m 55s trunk passed
+1 💚 javadoc 2m 24s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 58s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 4m 43s trunk passed
+1 💚 shadedclient 36m 38s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 31s Maven dependency ordering for patch
+1 💚 mvninstall 1m 43s the patch passed
+1 💚 compile 15m 3s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 15m 3s the patch passed
+1 💚 compile 13m 36s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 13m 36s the patch passed
+1 💚 blanks 0m 1s The patch has no blanks issues.
+1 💚 checkstyle 4m 10s the patch passed
+1 💚 mvnsite 2m 51s the patch passed
+1 💚 javadoc 2m 18s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 59s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 5m 0s the patch passed
+1 💚 shadedclient 36m 58s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 22m 38s hadoop-common in the patch passed.
+1 💚 unit 112m 53s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 asflicense 1m 7s The patch does not generate ASF License warnings.
348m 8s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/2/artifact/out/Dockerfile
GITHUB PR #7828
Optional Tests dupname asflicense mvnsite unit codespell detsecrets xmllint compile javac javadoc mvninstall shadedclient spotbugs checkstyle
uname Linux 7233067d3421 5.15.0-152-generic #162-Ubuntu SMP Wed Jul 23 09:48:42 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 441d7df
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/2/testReport/
Max. process+thread count 1429 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 3m 7s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 38m 14s trunk passed
+1 💚 compile 1m 3s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 55s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 56s trunk passed
+1 💚 mvnsite 1m 1s trunk passed
+1 💚 javadoc 0m 57s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 47s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 55s trunk passed
+1 💚 shadedclient 35m 54s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 48s the patch passed
+1 💚 compile 0m 52s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 52s the patch passed
+1 💚 compile 0m 46s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 46s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 43s the patch passed
+1 💚 mvnsite 0m 51s the patch passed
+1 💚 javadoc 0m 43s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 40s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 57s the patch passed
+1 💚 shadedclient 36m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 113m 30s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch failed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
242m 18s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/3/artifact/out/Dockerfile
GITHUB PR #7828
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux fca7c4ebb1e6 5.15.0-152-generic #162-Ubuntu SMP Wed Jul 23 09:48:42 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / f161d24
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/3/testReport/
Max. process+thread count 979 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.
