@eaglerainbow eaglerainbow commented Aug 9, 2025

Problem

When multiple concurrent requests arrive with expired access tokens, the AbstractUaaTokenProvider could enter a broken state due to race conditions in the refresh token flow. This occurred because:

  1. Concurrent UAA requests: Multiple threads could simultaneously request new tokens using the same refresh token
  2. Refresh token invalidation: UAA invalidates refresh tokens after use, causing subsequent requests with the old token to fail
  3. Error caching: Failed refresh attempts could be cached, leading to persistent authentication failures

This issue manifested as authentication deadlocks and intermittent token failures in high-concurrency scenarios. It appears only when the refresh token flow is executed. Because that typically happens rarely (e.g. after 6 hrs), detecting the problem can be very tedious.

Solution

This PR implements a fix with two key mechanisms:

1. Request Serialization

  • Dedicated Scheduler: Added getTokenScheduler() to ConnectionContext providing a single-threaded scheduler per connection
  • Thread Safety: All token operations are serialized using publishOn(connectionContext.getTokenScheduler())
  • Prevents: Multiple concurrent UAA requests with the same refresh token
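To illustrate the serialization idea without pulling in Reactor, here is a minimal sketch that uses a plain JDK single-threaded executor as a stand-in for the scheduler returned by getTokenScheduler(). The class name SerializedTokenOps, the method names, and the simulated token call are assumptions made for this example; this is not the code from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: every token operation of one connection runs on the same single thread.
final class SerializedTokenOps {
    // One single-threaded executor per connection (stand-in for a per-connection token scheduler).
    private final ExecutorService tokenExecutor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "token-ops");
        t.setDaemon(true);
        return t;
    });
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();

    // Simulated token operation; the real one would call UAA.
    CompletableFuture<String> requestToken(String caller) {
        return CompletableFuture.supplyAsync(() -> {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(5); // pretend to wait for the token endpoint
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            inFlight.decrementAndGet();
            return "token-for-" + caller;
        }, tokenExecutor);
    }

    // Submits several overlapping requests and reports the highest observed parallelism.
    static int maxConcurrentTokenOps(int callers) {
        SerializedTokenOps ops = new SerializedTokenOps();
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (int i = 0; i < callers; i++) {
            futures.add(ops.requestToken("caller-" + i));
        }
        futures.forEach(CompletableFuture::join);
        ops.tokenExecutor.shutdown();
        return ops.maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println(maxConcurrentTokenOps(8)); // prints 1
    }
}
```

Because a single thread executes all token operations, two refresh attempts can never overlap, no matter how many callers submit work at the same time.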

2. Request Deduplication

  • Active Request Tracking: Added activeTokenRequests map to track ongoing token requests
  • Atomic Operations: Uses putIfAbsent() to ensure that only one request is in flight per connection context
  • Prevents: Wasteful duplicate requests when multiple threads need tokens simultaneously
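The deduplication idea can be sketched roughly as follows. The map name activeTokenRequests and the use of putIfAbsent() follow the description above; everything else (the class TokenRequestDeduplicator, the String key, CompletableFuture instead of a Reactor Mono, the complete() helper) is an assumption for illustration and not the PR's implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: callers that arrive while a token request is pending share its result.
final class TokenRequestDeduplicator {
    private final ConcurrentMap<String, CompletableFuture<String>> activeTokenRequests =
            new ConcurrentHashMap<>();
    // Counts how many requests would actually be sent to UAA.
    final AtomicInteger uaaCalls = new AtomicInteger();

    CompletableFuture<String> getToken(String connectionKey) {
        CompletableFuture<String> mine = new CompletableFuture<>();
        // putIfAbsent is atomic: exactly one caller per key gets null back and "owns" the request.
        CompletableFuture<String> existing = activeTokenRequests.putIfAbsent(connectionKey, mine);
        if (existing != null) {
            return existing; // join the request that is already in flight
        }
        uaaCalls.incrementAndGet(); // the owner would trigger the real UAA call here
        return mine;
    }

    // Called when the (simulated) UAA response arrives.
    void complete(String connectionKey, String token) {
        CompletableFuture<String> pending = activeTokenRequests.remove(connectionKey);
        if (pending != null) {
            pending.complete(token);
        }
    }

    // Three callers arrive before the response does; only one UAA call is made.
    static int uaaCallsForThreeOverlappingCallers() {
        TokenRequestDeduplicator d = new TokenRequestDeduplicator();
        CompletableFuture<String> a = d.getToken("connection-1");
        CompletableFuture<String> b = d.getToken("connection-1");
        CompletableFuture<String> c = d.getToken("connection-1");
        d.complete("connection-1", "access-1");
        if (a != b || b != c || !"access-1".equals(c.join())) {
            throw new IllegalStateException("callers did not share one request");
        }
        return d.uaaCalls.get();
    }

    public static void main(String[] args) {
        System.out.println(uaaCallsForThreeOverlappingCallers()); // prints 1
    }
}
```

Removing the entry before completing the future means that a caller arriving after the response starts a fresh request instead of reusing a finished one.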

Key Changes

Core Implementation

  • Added the getTokenScheduler() method to the ConnectionContext interface
  • Added concurrency controls to the token request logic in AbstractUaaTokenProvider

Testing

This PR includes concurrency unit tests, written in an integration-test style, to guard against future regressions.
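For readers who want a feel for what such a concurrency check can look like, here is a self-contained sketch. It is not taken from the PR's test suite: FakeUaa, GuardedProvider, and run() are hypothetical names, the fake endpoint merely imitates single-use refresh tokens, and a plain synchronized block stands in for the scheduler and map based guard described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class RefreshRaceCheck {
    // Fake token endpoint: every refresh token is valid exactly once.
    static final class FakeUaa {
        private String validRefreshToken = "refresh-0";
        private int counter = 0;
        final AtomicInteger rejected = new AtomicInteger();

        synchronized String[] refresh(String refreshToken) {
            if (!refreshToken.equals(validRefreshToken)) {
                rejected.incrementAndGet();
                throw new IllegalStateException("refresh token already used");
            }
            counter++;
            validRefreshToken = "refresh-" + counter;
            return new String[] {"access-" + counter, validRefreshToken};
        }
    }

    // Simplified provider: the lock ensures only one thread runs the refresh flow.
    static final class GuardedProvider {
        private final FakeUaa uaa;
        private String accessToken = null; // null models an expired access token
        private String refreshToken = "refresh-0";

        GuardedProvider(FakeUaa uaa) {
            this.uaa = uaa;
        }

        synchronized String getToken() {
            if (accessToken == null) {
                String[] pair = uaa.refresh(refreshToken);
                accessToken = pair[0];
                refreshToken = pair[1];
            }
            return accessToken;
        }
    }

    // Releases all threads at once; returns {distinct access tokens seen, rejected refresh attempts}.
    static int[] run(int threads) {
        FakeUaa uaa = new FakeUaa();
        GuardedProvider provider = new GuardedProvider(uaa);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        List<CompletableFuture<String>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            results.add(CompletableFuture.supplyAsync(() -> {
                try {
                    start.await(); // line up all callers before the race begins
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return provider.getToken();
            }, pool));
        }
        start.countDown();
        Set<String> distinct = new HashSet<>();
        for (CompletableFuture<String> result : results) {
            distinct.add(result.join());
        }
        pool.shutdown();
        return new int[] {distinct.size(), uaa.rejected.get()};
    }

    public static void main(String[] args) {
        int[] outcome = run(16);
        System.out.println(outcome[0] + " distinct token(s), " + outcome[1] + " rejected refresh(es)");
    }
}
```

With the guard in place, all callers should observe the same access token and the fake endpoint should reject nothing; removing the synchronized keyword from getToken() is a quick way to see the failure mode described in the Problem section.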

Addressed Issues

closes #1146


linux-foundation-easycla bot commented Aug 9, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@Kehrlann
Contributor

Thanks for the PR, @eaglerainbow. I'll need a bit of time to digest the history of this.

Is there a reason this PR is still in draft?

@eaglerainbow
Author

eaglerainbow commented Aug 18, 2025

> Thanks for the PR, @eaglerainbow.

You are welcome!

> I'll need a bit of time to digest the history of this.

For sure 😀 This also isn't an issue like others, which you fix just like that 😉
Please ask here (or in the linked issue), if there is something with which you struggle and I might be able to help.

> Is there a reason this PR is still in draft?

Yes, and that has a lot to do with the history of this issue 😉
Here's why:

  • The place where the issue happens is anything but nice and easy to fix. If we make a mistake there, a lot of your consumers could get into serious trouble, which they most likely cannot handle or fix on their own. I don't want to let this happen.
  • The fix is so complicated that I think we should discuss the approach first, before we even consider merging it.
  • Having written unit tests for it is one thing; being able to test it in a live environment is another. I could not do the latter (yet), which is why I still have reservations about just "calling it a fix and moving on".
  • I did not want to invest more time into this PR (I worked for several days on the fix to get it to the point you see here) before knowing whether you consider it worth pursuing.
  • There might also be better alternatives for the fix out there, but none have come to my mind yet.

I see that

Java CI / Java 11 build (pull_request) Failing after 3m

ran red. At the same time, the JVM 8 build seems to have run green. The builds for higher JVM versions seem to have been canceled.
Does it make sense for me to have a look at what caused the check to fail?

@Kehrlann
Contributor

@eaglerainbow it failed on JDK 11 because we run spotless checks on 11+, but not on 8.

Please run:

./mvnw spotless:apply -Pintegration-test

@eaglerainbow
Author

> @eaglerainbow it failed on JDK 11 because we run spotless checks on 11+, but not on 8.
>
> Please run:
>
> ./mvnw spotless:apply -Pintegration-test

Okay, didn't know that.
That sounds easy: ca1a020
Let's see if I have solved it (you need to trigger the workflow, though).

@Kehrlann
Contributor

This passed. I need to make time to review this, probably some time next week.

@eaglerainbow
Author

No rush, please.
This topic is risky and complicated enough to deserve some leisure, and it is old enough that one week more or less doesn't make a difference. Take the time you need to wrap your head around it.

Development

Successfully merging this pull request may close these issues.

Parallel Requests with Expired Access Tokens triggering Refresh Token Flow leads to Broken State (no further requests possible)