
[DRAFT] [DO NOT REVIEW] Introduce CachedSupplier for BasePersistence objects #1765


Closed

Conversation

adnanhemani
Contributor

I came across an interesting bug yesterday that we need to fix to ensure that tasks can use the BasePersistence object, as they run outside of user call contexts.

What I was trying to do:

  1. Create and run a Task which dumps some information to the persistence. To get a BasePersistence object for this, I used the following line of code: `metaStoreManagerFactory.getOrCreateSessionSupplier(CallContext.getCurrentContext().getRealmContext()).get();`
  2. The final .get() call fails with the following error:
jakarta.enterprise.context.ContextNotActiveException: RequestScoped context was not active when trying to obtain a bean instance for a client proxy...

When digging into why this was happening, I realized that, due to the Supplier's lazy loading at https://github.com/apache/polaris/blob/main/extension/persistence/relational-jdbc/src/main/java/org/apache/polaris/extension/persistence/relational/jdbc/JdbcMetaStoreManagerFactory.java#L100-L105, the .get() was actually using a RequestScoped realmContext bean supplied by the previously-run TokenBroker initialization (a RequestScoped object here: https://github.com/apache/polaris/blob/main/quarkus/service/src/main/java/org/apache/polaris/service/quarkus/config/QuarkusProducers.java#L290-L299). Since this is a relatively new addition, that may be why we haven't seen this bug before.

As Tasks run asynchronously, likely after the original request has already completed, this error actually makes sense: we should not be able to use a request-scoped bean inside a Task execution. On further inspection, though, we do not actually need realmContext for anything other than resolving the realmIdentifier once during BasePersistence initialization. As a result, we can cache the BasePersistence object using a supplier that caches the first result instead of constantly creating new objects. This also solves our issue, because the original request-scoped RealmContext bean is never used again when the Task obtains a BasePersistence object.
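The failure mode described above can be simulated without CDI. In this hypothetical sketch, `RequestScopedRealmContext` and its `requestActive` flag stand in for Quarkus' client proxy, and `IllegalStateException` stands in for `ContextNotActiveException`; the point is that the lambda captures the proxy itself, not the resolved identifier:

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a RequestScoped client proxy: it resolves the
// realm identifier only while the "request" is active.
class RequestScopedRealmContext {
    static boolean requestActive = true;

    String getRealmIdentifier() {
        if (!requestActive) {
            throw new IllegalStateException("RequestScoped context was not active");
        }
        return "realm-1";
    }
}

class LazySupplierDemo {
    // The lambda captures the proxy, so every get() re-reads the
    // request-scoped bean instead of a materialized identifier.
    static Supplier<String> poisoned(RequestScopedRealmContext ctx) {
        return () -> "persistence-for-" + ctx.getRealmIdentifier();
    }

    public static void main(String[] args) {
        Supplier<String> s = poisoned(new RequestScopedRealmContext());
        System.out.println(s.get()); // works while the request is active
        RequestScopedRealmContext.requestActive = false; // request ends
        try {
            s.get(); // a Task calling this later blows up
        } catch (IllegalStateException e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
    }
}
```

A supplier that resolves and caches the identifier on its first call, while the request is still active, avoids the second, failing dereference.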

I've added a test case that shows the difference between the OOTB supplier and my preferred fix using a CachedSupplier. If there is significant concern that we cannot cache the BasePersistence object, we can materialize the RealmContext before building the supplier so that, at a minimum, the RequestScoped RealmContext object is not reused; but I'm not sure there's an easy way to test that, given that the MetastoreFactories are Quarkus ApplicationScoped objects.

Please note, this is an issue in both EclipseLink and JDBC, as their code paths here are almost identical.

Many thanks to @singhpk234 for being my debugging rubber ducky :)

@adnanhemani
Contributor Author

adnanhemani commented May 31, 2025

cc @dimas-b (as you are looking at the similar issue at #1758), @eric-maynard , @collado-mike

edit: sorry, wrong PR number

@adnanhemani
Contributor Author

cc @adutra as well as you are also looking through #1758 .

@adutra
Contributor

adutra commented Jun 2, 2025

@adnanhemani thanks for bringing my attention to this PR.

I realized that due to the Supplier's lazy-loading [...] the .get() was actually using a RequestScoped realmContext bean given by the previously-ran TokenBroker initialization

Hmm, I looked at your code snippets, but I don't see the connection between the TokenBroker bean production and the lazy loading of JdbcBasePersistenceImpl. But assuming this is happening inside a task executor thread, and the problem is RealmContext, why don't you resolve the realmId eagerly? E.g.:

  private void initializeForRealm(
      RealmContext realmContext, RootCredentialsSet rootCredentialsSet, boolean isBootstrap) {
    String realmId = realmContext.getRealmIdentifier(); // resolve realm ID eagerly
    DatasourceOperations databaseOperations = getDatasourceOperations(isBootstrap);
    sessionSupplierMap.put(
        realmId,
        () ->
            new JdbcBasePersistenceImpl(
                databaseOperations,
                secretsGenerator(() -> realmId, rootCredentialsSet),
                storageIntegrationProvider,
                realmId));

    PolarisMetaStoreManager metaStoreManager = createNewMetaStoreManager();
    metaStoreManagerMap.put(realmId, metaStoreManager);
  }

@adnanhemani
Contributor Author

@adutra thanks for taking a look :)

Hmm I looked at your code snippets but I don't see the connection between the TokenBroker bean production and the lazy loading of JdbcBasePersistenceImpl

The connection is that the TokenBroker bean is RequestScoped, and its initialization creates a BasePersistence Supplier using the realmContext from that request-scoped initialization. That Supplier is then stored in the sessionSupplierMap, and when lazy loading later invokes it, it tries to resolve the bean's (now-expired) realmContext. Is that clearer? If not, let me know which part is unclear!

But assuming that this is happening inside a task executor thread, and the problem is RealmContext, why don't you resolve the realmId eagerly?

Yes, this was my original idea, but it was hard for me to construct a test case for this type of fix. Maybe this is something you've had more experience with, but I just wasn't able to use a request-scoped realmContext bean during a test at all. Additionally, I'm not sure we gain anything from continuously re-creating JdbcBasePersistenceImpl objects; is there really a good reason to lazy-load this? If not, why not cache the object as is?

As a result, I'm proposing the CachedSupplier as our preferred way to solve this issue instead. But I'm not heavily tied to this approach if we find a better way to test the approach you suggested.
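The CachedSupplier idea can be sketched as a thread-safe memoizing wrapper (a minimal, illustrative version, not the PR's exact code; it assumes the delegate never returns null):

```java
import java.util.function.Supplier;

// Memoizing supplier: the delegate runs at most once, and every later
// call returns the cached instance. Double-checked locking keeps the
// fast path lock-free after initialization.
class CachedSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private volatile T value; // null until the first get()

    CachedSupplier(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    value = result = delegate.get();
                }
            }
        }
        return result;
    }
}
```

Because the delegate runs only once, typically while the original request is still active, the captured RealmContext is never dereferenced again from a Task thread.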

@adutra
Contributor

adutra commented Jun 2, 2025

The connection is that the TokenBroker bean is RequestScoped and it does create a BasePersistence Supplier

I still don't see any TokenBroker creating any BasePersistence anywhere in the code 🤔

@adnanhemani as it stands, this PR is IMO not mergeable: it has no clear error description, no stack trace we can investigate, no reproducer, and no test case (CachedSupplierTest is just a unit test; there is no test showing evidence of the broken behavior that the proposed changes would fix).

@adnanhemani
Contributor Author

adnanhemani commented Jun 5, 2025

@adutra - I've reproduced the issue on a branch in my fork: https://github.com/adnanhemani/polaris/tree/ahemani/show_failure_1765

You can read the full diff there, but in short I made a really simple case that creates a task when you create a catalog. The task only tries to get the BasePersistence object, which is where the call blows up due to the poisoned cache. Feel free to attach a debugger and you'll see that it is because of the lazy loading of the JdbcBasePersistenceImpl class, and that the cache poisoning happened during the creation of the TokenBroker (RequestScoped) bean.

Steps to reproduce the error using the code linked above:

  1. [This can only be reproduced using JDBC or EclipseLink.] Create a Persistence instance and set application.properties to the right set of configurations.
  2. Run: ./polaris --client-id <CLIENT_ID> --client-secret <CLIENT_SECRET> catalogs create polaris1 --storage-type FILE --default-base-location "/var/tmp/polaris1/" (you must try this
  3. Wait for the Task to execute. It will fail and retry until it runs out of retries altogether, then log that the task could not be completed successfully. A stack trace will also be visible there.

You can then apply this PR on top of that code, retry these steps, and see that the issue no longer occurs.

More on how the TokenBroker creates the poisoned cache:

  1. tokenBrokerFactory.apply(realmContext): https://github.com/apache/polaris/blob/main/quarkus/service/src/main/java/org/apache/polaris/service/quarkus/config/QuarkusProducers.java#L289. Note this is a RequestScoped bean - and so is realmContext.
  2. createTokenBroker(realmContext): https://github.com/apache/polaris/blob/main/service/common/src/main/java/org/apache/polaris/service/auth/JWTRSAKeyPairFactory.java#L53
  3. metaStoreManagerFactory.getOrCreateMetaStoreManager(realmContext): https://github.com/apache/polaris/blob/main/service/common/src/main/java/org/apache/polaris/service/auth/JWTRSAKeyPairFactory.java#L65-L66
  4. initializeForRealm(realmContext, null, false);: https://github.com/apache/polaris/blob/main/persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/JdbcMetaStoreManagerFactory.java#L177

And that call is where the sessionSupplierMap stores the poisoned lambda that creates JdbcBasePersistenceImpl. At no point in this call chain was realmContext replaced with a materialized version of the realmIdentifier, which is why a RequestScoped bean made its way into the sessionSupplierMap.

Again, your suggestion above to materialize the realmContext (perhaps from the tokenBroker itself) would solve this issue. But I have no idea how to write a test ensuring something like this cannot happen again. If you have an idea, I'm glad to switch to that approach.

@adnanhemani
Contributor Author

A decision on this PR is blocking #1844. In order to unblock the testing on that PR, I've included this change on that PR as well.

@@ -100,7 +101,12 @@ private void initializeForRealm(
final StoreType backingStore = createBackingStore(diagnostics);
sessionSupplierMap.put(
realmContext.getRealmIdentifier(),
() -> createMetaStoreSession(backingStore, realmContext, rootCredentialsSet, diagnostics));
new CachedSupplier<>(
Member


I'm not sure this change conforms to the behavior intended by the sessionSupplierMap.
Currently, each call to the supplier yields a new instance; this PR changes the behavior to provide a lazily initialized Persistence instance per realm ID.
Maybe @collado-mike or @dennishuo could chime in here.

Contributor Author


This is a good point. It makes me wonder: would my original idea of materializing the RealmContext prior to creating the Supplier also become an issue? For instance, could the RealmContext also be computed lazily in some instances?

Let me follow up with @collado-mike and @dennishuo on this.

Contributor


Yeah, each instance of the session was intended to be scoped to a single request. It seems, though, that all the current implementations are stateless, while the TransactionalPersistence interface methods imply a stateful implementation; e.g., lookupEntityInCurrentTxn assumes there is a current transaction that has already been started and will be committed at some point.

Contributor Author

@adnanhemani adnanhemani Jun 16, 2025


Got it. The way I see it, we really have two options to fix this issue:

  1. Keep this approach and change the semantics of the TransactionalPersistence interface to be stateless (either now or in the future). OR
  2. Materialize the Realm ID and create a new realmContext to pass into the supplier, breaking the dependency on the realmContext that originally came from the function signature. While this is a less invasive change, I do not have an easy way to test this behavior.

I'm leaning towards option 2 simply because it is less invasive, but is everyone else okay with there being no hard test for this few-line change? It would look something like this:

  private void initializeForRealm(
      RealmContext realmContext, RootCredentialsSet rootCredentialsSet) {
    final StoreType backingStore = createBackingStore(diagnostics);
    String materializedRealmId = realmContext.getRealmIdentifier(); // resolved eagerly
    RealmContext materializedRealmContext = () -> materializedRealmId;
    sessionSupplierMap.put(
        materializedRealmId,
        () ->
            createMetaStoreSession(
                backingStore, materializedRealmContext, rootCredentialsSet, diagnostics));

    PolarisMetaStoreManager metaStoreManager = createNewMetaStoreManager();
    metaStoreManagerMap.put(materializedRealmId, metaStoreManager);
  }
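The essential move in this snippet can be isolated: read the identifier once while the request is still active, and hand downstream lambdas a fresh context that owns only that string. A distilled sketch, using a minimal stand-in interface rather than the project's actual RealmContext type:

```java
// Minimal stand-in for org.apache.polaris.core.context.RealmContext.
interface RealmCtx {
    String getRealmIdentifier();
}

class Materialize {
    // Resolve the identifier eagerly; the returned context holds only the
    // String, so no request-scoped bean is captured by downstream lambdas.
    static RealmCtx materialize(RealmCtx requestScoped) {
        String id = requestScoped.getRealmIdentifier(); // resolved now
        return () -> id;
    }
}
```

Note that this intentionally discards everything about the original context except the identifier, which is exactly the concern raised in the next comment about custom RealmContext implementations.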

Contributor


This is a good question and it's not as straightforward as it might seem. The RealmContext interface only defines a getRealmIdentifier method, but there are concrete implementations that may carry extra information about the realm (we have our own custom impl). Simply materializing the RealmContext this way could break functionality if the underlying Session/MetaStoreManager depends on the concrete implementation.

I think the proper long-term fix is to make the BasePersistence itself a CDI-managed bean so that the RealmContext can be injected by the context rather than us materializing it manually. That also means we have to make the task execution framework CDI-managed, which is a bigger task that we've been putting off for a while.


import java.util.function.Supplier;

public class CachedSupplier<T> implements Supplier<T> {
Member


IIRC this functionality already exists in Guava.

Contributor Author


TIL - I believe you're talking about Suppliers#memoize. Thanks! Will convert to using this if we continue with this approach!
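The contract of Suppliers#memoize can be sketched with the JDK alone (a hypothetical stand-in, not Guava's actual implementation; Guava's version is likewise thread-safe):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Stdlib stand-in with the same contract as Guava's Suppliers.memoize:
// the delegate is invoked at most once; later get() calls reuse the result.
class MemoizeDemo {
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value; // cached after the first call

            @Override
            public synchronized T get() {
                if (value == null) {
                    value = delegate.get();
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> s = memoize(() -> "session-" + calls.incrementAndGet());
        System.out.println(s.get()); // delegate runs here
        System.out.println(s.get()); // cached; delegate not re-run
    }
}
```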

@adnanhemani adnanhemani changed the title Introduce CachedSupplier for BasePersistence objects [DRAFT] [DO NOT REVIEW] Introduce CachedSupplier for BasePersistence objects Jul 2, 2025
@adnanhemani
Contributor Author

Closed in favor of #1988.

@adnanhemani adnanhemani closed this Jul 4, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board Jul 4, 2025