Reusable module cache #4621

graydon · 2025-01-11T01:40:26Z

This is the stellar-core side of a soroban change to surface the module cache for reuse.

On the core side we:

Add a new mechanism to scan the live BL snapshot for contracts alone (which is a small part of the BL)
Add a new wrapper type SorobanModuleCache to the core-side Rust code, which holds a soroban_env_host::ModuleCache for each host protocol version we support caching modules for (there is currently only one but there will be more in the future)
Also add a CoreCompilationContext type to contract.rs which carries a Budget and logs errors to the core console logging system. This is sufficient to allow operating the soroban_env_host::ModuleCache from outside the Host.
Pass a SorobanModuleCache into the host function invocation path that core calls during transactions.
Store a SorobanModuleCache in the LedgerManagerImpl that is long-lived, spans ledgers.
Add a utility class SharedModuleCacheCompiler that does a multithreaded load-all-contracts / populate-the-module-cache, and call this on startup when the LedgerManagerImpl restores its LCL.
Add a couple command-line helpers that list and forcibly compile all contracts (we might choose to remove these, they were helpful during debugging but at this point perhaps redundant)

The main things left to do here are:

Maybe move all this to the soon-to-be-added p23 soroban submodule
Decide how we're going to handle growing the cache with new uploads and wire that up
Similarly decide how we're going to handle expiry and restoration events as discussed in CAP0062 and wire that up too
Write some tests if I can think of any more specific than just "everything existing still works"
Write a CAP describing this

I think that's .. kinda it? The reusable module cache is just "not passed in" on p22 and "passed in" on p23, so it should just start working at the p23 boundary.

dmkozh

IIUC this still needs to be rebased on top of p23 module and have the host change merged, so I haven't looked too closely at the rust parts (IIUC currently we just don't pass the module cache to the host at all).

src/ledger/SharedModuleCacheCompiler.h

src/ledger/SharedModuleCacheCompiler.cpp

src/ledger/SharedModuleCacheCompiler.h

src/main/CommandLine.cpp

src/rust/src/contract.rs

SirTyson

I think this generally looks pretty good, the producer/compiler thread divide is a good idea! I think the producer thread needs some changes though.

To maintain the cache, I think we need to

Compile all live contract modules on startup.
Compile new contract modules during upload/restore.
Evict entries from the cache when the underlying WASM is evicted via State Archival eviction scans.

I think decoupling cache gc from eviction events is going to be expensive. If you have some background task that checks if a given module is live or not, it will have to read from many levels of the BL to determine if whatever BucketEntry it's looking at is the most up to date. Evicting from the cache when we do state archival evictions will remove this additional read amplification (since the archival scan has to this this multi level search already) and is simpler to maintain cache validity too.

The drawback to this is intial cache generation is a little more expensive, as we're limited to a single producer thread that has to iterate throught the BL in-order and keep track of seen keys. If we don't have a proper gc, we can't add any modules that have already been evited since they would cause a memory leak.

Looking at startup as a whole, we have a bunch of tasks that are BL disk read dominated, namely Bucket Apply, BucketIndex, and p23's upcoming Soroban state cache. Bucket Index can process all Buckets in parallel, but Bucket Apply, Soroban State cache, and Module Cache all require a single thread iterating the BL in order due to the outdated keys issue (in the future we could do this in parallel where each level marks it's "last seen key" and lower levels can't make progress beyond all their parents last seen key, but that's too invloved for v1).

Given that we're adding a bunch of work on the startup path and Horizon/RPC has indicated a need for faster startup times in the past, I think it makes sense to condense the Bucket Apply, Soroban state cache population, and Module cache producer thread into a single Work that makes a one shot pass on the BucketList. Especially in a captive-core instance, which we still run in our infra on EBS last I checked, I assume we're going to be disk bound even with the compilation step, so if we do compilation in the same pass as BucketApply we might just get it for free.

I don't think this needs to be in 23.0 (other than the memory leak issue), but if we have this in mind and make the initial version a little more friendly with the other Work tasks that happen on startup, it'll be easier to optimize this later.

src/ledger/SharedModuleCacheCompiler.h

src/ledger/SharedModuleCacheCompiler.cpp

src/ledger/SharedModuleCacheCompiler.h

src/ledger/SharedModuleCacheCompiler.cpp

src/bucket/BucketSnapshot.h

src/ledger/SharedModuleCacheCompiler.cpp

src/main/ApplicationUtils.cpp

graydon · 2025-01-23T06:09:33Z

Updated with the following:

Rebased
Adapted to simplified module cache in recently-merged upstream
Moved soroban changes to p23 submodule
Incremental expiry and addition of new contracts after ledger close
Less-chatty logging
Added support for populating module cache for side-by-side testing with SOROBAN_TEST_EXTRA_PROTOCOL=23

graydon · 2025-01-24T04:15:42Z

Updated with fix for test failure when running with background close, as well as all review comments addressed (I think!)

dmkozh

LGTM, but CI is still failing

SirTyson

I think there's a bug in the ApplyBuckets work path. Currently, we use the BucketList snapshot to compile WASM. Problem is, we call compileAllContractsInLedger in AssumeStateWork. After calling app.getBucketManager().assumeState, the raw BucketList is up to date, but we don't actually update the BucketList snapshots until after AssumeStateWork is down (see LedgerManagerImpl.cpp:392). I'm pretty sure we'd currently only compile state based on the genesis ledger when we hit the apply bucket path.

graydon · 2025-01-25T01:36:14Z

I think there's a bug in the ApplyBuckets work path. Currently, we use the BucketList snapshot to compile WASM. Problem is, we call compileAllContractsInLedger in AssumeStateWork. After calling app.getBucketManager().assumeState, the raw BucketList is up to date, but we don't actually update the BucketList snapshots until after AssumeStateWork is down (see LedgerManagerImpl.cpp:392). I'm pretty sure we'd currently only compile state based on the genesis ledger when we hit the apply bucket path.

Ah! You're right.

That path is actually only there to try to make compilation happen "post catchup", even though as it stands it is also triggered on the startup AssumeState as well. Incorrectly (as you mention) and redundantly with the explicit call to compileAllContractsInLedger below, at the bottom of LedgerManagerImpl::loadLastKnownLedger.

I think the fix here is to remove the call to compileAllContractsInLedger from AssumeStateWork altogether, and just do it in LedgerApplyManager when it decides the CatchupWork is complete. Then there won't be a redundant (and wrong!) compile happening during startup, and there will be a non-redundant (and hopefully right!) compile happening during catchup.

SirTyson · 2025-01-25T01:39:01Z

I think there's a bug in the ApplyBuckets work path. Currently, we use the BucketList snapshot to compile WASM. Problem is, we call compileAllContractsInLedger in AssumeStateWork. After calling app.getBucketManager().assumeState, the raw BucketList is up to date, but we don't actually update the BucketList snapshots until after AssumeStateWork is down (see LedgerManagerImpl.cpp:392). I'm pretty sure we'd currently only compile state based on the genesis ledger when we hit the apply bucket path.

Ah! You're right.

That path is actually only there to try to make compilation happen "post catchup", even though as it stands it is also triggered on the startup AssumeState as well. Incorrectly (as you mention) and redundantly with the explicit call to compileAllContractsInLedger below, at the bottom of LedgerManagerImpl::loadLastKnownLedger.

I think the fix here is to remove the call to compileAllContractsInLedger from AssumeStateWork altogether, and just do it in LedgerApplyManager when it decides the CatchupWork is complete. Then there won't be a redundant (and wrong!) compile happening during startup, and there will be a non-redundant (and hopefully right!) compile happening during catchup.

Hmm I think we still have to compile before catch-up though (or at least before applying ledgers in catch-up)? I assume (perhaps incorrectly) that we have asserts that compilation is always cached and will remove the functionality to lazily compile modules during tx invocation in p23. TX replay would break these invariants.

graydon · 2025-01-25T01:47:38Z

Hmm I think we still have to compile before catch-up though (or at least before applying ledgers in catch-up)? I assume (perhaps incorrectly) that we have asserts that compilation is always cached and will remove the functionality to lazily compile modules during tx invocation in p23. TX replay would break these invariants.

Hmmmm maybe! I am unclear on where we should compile during catchup, then. Perhaps in LedgerManagerImpl::setLastClosedLedger, at the end?

graydon · 2025-01-25T01:50:42Z

(In terms of asserts: there is no assert in the code currently that "every contract is cached" before we call a txn; fortunately or unfortunately, soroban will not-especially-gracefully degrade to making throwaway modules as needed. We could put an assert on the core side, and I guess we should since it represents a bug-that-turns-into-a-performance-hazard)

SirTyson · 2025-01-25T01:52:00Z

Hmmmm maybe! I am unclear on where we should compile during catchup, then. Perhaps in LedgerManagerImpl::setLastClosedLedger, at the end?

Yeah, that makes sense to me. Compilation is basically part of assuming the current ledger state, so seems like a reasonable place to put it.

graydon · 2025-01-25T02:16:33Z

Yeah, that makes sense to me. Compilation is basically part of assuming the current ledger state, so seems like a reasonable place to put it.

Done.

src/main/AppConnector.cpp

src/catchup/AssumeStateWork.cpp

src/ledger/LedgerManagerImpl.cpp

src/ledger/test/LedgerTests.cpp

src/ledger/LedgerManagerImpl.cpp

src/transactions/test/InvokeHostFunctionTests.cpp

src/ledger/LedgerManagerImpl.cpp

graydon · 2025-02-08T06:27:27Z

Rebased around @SirTyson's latest BL changes, repaired tests, integrated bits of residual feedback, squashed and cleaned up dead/redundant states, and picked out what I think were all the build-breaking bugs.

The last one (in 1bb1b8d) is one I suspect @SirTyson might disagree with; I do not like it either, and am very open to other options here! In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.

The presence of the statics means that global state of the LoadGenerator carries forward from one test in the unit testsuite to another. This in turn means that depending on how tests are partitioned by the test runner (which changes as we add tests to the testsuite) tests that rely on that state break or un-break (in particular the REQUIRE_THROWS_WITH check in testcase "generate soroban load" of LoadGeneratorTests.cpp breaks or un-breaks depending on test order).

I think this is not ideal! Resetting the LoadGenerator statics on construction, as I'm doing in this change, seems like the least we could do. And seems (when I test it locally) to be enough to get tests passing. For now. But it also seems wrong. Like: why have static state at all? Is there no way to get rid of it?

dmkozh · 2025-02-10T19:03:56Z

In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.

Yeah, I couldn't immediately tell either. Moreover, we seem to reset that 'soroban state' quite in a few places already. I suspect this works around some very particular scenarios that could just be fixed by preserving the loadgen instance in-between loadgen calls.

graydon · 2025-02-10T23:07:14Z

In general I do not understand, despite a lot of code-reading, why there are static members of LoadGenerator rather than just member variables.

Yeah, I couldn't immediately tell either. Moreover, we seem to reset that 'soroban state' quite in a few places already. I suspect this works around some very particular scenarios that could just be fixed by preserving the loadgen instance in-between loadgen calls.

I talked to @SirTyson about this this morning and he thinks it was just a misunderstanding -- like when it was written he was somehow thinking that LoadGenerator wasn't long-lived or something. He believes it should be fine to remove.

After that discussion, I removed the static qualifiers and at least locally on my system the tests all pass. Pushed update with that change. I guess if it works, we can just see if we rediscover some reason for those qualifiers in the future?

SirTyson

LGTM! I'd like to see some tracy zones on some of the performance sensitive functions (although to be fair these may already exist rust side, I didn't look too closely on non-C++ stuff). Also a minor refactoring point on AppConnector

src/ledger/LedgerManagerImpl.cpp

SirTyson · 2025-02-11T00:38:12Z

src/main/AppConnector.h

+    // cache handle point to the same, shared, threadsafe module cache. It may
+    // periodically be replaced by a delete/rebuild cycle in the LedgerManager.
+    // This is always done under a mutex.
+    rust::Box<rust_bridge::SorobanModuleCache> mModuleCache;


I think this goes against the original design intention of the app connector. AppConnector is originally intended to ba a lightweight "threadsafe" wrapper around App. Since App has many main-thread only functions, it's dangerous to pass around, so we have a handful of functions in AppConnector that we've carefully audited and know are thread safe. Having state changes that contract, and makes me think the mModuleCache and mModuleCacheMutex should be moved to LedgerManager, and AppConnector just has a pass through to LedgerManager for those calls.

I did try this initially but I then convinced myself that it wasn't correct, due to potential skews between the kicking-off of cache recompilation (which starts on main thread from the LCL-advancing step of ledgerClosed) and the point where cache recompilation ends (on the close thread, just before execution) and the point where the cache is used (on the close thread, repeatedly on each transaction).

Like I think the close thread can queue up multiple LCL-advancing callbacks for the main thread, and so any one of those on the main thread could decide to recompile the cache, making it change mid way through the close thread using it. Which wouldn't necessarily be a data race -- we mutex it -- but it would risk a nondeterministic miss or something, if the cache switches between two txs.

But now, re-reading it, to be honest I can't convince myself of what it should do, because I can't tell which functions run on which threads without working backwards from first principles each time. This is one of the things that really worried (and continue to worry) me about non-main-thread closing in general: I can't clearly reason about which variables are advancing when, and there's no way to enforce separation between them.

For the time being we should pause landing this until I can re-derive a crystal clear explanation, state-partition and lifecycle, and diagram it, and possibly figure out some way of actually enforcing it. I am really frustrated and really sorry.

Hmmm, why do we need to kick of the recompilation in the main thread? I agree the story gets a little confusing if we start bouncing between main and background threads, but I don't think that's necessary here.

We know that the cache is stricly an apply-time cache (or should be an apply time cache), barring startup. There's no reason I can see that the cache then shouldn't just be completely managed by the apply thread. I think from a thread perspective, recompilation is similar to background eviction, where it's launched and joined on the apply thread. We can join the compilation right before we start apply (in the apply thread), then kick it off at the same place we call eviction scan (also in the apply thread). While we haven't techincally advanced LCL yet, the apply thread still has access to all upgraded network settings and protocol upgrades at the point at which we kick off the eviction scan, since we just finished applying those upgrades, so there shouldn't be any issues of stale snapshots or inconsistencies. We can guarantee that nothing has happened that could invalidate the cache in between it's start and finish, as only the apply thread can do that, and we only start/join compilation at the start and end of apply thread.

With this approach we wouldn't even need locks. I believe Brett introduced the concept of a "thread assert" in #4617, where you can assertIsMain() and assertIsApplyThread(), which I imagine would be sufficient safety wise.

@SirTyson Ok I've adapted this to the post-LedgerManager-refactoring PR (#4647) and rebased on master. Sure enough the additional structure was enough to make it fairly trivially clear which snapshots to use, when, and why.

The previous iteration of the PR was mostly (or perhaps entirely) correct already, in the sense of compiling the right snapshots, but it was doing so more by accident than by intent. Or perhaps if I'm being very generous to myself: I had figured out the right place to do the recompiles already but forgotten why they were right, and was having trouble reconstructing the rationale. It's quite a bit more explicit now.

src/simulation/LoadGenerator.h

graydon · 2025-02-15T03:48:09Z

@SirTyson I believe this is ready to go again, though .. if we're holding back until post-release, it can continue to wait. But I'm not worried about its snapshot logic anymore!

graydon requested review from SirTyson, dmkozh and sisuresh January 11, 2025 01:40

dmkozh reviewed Jan 13, 2025

View reviewed changes

SirTyson requested changes Jan 14, 2025

View reviewed changes

graydon force-pushed the reusable-module-cache branch from c1ffdf7 to 89d85c2 Compare January 15, 2025 05:42

graydon force-pushed the reusable-module-cache branch from 89d85c2 to dd5b7eb Compare January 23, 2025 06:00

dmkozh reviewed Jan 24, 2025

View reviewed changes

SirTyson requested changes Jan 25, 2025

View reviewed changes

dmkozh reviewed Jan 29, 2025

View reviewed changes

dmkozh reviewed Feb 4, 2025

View reviewed changes

src/transactions/test/InvokeHostFunctionTests.cpp Outdated Show resolved Hide resolved

src/transactions/test/InvokeHostFunctionTests.cpp Outdated Show resolved Hide resolved

graydon force-pushed the reusable-module-cache branch 3 times, most recently from aab832b to db843ff Compare February 7, 2025 03:40

dmkozh previously approved these changes Feb 7, 2025

View reviewed changes

src/ledger/LedgerManagerImpl.cpp Outdated Show resolved Hide resolved

graydon dismissed dmkozh’s stale review via 5fb3aa5 February 8, 2025 05:59

graydon force-pushed the reusable-module-cache branch 2 times, most recently from 5fb3aa5 to 1bb1b8d Compare February 8, 2025 06:21

graydon force-pushed the reusable-module-cache branch from 1bb1b8d to 11dbcfa Compare February 10, 2025 23:02

dmkozh previously approved these changes Feb 10, 2025

View reviewed changes

SirTyson previously approved these changes Feb 11, 2025

View reviewed changes

graydon added 7 commits February 14, 2025 19:08

Add helpers to extract CONTRACT_CODE entries efficiently from BL.

2920aa1

Support reusable module cache with multithreaded pre-compilation

a5af74f

Add tests for reusable module cache.

fb5d75d

Bump max_connections in temp pg cluster due to more parallelism lately

605b44b

Record meta hashes for new tests

c72192b

Make test reporter robust against odd callback order (this happens)

0f06403

Remove mistaken 'static' from LoadGenerator soroban state.

cb25d21

graydon dismissed stale reviews from SirTyson and dmkozh via cb25d21 February 15, 2025 03:25

graydon force-pushed the reusable-module-cache branch from 11dbcfa to cb25d21 Compare February 15, 2025 03:25

Add some tracy zones

48fec10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reusable module cache #4621

Reusable module cache #4621

graydon commented Jan 11, 2025 •

edited

Loading

dmkozh left a comment

SirTyson left a comment

graydon commented Jan 23, 2025 •

edited

Loading

graydon commented Jan 24, 2025

dmkozh left a comment

SirTyson left a comment

graydon commented Jan 25, 2025

SirTyson commented Jan 25, 2025

graydon commented Jan 25, 2025

graydon commented Jan 25, 2025

SirTyson commented Jan 25, 2025

graydon commented Jan 25, 2025

graydon commented Feb 8, 2025

dmkozh commented Feb 10, 2025

graydon commented Feb 10, 2025

SirTyson left a comment

SirTyson Feb 11, 2025

graydon Feb 11, 2025

SirTyson Feb 11, 2025

SirTyson Feb 11, 2025 •

edited

Loading

graydon Feb 15, 2025

graydon commented Feb 15, 2025

Reusable module cache #4621

Are you sure you want to change the base?

Reusable module cache #4621

Conversation

graydon commented Jan 11, 2025 • edited Loading

dmkozh left a comment

Choose a reason for hiding this comment

SirTyson left a comment

Choose a reason for hiding this comment

graydon commented Jan 23, 2025 • edited Loading

graydon commented Jan 24, 2025

dmkozh left a comment

Choose a reason for hiding this comment

SirTyson left a comment

Choose a reason for hiding this comment

graydon commented Jan 25, 2025

SirTyson commented Jan 25, 2025

graydon commented Jan 25, 2025

graydon commented Jan 25, 2025

SirTyson commented Jan 25, 2025

graydon commented Jan 25, 2025

graydon commented Feb 8, 2025

dmkozh commented Feb 10, 2025

graydon commented Feb 10, 2025

SirTyson left a comment

Choose a reason for hiding this comment

SirTyson Feb 11, 2025

Choose a reason for hiding this comment

graydon Feb 11, 2025

Choose a reason for hiding this comment

SirTyson Feb 11, 2025

Choose a reason for hiding this comment

SirTyson Feb 11, 2025 • edited Loading

Choose a reason for hiding this comment

graydon Feb 15, 2025

Choose a reason for hiding this comment

graydon commented Feb 15, 2025

graydon commented Jan 11, 2025 •

edited

Loading

graydon commented Jan 23, 2025 •

edited

Loading

SirTyson Feb 11, 2025 •

edited

Loading