Per Script Invocation Lua Memory Limits #903
…ust to prove it's possible; squashing to clean up _a lot_ of experimentation commits
… for LuaScripts, fixes that
It will probably be unusual to use this allocator, but it shouldn't be _bad_ either.
This would be the most concerning for the PR. What is causing this drop, and if it is the trampoline, then is there a way to enable an unsafe mode that avoids this overhead?
public SessionScriptCache(StoreWrapper storeWrapper, IGarnetAuthenticator authenticator, ILogger logger = null)
{
    this.storeWrapper = storeWrapper;
    this.logger = logger;

    scratchBufferNetworkSender = new ScratchBufferNetworkSender();
    processor = new RespServerSession(0, scratchBufferNetworkSender, storeWrapper, null, authenticator, false);

    // There's some parsing involved in these, so save them off per-session
    memoryManagementMode = storeWrapper.serverOptions.LuaOptions.MemoryManagementMode;
it seems these lines are causing BDN for BasicOperations, ObjectOperations, HashObjectOperations to fail as something (perhaps storeWrapper) is null here:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
---> System.NullReferenceException: Object reference not set to an instance of an object.
at Garnet.server.SessionScriptCache..ctor(StoreWrapper storeWrapper, IGarnetAuthenticator authenticator, ILogger logger) in /_/libs/server/Lua/SessionScriptCache.cs:line 41
at Garnet.server.RespServerSession..ctor(Int64 id, INetworkSender networkSender, StoreWrapper storeWrapper, SubscribeBroker`3 subscribeBroker, IGarnetAuthenticator authenticator, Boolean enableScripts) in /_/libs/server/Resp/RespServerSession.cs:line 221
at Embedded.server.EmbeddedRespServer.GetRespSession() in /_/benchmark/BDN.benchmark/Embedded/EmbeddedRespServer.cs:line 41
at BDN.benchmark.Operations.OperationsBase.GlobalSetup() in /_/benchmark/BDN.benchmark/Operations/OperationsBase.cs:line 80
at BDN.benchmark.Operations.BasicOperations.GlobalSetup() in /_/benchmark/BDN.benchmark/Operations/BasicOperations.cs:line 20
at BenchmarkDotNet.Engines.EngineFactory.CreateReadyToRun(EngineParameters engineParameters)
at BenchmarkDotNet.Autogenerated.Runnable_0.Run(IHost host, String benchmarkName) in /_/benchmark/BDN.benchmark/bin/Release/net8.0/cb61c2e4-da46-43ab-8a17-882e6ff8a654/cb61c2e4-da46-43ab-8a17-882e6ff8a654.notcs:line 177
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
--- End of inner exception stack trace ---
at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[] args) in /_/benchmark/BDN.benchmark/bin/Release/net8.0/cb61c2e4-da46-43ab-8a17-882e6ff8a654/cb61c2e4-da46-43ab-8a17-882e6ff8a654.notcs:line 57
Example action run: https://github.com/microsoft/garnet/actions/runs/12681127499/job/35344227191
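For illustration only, one way to make the option read tolerant of a missing link in the `storeWrapper.serverOptions.LuaOptions` chain is a null-conditional lookup with a fallback. The types below are hypothetical stand-ins named after the identifiers in the excerpt and stack trace, not the real Garnet classes, and falling back to `Native` is an assumption based on it being described as the default in this PR.

```csharp
using System;

// Hypothetical stand-ins for the Garnet types named in the stack trace;
// the real classes have many more members.
public enum LuaMemoryManagementMode { Native, Tracked, Managed }

public sealed class LuaOptions
{
    public LuaMemoryManagementMode MemoryManagementMode { get; init; } = LuaMemoryManagementMode.Native;
}

public sealed class ServerOptions
{
    // Assumed to be null when Lua is not configured (e.g. in a benchmark harness).
    public LuaOptions LuaOptions { get; init; }
}

public sealed class StoreWrapper
{
    public ServerOptions serverOptions;
}

public static class LuaOptionReader
{
    // Returns Native when any link in the chain is null instead of throwing.
    public static LuaMemoryManagementMode ReadMode(StoreWrapper storeWrapper)
        => storeWrapper?.serverOptions?.LuaOptions?.MemoryManagementMode
           ?? LuaMemoryManagementMode.Native;
}
```

Whether a silent fallback is preferable to having the benchmark setup supply Lua options is a separate design decision; the sketch only shows the shape of the guard.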
When I run the BDN Operations.ScriptOperations - the allocated value for "LargeScript" is now showing 23 bytes when it used to be 12. Is that expected / OK?
Just a heads up ... I will push an update to this PR to do the following
- Update the BDN CI Action YML so it runs (and charts) Lua.LuaScriptCacheOperations and Lua.LuaRunnerOperations
- Update BDN_Benchmark_Config.json with most recent allocated byte numbers. This file is the "ground truth" of what we expect the Allocated value is. Since I don't have history of your new BDN metrics, I will set the "expected" values to what we are seeing currently.
> When I run the BDN Operations.ScriptOperations - the allocated value for "LargeScript" is now showing 23 bytes when it used to be 12. Is that expected / OK?

In my experience BDN memory tracking that gets down to just a handful of bytes is kinda inherently variable, so 23-vs-12 isn't concerning.

> I am also seeing this in the BDN.benchmark.Lua.LuaRunnerOperations which means the BDN is failing to create the metrics.

I don't see those results in the link you shared? Should I be looking somewhere else?
Ok regarding the 23 vs 12. I will update the expected value.
The BDN.benchmark.Lua.LuaRunnerOperations were not part of our test runs. It will be part of my push to the PR
I have seen some runs where the allocated value is 668, but on a rerun it comes back as 1024 or 1312. I have seen this on the same platform, and also across platforms. For example:
On Windows: LookupHit Tracked,None = 668
On Linux: LookupHit Tracked,None = 1024
Our expected value doesn't differentiate by platform, so I can just put 1024 and they both will pass.
My question - is it expected that the allocated value can vary a bit from run to run (even on the same platform)?
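As a rough sketch of why a single expected value of 1024 lets both the 668 and 1024 readings pass, the check below treats the expected value as a ceiling with an optional percentage of headroom. This rule is an assumption made for illustration; the actual comparison in the BDN CI scripts is not shown in this thread and may differ, and `AllocationCheck` is a hypothetical name.

```csharp
using System;

public static class AllocationCheck
{
    // Hypothetical rule: pass when the measured allocation does not exceed
    // the expected value by more than tolerancePercent.
    public static bool IsWithinExpected(double actualBytes, double expectedBytes, double tolerancePercent = 0)
    {
        if (expectedBytes < 0) throw new ArgumentOutOfRangeException(nameof(expectedBytes));
        if (tolerancePercent < 0) throw new ArgumentOutOfRangeException(nameof(tolerancePercent));

        double ceiling = expectedBytes * (1 + tolerancePercent / 100.0);
        return actualBytes <= ceiling;
    }
}
```

Under this rule the 1312 reading mentioned above would fail against 1024 unless some headroom is allowed, which is one reason run-to-run variance matters when choosing the expected value.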
** Pushed my changes **
BDN run with the latest BDN fixes, two new Lua BDNs (Lua.LuaScriptCacheOperations and Lua.LuaRunnerOperations), and my fixes to expected values: https://github.com/microsoft/garnet/actions/runs/12698624593
Reminder - all results log files are at the bottom of the BDN test
Looks like the LuaRunnerOperations BDN test itself is failing, as results are coming back NA.
CompileForSessionSmall | Managed,Limit | NA | NA | NA | NA | NA | NA | NA | NA |
From Results file:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
---> Garnet.common.GarnetException: Failed to write to response buffer
at Garnet.common.GarnetException.Throw(String message, LogLevel logLevel) in /_/libs/common/GarnetException.cs:line 57
at Garnet.server.LuaRunner.CompileCommon[TResponse](TResponse& resp) in /_/libs/server/Lua/LuaRunner.cs:line 564
at Garnet.server.LuaRunnerTrampolines.CompileForSession(IntPtr luaState) in /_/libs/server/Lua/LuaRunner.cs:line 1568
at Garnet.server.NativeMethods.lua_pcallk(IntPtr luaState, Int32 nargs, Int32 nresults, Int32 msgh, IntPtr ctx, IntPtr k)
at Garnet.server.LuaRunner.CompileForSession(RespServerSession session) in /_/libs/server/Lua/LuaRunner.cs:line 453
at BDN.benchmark.Lua.LuaRunnerOperations.CompileForSessionSmall() in /_/benchmark/BDN.benchmark/Lua/LuaRunnerOperations.cs:line 218
at BenchmarkDotNet.Autogenerated.Runnable_4.WorkloadActionUnroll(Int64 invokeCount) in /_/benchmark/BDN.benchmark/bin/Release/net8.0/1e3667a0-3c30-49b8-9ec2-d2045162aeb7/1e3667a0-3c30-49b8-9ec2-d2045162aeb7.notcs:line 1068
at BenchmarkDotNet.Engines.Engine.Measure(Action`1 action, Int64 invokeCount)
at BenchmarkDotNet.Engines.Engine.RunIteration(IterationData data)
at BenchmarkDotNet.Engines.EngineStage.RunIteration(IterationMode mode, IterationStage stage, Int32 index, Int64 invokeCount, Int32 unrollFactor)
at BenchmarkDotNet.Engines.EngineStage.Run(IStoppingCriteria criteria, Int64 invokeCount, IterationMode mode, IterationStage stage, Int32 unrollFactor)
at BenchmarkDotNet.Engines.EngineWarmupStage.Run(Int64 invokeCount, IterationMode iterationMode, Int32 unrollFactor, RunStrategy runStrategy)
at BenchmarkDotNet.Engines.EngineWarmupStage.RunWorkload(Int64 invokeCount, Int32 unrollFactor, RunStrategy runStrategy)
at BenchmarkDotNet.Engines.Engine.Run()
at BenchmarkDotNet.Autogenerated.Runnable_4.Run(IHost host, String benchmarkName) in /_/benchmark/BDN.benchmark/bin/Release/net8.0/1e3667a0-3c30-49b8-9ec2-d2045162aeb7/1e3667a0-3c30-49b8-9ec2-d2045162aeb7.notcs:line 951
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
--- End of inner exception stack trace ---
at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)
at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[] args) in /_/benchmark/BDN.benchmark/bin/Release/net8.0/1e3667a0-3c30-49b8-9ec2-d2045162aeb7/1e3667a0-3c30-49b8-9ec2-d2045162aeb7.notcs:line 57
// AfterAll
1) Added a check for NA in results which is an indication that the BDN test failed at run time 2) Added 'Lua.LuaScriptCacheOperations','Lua.LuaRunnerOperations' to BDN Github Action 3) Updated Expected values for the new Lua BDN tests
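A minimal sketch of what an "NA" check over a results row could look like, using the pipe-delimited row quoted above as input. The real check lives in the BDN GitHub Action tooling and is not shown in this thread, so `BdnResultCheck` and its parsing rule are assumptions for illustration.

```csharp
using System;
using System.Linq;

public static class BdnResultCheck
{
    // Splits a pipe-delimited results row into trimmed cells and reports
    // whether any cell is exactly "NA", which indicates the benchmark
    // failed at run time and produced no metrics.
    public static bool RowHasFailedMetrics(string row)
    {
        string[] cells = row.Split('|', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
        return cells.Any(cell => cell == "NA");
    }
}
```

Applied to the `CompileForSessionSmall | Managed,Limit | NA | ...` row above, the check flags the row as failed.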
…into luaMemoryLimits
Doing some light profiling, it's the extra pcall layer. I'll look at clawing some of this back.
Another decent sized one, though hopefully this is the last "big" Lua PR - the rest I can foresee should be smaller.
TODOs:
- Are memory pressure updates necessary? Have a thread with .NET GC folks for this. Got our answer, they are correct to have here.
- Behavior when scripts aborted? Redis is weird here. It's reasonable for writes that happened pre-abort to still happen. We can explore rollback if there's a pressing need, but it's non-trivial.

This introduces the ability to specify maximum memory limits for Lua scripts; currently this is a single config (`--lua-script-memory-limit`). To enable this we also have to introduce custom allocators (`--lua-memory-management-mode`) for Lua. There are 3 in this PR: `Native` (the current behavior, where Lua provides the allocator), `Tracked` (where memory is acquired with `NativeMemory` and GC pressure is updated), and `Managed` (where a POH array is pre-allocated and memory is obtained from a freelist punned over that allocation).

In order to gracefully handle Lua OOMs, more of the operation of `LuaRunner` (things like compilation and the preamble) is hidden behind Lua PCalls. This is a necessary change, as the default behavior of Lua is to abort the process in the face of OOMs - PCalls prevent that.

To make the PCall changes less expensive (and just generally less awful), I introduced some (Strong, not Pinned) GCHandles, function pointers, and trampolines. At the end of this, we're basically just using KeraLua to package Lua and define some constants - none of the .NET code is really running anymore. If we really wanted, we could build Lua ourselves (maybe even drop down to 5.1 to match Redis) and exploit that tight coupling - but I have no intention of doing so at this time.

When improving the Lua OOM RESP error, I also found a bug in a previous PR around buffer management - it is fixed in this commit.
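The GCHandle-plus-trampoline pattern mentioned above can be sketched roughly as follows. This is not the PR's code: `Runner` is a hypothetical stand-in for `LuaRunner`, and the sketch hands the handle straight to the trampoline, whereas the real trampolines receive a Lua state pointer (see `LuaRunnerTrampolines.CompileForSession(IntPtr luaState)` in the stack trace) and must recover the handle from it somehow.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the object that owns script state.
public sealed class Runner
{
    public int Calls { get; private set; }

    // Pretend work; returns 0 for success in the style of a native status code.
    public int Compile()
    {
        Calls++;
        return 0;
    }
}

public static class Trampolines
{
    // Static entry point with a native-friendly shape: opaque pointer in,
    // status code out. It recovers the managed object from the handle and
    // forwards to the instance method.
    public static int CompileTrampoline(IntPtr state)
    {
        var runner = (Runner)GCHandle.FromIntPtr(state).Target;
        return runner.Compile();
    }
}

public static class TrampolineDemo
{
    public static int Run()
    {
        var runner = new Runner();

        // GCHandleType.Normal is a strong handle: it keeps the object alive
        // without pinning it, and the IntPtr is an opaque token rather than
        // the object's address.
        GCHandle handle = GCHandle.Alloc(runner, GCHandleType.Normal);
        try
        {
            IntPtr state = GCHandle.ToIntPtr(handle);
            Trampolines.CompileTrampoline(state);
            Trampolines.CompileTrampoline(state);
        }
        finally
        {
            handle.Free();
        }

        return runner.Calls;
    }
}
```

To hand such a method to native code as a function pointer, .NET requires it to be marked `[UnmanagedCallersOnly]` and its address taken in an unsafe context; that part is omitted here so the sketch runs without native dependencies.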
The Allocators
Native
This is the default.
This just uses the built-in Lua allocator, which is a thin shim over `malloc`. It should perform a bit better than `Tracked` simply because there isn't any .NET code in the way.
Native does not support memory limits.
Tracked
A thin wrapper over `NativeMemory`. It supports memory limits, and will fail once total requested bytes exceeds the configured limit. Since it cannot see the overhead of `NativeMemory`, the limit is only softly enforced.
This currently calls GC.(Add|Remove)MemoryPressure, but see Open Questions.
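A minimal sketch of the idea behind a tracked allocator, assuming the shape of Lua's allocator callback (a single realloc-style function where a new size of 0 means free and a null return signals failure). `TrackedAllocator` and its members are hypothetical names, not the PR's implementation, and the block needs `AllowUnsafeBlocks` enabled to compile.

```csharp
using System;
using System.Runtime.InteropServices;

// Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project file.
public sealed unsafe class TrackedAllocator
{
    private readonly long? limitBytes;

    // Sum of the sizes Lua has requested; NativeMemory's own bookkeeping
    // overhead is invisible here, which is why the limit is only soft.
    public long AllocatedBytes { get; private set; }

    public TrackedAllocator(long? limitBytes = null) => this.limitBytes = limitBytes;

    // Mirrors the contract of Lua's allocator callback.
    public void* Realloc(void* ptr, nuint oldSize, nuint newSize)
    {
        // When ptr is null, Lua uses the old-size argument as a type tag,
        // so it must not be counted as existing bytes.
        long current = ptr == null ? 0 : (long)oldSize;

        if (newSize == 0)
        {
            NativeMemory.Free(ptr); // no-op for null
            Adjust(-current);
            return null;
        }

        long delta = (long)newSize - current;
        if (limitBytes is long limit && delta > 0 && AllocatedBytes + delta > limit)
        {
            return null; // over the configured limit: report allocation failure
        }

        void* result;
        try
        {
            result = NativeMemory.Realloc(ptr, newSize);
        }
        catch (OutOfMemoryException)
        {
            return null;
        }

        Adjust(delta);
        return result;
    }

    private void Adjust(long delta)
    {
        AllocatedBytes += delta;

        // Both GC calls reject non-positive arguments, hence the branches.
        if (delta > 0) GC.AddMemoryPressure(delta);
        else if (delta < 0) GC.RemoveMemoryPressure(-delta);
    }
}
```

Shrinking requests always bypass the limit check, since their delta is negative, so only growth can fail against the limit.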
Managed (w/ and w/o Limits)
A really basic free-list based allocator over a POH array. It pre-allocates the total limit, and (if one is configured) it strictly limits allocations since the overhead can be seen.
If a limit is not configured, 2MB (or larger, if the requested size exceeds 2MB) arrays are allocated as needed.
We could certainly do a lot better here (I imagine there's something existing in Garnet I could steal or repurpose), but this is mostly a proof we could get Lua 100% onto the managed heap. That said, I couldn't help but profile a little bit, so it shouldn't be awful given Lua's allocation patterns.
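To make "a free-list punned over a POH array" concrete, here is a deliberately simple first-fit sketch. It is an illustration under assumptions, not the PR's allocator: `FreeListAllocator` is a hypothetical name, blocks are addressed by array offset rather than raw pointer, and freed blocks are never coalesced. Because block headers live inside the same array, their overhead counts against the capacity, which is the property that lets a limit be enforced strictly.

```csharp
using System;
using System.Buffers.Binary;

public sealed class FreeListAllocator
{
    private const int HeaderSize = 8; // [int blockSize][int nextFreeOffset]
    private const int None = -1;

    private readonly byte[] buffer;
    private int freeHead;

    public FreeListAllocator(int capacityBytes)
    {
        if (capacityBytes < HeaderSize) throw new ArgumentOutOfRangeException(nameof(capacityBytes));

        // pinned: true places the array on the Pinned Object Heap.
        buffer = GC.AllocateArray<byte>(capacityBytes, pinned: true);
        WriteHeader(0, capacityBytes, None); // one free block spanning everything
        freeHead = 0;
    }

    // Returns the offset of the usable bytes, or -1 when nothing fits.
    public int Allocate(int size)
    {
        if (size <= 0 || size > buffer.Length) return None;

        int needed = (size + HeaderSize + 7) & ~7; // header plus payload, 8-byte aligned
        int prev = None;
        for (int cur = freeHead; cur != None;)
        {
            int blockSize = ReadInt(cur);
            int next = ReadInt(cur + 4);
            if (blockSize >= needed)
            {
                int successor = next;
                if (blockSize - needed >= 2 * HeaderSize)
                {
                    // Split: the tail of this block stays on the free list.
                    successor = cur + needed;
                    WriteHeader(successor, blockSize - needed, next);
                    blockSize = needed;
                }

                if (prev == None) freeHead = successor;
                else WriteInt(prev + 4, successor);

                WriteHeader(cur, blockSize, None);
                return cur + HeaderSize;
            }

            prev = cur;
            cur = next;
        }

        return None;
    }

    // Pushes the block back onto the head of the free list.
    public void Free(int offset)
    {
        int block = offset - HeaderSize;
        WriteInt(block + 4, freeHead);
        freeHead = block;
    }

    // View over the usable bytes of an allocated block.
    public Span<byte> Data(int offset) => buffer.AsSpan(offset, ReadInt(offset - HeaderSize) - HeaderSize);

    private int ReadInt(int at) => BinaryPrimitives.ReadInt32LittleEndian(buffer.AsSpan(at, 4));

    private void WriteInt(int at, int value) => BinaryPrimitives.WriteInt32LittleEndian(buffer.AsSpan(at, 4), value);

    private void WriteHeader(int at, int size, int next)
    {
        WriteInt(at, size);
        WriteInt(at + 4, next);
    }
}
```

With a 64-byte capacity, two 16-byte allocations fit and a third fails, because each allocation also consumes an 8-byte header; freeing the first block makes its space reusable.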
Open Questions
Is GC pressure actually needed in the `Tracked` case?
Docs say:
Which makes very little sense to me, as in a container (like a job) with memory limits the presence or lack of a finalizer seems irrelevant to whether the GC needs to be informed of native allocations?
Ultimately the .NET GC folks will just have to answer this one, I've opened a thread with them.
Docs are (somewhat) incorrect here, and will be updated. It is correct, but not strictly necessary, to have these calls in the Tracked case. I'm leaving them in so the GC can respond more promptly to memory pressure.
What is expected behavior when a script is aborted?
This change introduces a case where a script might be aborted, and I expect future changes (timeouts, and potentially `SCRIPT|KILL`) to add more.
Redis doesn't allow this - you are expected to let Redis crash, or force a shutdown, if a script goes out of control. That's kind of nuts, IMO, especially for any HA service.
However, by deviating from Redis (with this opt-in switch), we do need to define expected behavior.
Right now, the behavior is "any commands that executed in the script, executed". Commands cannot half-execute, but scripts can, basically.
Is this acceptable, or do we need some (presumably configurable) rollback behavior?
With transactions enabled we already know the scope of "needs to be rolled back", but the implementation would be non-trivial.
Decision: No rollbacks
Summarizing some discussion:
Benchmarks
I changed `ScriptOperations` to use `LuaParams` instead of `OperationParams` as we were already ignoring most of the operation variants there. Now all Lua-related benchmarks run with different allocators enabled: Native (the old behavior, and current default), Tracked w/ 2M limit, Tracked w/o a limit, Managed w/ 2M limit, and Managed w/o limit.
`main` results are as of `ce21c248f084744e45bbff08d0ecce0a51326cca`. `luaMemoryLimits` are as of `a2996e9ae5f7e9c44a8848e44cc91417ddf418c4`.
Broadly speaking, we're giving up a bit of perf for the ability to recover from OOMs (and other runtime errors, technically). There's some work that could be done to claw bits of this back, in theory, but we are actually doing more with this change.
LuaRunnerOperations
Comparing the baseline and the `Native,None` case, we're giving up a small amount across the board. Worst case ~9%, though these are very fast (ns) already.
main
luaMemoryLimits
LuaScriptCacheOperations
Cases where we construct a new `LuaRunner` are a bit slower, though most of these are in the error bounds.
main
luaMemoryLimits
LuaScripts
Giving up ~32% in the worst case (comparing baseline to `Native,None`, Script4).
main
luaMemoryLimits
ScriptOperations
This is more of a mixed bag, LargeScript is improved somewhat (~6%), while very basic evaluations like Eval and EvalSha are a bit slower. The loss is probably due to the pcall, and the gains are probably peanut butter improvements in calls to and from Lua from .NET.
main (eliding Params != None)
luaMemoryLimits