[BUG] split should not happen immediately after fallback on memory contention (OOM state machine bug) #12158
Comments
Okay, I reproduced the error with the transition log enabled, and I understand the problem now. It looks like it is an issue with the test. When a rollback happens, all of the memory needs to be freed or made spillable. Here is a summary of the log transitions with some notes.
THREAD_1 -> RUNNING // INIT THREAD
If I update the test to free
The reason this works is because
Hi Bobby, the test case is an abstraction of my real-world case.
The above code will throw the same exception. The reason is that when thread2 falls back, it will not necessarily trigger its content in the SpillFramework, so thread1 will never see the DEALLOC event. One workaround would be making sure
I spoke with @binmahone, and the code is complex enough that I will try and make some changes to the state machine to let us retry the allocation before we go to BUFN_THROW for the last thread. The alternative would be to update the state machine each time we make a buffer spillable.
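To make the DEALLOC point above concrete, here is a toy model of the situation. Everything in it (`ToyPool`, `make_spillable`, `spill`, the `dealloc_events` counter) is hypothetical and is not the SpillFramework or state machine API. It only illustrates the assumption that marking a buffer spillable leaves the bytes resident and emits no dealloc notification, so a waiting thread makes no progress until something actually spills or frees.

```python
class ToyPool:
    """Hypothetical fixed-size memory pool; NOT the real spark-rapids API."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0            # bytes currently resident
        self.spillable = 0       # resident bytes that are marked spillable
        self.dealloc_events = 0  # notifications a blocked thread could wake on

    def alloc(self, n):
        # Succeeds only if the request fits in the remaining space.
        if self.used + n > self.limit:
            return False
        self.used += n
        return True

    def free(self, n):
        # Releasing memory is the only thing that emits a dealloc event.
        self.used -= n
        self.dealloc_events += 1

    def make_spillable(self, n):
        # Assumption: the bytes stay resident and no event is emitted.
        self.spillable += n

    def spill(self, n):
        # Evict up to n spillable bytes; this does release memory.
        n = min(n, self.spillable)
        self.spillable -= n
        self.free(n)
        return n


pool = ToyPool(limit=100)
t1_holds = pool.alloc(60)             # thread2's data in the toy scenario
t0_first_try = pool.alloc(60)         # thread1 cannot fit: 60 + 60 > 100
pool.make_spillable(60)               # thread2 rolls back, nothing is freed
t0_after_rollback = pool.alloc(60)    # still fails, and no event was seen
events_after_rollback = pool.dealloc_events
pool.spill(60)                        # retry path: spill, then allocate again
t0_after_spill = pool.alloc(60)
```

In this toy, the allocation only succeeds once a spill is attempted as part of the retry, which is the behavior the proposed state machine change is after. Updating the state machine whenever a buffer becomes spillable would correspond to emitting an event from `make_spillable`.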
Describe the bug
Suppose we have two threads T0 and T1, both running a withRetryNoSplit block.
From my observation of the logs, this is what seems to be happening:
However, the expected behavior should be: "If we are at a point where everything is rolled back except for a single thread and it cannot make progress, then we ask it to try and make the input smaller and try again." (quoted from the design doc). So at step 6, shouldn't T0 at least try to run the code block again without splitting? At this point T0 is running exclusively, so it should be able to make progress (T0 can succeed if T1 is not contending for memory with it).
Split should be the last resort if T0 fails again in the solo run.
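A minimal sketch of the ordering being asked for, written as a plain retry helper. `ToyOOM`, `run_solo_first`, `attempt`, and `split` are all made-up names for illustration, and the helper assumes the caller is already the only thread that has not been rolled back. It is not how withRetryNoSplit is implemented.

```python
class ToyOOM(Exception):
    """Hypothetical stand-in for an out-of-memory signal."""


def run_solo_first(attempt, split, batch):
    """Try once, retry once with no competition, and only then split.

    attempt(batch) returns a result or raises ToyOOM.
    split(batch) returns a list of smaller batches.
    """
    for _ in range(2):  # first try, then one solo retry on the same input
        try:
            return [attempt(batch)]
        except ToyOOM:
            continue
    # Last resort: the solo run also failed, so make the input smaller.
    return [attempt(part) for part in split(batch)]


def halve(batch):
    mid = len(batch) // 2
    return [batch[:mid], batch[mid:]]


# Case 1: the first failure was only due to contention, so the solo retry works.
calls = {"attempt": 0}

def contended_once(batch):
    calls["attempt"] += 1
    if calls["attempt"] == 1:
        raise ToyOOM("another thread was holding memory")
    return sum(batch)

solo_result = run_solo_first(contended_once, halve, [1, 2, 3, 4])

# Case 2: the batch really is too big, so the split path is still reached.
def too_big(batch):
    if len(batch) > 2:
        raise ToyOOM("batch does not fit even when running alone")
    return sum(batch)

split_result = run_solo_first(too_big, halve, [1, 2, 3, 4])
```

In the first case the input is never split, which is the outcome the issue argues for. The second case shows that splitting remains available when the solo run fails as well.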
Steps/Code to reproduce bug
The problem can be reproduced by adding a new test case into HostAllocSuite. It's worth mentioning that although this example is for HostAlloc, the same logic applies for GPU memory, as they both share the same underlying OOM state machines.
will throw:
Expected behavior
We expect both thread1 and thread2 to fall back to the BUFN state. Then thread1 should make another attempt before splitting and succeed, after which thread2 should also retry and succeed.
Internally, we might need to add a state called THREAD_EXLU_RUNNING or THREAD_SOLO_RUNNING to the OOM state machine.
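As a rough illustration of where such a state could sit, here is a heavily simplified transition table. The state and event names are placeholders chosen for this sketch (only BUFN and the idea of a solo-running state come from the discussion above), and the real OOM state machine has more states and different triggers.

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    BUFN = auto()          # rolled back, blocked until further notice
    SOLO_RUNNING = auto()  # hypothetical: retrying with no other thread active
    SPLIT_THROW = auto()   # asked to split the input and retry


class Event(Enum):
    OOM = auto()
    ALL_OTHERS_ROLLED_BACK = auto()
    SUCCESS = auto()


# Assumed transitions for the proposal; anything not listed leaves the state unchanged.
TRANSITIONS = {
    (State.RUNNING, Event.OOM): State.BUFN,
    (State.BUFN, Event.ALL_OTHERS_ROLLED_BACK): State.SOLO_RUNNING,
    (State.SOLO_RUNNING, Event.SUCCESS): State.RUNNING,
    (State.SOLO_RUNNING, Event.OOM): State.SPLIT_THROW,
}


def step(state, event):
    return TRANSITIONS.get((state, event), state)


def trace(start, events):
    """Return every state visited, including the starting one."""
    states = [start]
    for event in events:
        states.append(step(states[-1], event))
    return states


# thread1 in the expected behavior: fall back, retry alone, succeed.
happy_path = trace(
    State.RUNNING,
    [Event.OOM, Event.ALL_OTHERS_ROLLED_BACK, Event.SUCCESS],
)

# Split is reached only after the solo run also hits an OOM.
split_path = trace(
    State.RUNNING,
    [Event.OOM, Event.ALL_OTHERS_ROLLED_BACK, Event.OOM],
)
```

The point of the extra state is that the second OOM, not the first, is what leads to a split request.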