-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015
[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel)
a discussion (no related file):
Why does this fix the issue? What was the reason for the segfault?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @mkow)
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
Why does this fix the issue? What was the reason for the segfault?
Below is the error log with SIGSEGV:
(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))
Using the above addresses we could get the functions abort->hlt
from objdump of libc.
We tried to debug the issue using GDB to figure out the cause of abort
call inside libc
but issue is not reproducing with GDB, tried multiple times but always reproduces without GDB, seems like some race
in Gramine when using pal_describe_location
only with specific configuration.
Hence we plan to remove this function call in coming release and analyze this issue after the release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @jkr0103)
a discussion (no related file):
Previously, jkr0103 (Jitender Kumar) wrote…
Below is the error log with SIGSEGV:
(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))
Using the above addresses we could get the functions
abort->hlt
from objdump of libc.We tried to debug the issue using GDB to figure out the cause of
abort
call insidelibc
but issue is not reproducing with GDB, tried multiple times but always reproduces without GDB, seems like somerace
in Gramine when usingpal_describe_location
only with specific configuration.Hence we plan to remove this function call in coming release and analyze this issue after the release.
Please debug this, it may be something serious. This PR sounds like randomly changing code until it stops crashing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @jkr0103)
-- commits
line 2 at r1:
This should be like this:
[PAL/{Linux,Linux-SGX}] Do not try to describe RIP location on syscall instruction
Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls")
added a debug print for every encountered raw syscall instruction. Each print
describes what system call number is invoked and at which address in the binary.
The address is fed to the helper function `pal_describe_location(uc->rip)` which
tries to find the binary + function name and put them in human-readable form
into the provided buffer.
For some reason, the snippet used in that commit (allocating a 128-byte buffer
on signal-handling stack and calling `pal_describe_location()`) leads to a
non-deterministic memory corruption on some workloads. That bug is hardly
reproducible, and it is not fixed by e.g. increasing the signal stack size inside
Gramine. The `pal_describe_location()` control path also seems correct.
As the true root cause for this bug is not yet found, this commit introduces
a workaround: temporarily removing the stack-allocated buffer and the
correspoding `pal_describe_location()` call, and instead printing the raw
RIP value.
Signed-off-by: Adarsh Anand <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @mkow)
a discussion (no related file):
Please debug this, it may be something serious. This PR sounds like randomly changing code until it stops crashing...
We tried to debug this, but failed. The issue was (for some reason) reproducible only in a particular Ubuntu 20.04 Docker container on a particular machine.
The issue seems to be a memory corruption, as it is not deterministic. In fact, the message that Jitender and Anjali put here ( error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL
) seems to me misleading, as this error is just a manifestation of Glibc's abort()
function that performs a HLT instruction -- and Gramine is not expecting the untrusted runtime's Glibc to execute such HLT instructions.
Anyway, experimenting with different stack sizes didn't help at all. This was my first guess: a stack buffer overflow, since we allocate a 128-byte buffer on a signal-handling stack. However, this is not the root cause.
Next theory was (and still is) that there is some incorrect locking/synchronization inside pal_describe_location()
. But inspecting that function, I don't see anything wrong -- all sensitive operations are protected by locks. It does feel though like iterating through a GDB-specific list of loaded binaries has a bug somewhere, but I couldn't pinpoint where. This would explain both requirements for this bug: (1) the bug manifests only in debug build of Gramine, and (2) the bug is a memory corruption that seems to happen in the untrusted PAL.
This can probably be debugged in 2-3 days by a senior dev, but I am personally not taking any such requests anymore. And the time for the new Gramine release is approaching, so my recommendation would be to merge this PR, and properly debug later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @mkow)
pal/src/host/linux/pal_exception.c
line 94 at r1 (raw file):
" patching your application to use Gramine syscall API."); } log_trace("Emulating raw syscall instruction with number %d at address %p", info->si_syscall,
Please add above this line a comment so that we don't forget in the future:
/* FIXME: better to use `pal_describe_location()` (see example below), but it leads to a
* non-deterministic bug on some workloads */
pal/src/host/linux-sgx/pal_exception.c
line 235 at r1 (raw file):
" patching your application to use Gramine syscall API."); } log_trace("Emulating raw syscall instruction with number %lu at address %p",
ditto (addd comment)
b239fbb
to
671d944
Compare
…l instruction Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls") added a debug print for every encountered raw syscall instruction. Each print describes what system call number is invoked and at which address in the binary. The address is fed to the helper function `pal_describe_location(uc->rip)` which tries to find the binary + function name and put them in human-readable form into the provided buffer. For some reason, the snippet used in that commit (allocating a 128-byte buffer on signal-handling stack and calling `pal_describe_location()`) leads to a non-deterministic memory corruption on some workloads. That bug is hardly reproducible, and it is not fixed by e.g. increasing the signal stack size inside Gramine. The `pal_describe_location()` control path also seems correct. As the true root cause for this bug is not yet found, this commit introduces a workaround: temporarily removing the stack-allocated buffer and the correspoding `pal_describe_location()` call, and instead printing the raw RIP value. Signed-off-by: Adarsh Anand <[email protected]>
105ff99
to
522b2b3
Compare
…l instruction Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls") added a debug print for every encountered raw syscall instruction. Each print describes what system call number is invoked and at which address in the binary. The address is fed to the helper function `pal_describe_location(uc->rip)` which tries to find the binary + function name and put them in human-readable form into the provided buffer. For some reason, the snippet used in that commit (allocating a 128-byte buffer on signal-handling stack and calling `pal_describe_location()`) leads to a non-deterministic memory corruption on some workloads. That bug is hardly reproducible, and it is not fixed by e.g. increasing the signal stack size inside Gramine. The `pal_describe_location()` control path also seems correct. As the true root cause for this bug is not yet found, this commit introduces a workaround: temporarily removing the stack-allocated buffer and the correspoding `pal_describe_location()` call, and instead printing the raw RIP value. Signed-off-by: Adarsh Anand <[email protected]>
This PR has been moved to #2017 |
Description of the Changes
PR #1991 exposes a segmentation fault with the use of
pal_describe_location
as in below error log:Using the above addresses we could find the functions abort->hlt from objdump of libc.
We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue did not reproduced with GDB even after multiple tries.
Issue is consistent with Pytorch workload using curated apps (with debug build) without GDB, on a Azure VM, rarely on our lab servers.
Also tested by increasing the ENCLAVE_SIG_STACK_SIZE and ALT_STACK_SIZE size to 256KB but that also did not help.
Likely some race in Gramine when using
pal_describe_location
which we are not hitting with GDB as GDB makes the execution slow.We want to remove this function call in coming release and analyze this issue after the release.
Fixes gramineproject/contrib#87
This change isdata:image/s3,"s3://crabby-images/d0bb7/d0bb7f7625ca5bf5c3cf7a2b7a514cf841ab8395" alt="Reviewable"