[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015

adarshan-intel · 2024-10-04T10:30:43Z

Description of the Changes

PR #1991 exposes a segmentation fault with the use of pal_describe_location as in below error log:

(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))

Using the above addresses we could find the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue did not reproduced with GDB even after multiple tries.

Issue is consistent with Pytorch workload using curated apps (with debug build) without GDB, on a Azure VM, rarely on our lab servers.

Also tested by increasing the ENCLAVE_SIG_STACK_SIZE and ALT_STACK_SIZE size to 256KB but that also did not help.

Likely some race in Gramine when using pal_describe_location which we are not hitting with GDB as GDB makes the execution slow.

We want to remove this function call in coming release and analyze this issue after the release.

Fixes gramineproject/contrib#87

This change is

mkow

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel)

a discussion (no related file):
Why does this fix the issue? What was the reason for the segfault?

jkr0103

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @mkow)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

Why does this fix the issue? What was the reason for the segfault?

Below is the error log with SIGSEGV:

(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))

Using the above addresses we could get the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue is not reproducing with GDB, tried multiple times but always reproduces without GDB, seems like some race in Gramine when using pal_describe_location only with specific configuration.

Hence we plan to remove this function call in coming release and analyze this issue after the release.

mkow

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @jkr0103)

a discussion (no related file):

Previously, jkr0103 (Jitender Kumar) wrote…

Below is the error log with SIGSEGV:
(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))
Using the above addresses we could get the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue is not reproducing with GDB, tried multiple times but always reproduces without GDB, seems like some race in Gramine when using pal_describe_location only with specific configuration.

Hence we plan to remove this function call in coming release and analyze this issue after the release.

Please debug this, it may be something serious. This PR sounds like randomly changing code until it stops crashing...

dimakuv

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @jkr0103)

-- commits line 2 at r1:
This should be like this:

[PAL/{Linux,Linux-SGX}] Do not try to describe RIP location on syscall instruction

Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls")
added a debug print for every encountered raw syscall instruction. Each print
describes what system call number is invoked and at which address in the binary.
The address is fed to the helper function `pal_describe_location(uc->rip)` which
tries to find the binary + function name and put them in human-readable form
into the provided buffer.

For some reason, the snippet used in that commit (allocating a 128-byte buffer
on signal-handling stack and calling `pal_describe_location()`) leads to a
non-deterministic memory corruption on some workloads. That bug is hardly
reproducible, and it is not fixed by e.g. increasing the signal stack size inside
Gramine. The `pal_describe_location()` control path also seems correct.

As the true root cause for this bug is not yet found, this commit introduces
a workaround: temporarily removing the stack-allocated buffer and the
correspoding `pal_describe_location()` call, and instead printing the raw
RIP value.

Signed-off-by: Adarsh Anand <[email protected]>

dimakuv

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @mkow)

a discussion (no related file):

Please debug this, it may be something serious. This PR sounds like randomly changing code until it stops crashing...

We tried to debug this, but failed. The issue was (for some reason) reproducible only in a particular Ubuntu 20.04 Docker container on a particular machine.

The issue seems to be a memory corruption, as it is not deterministic. In fact, the message that Jitender and Anjali put here ( error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL) seems to me misleading, as this error is just a manifestation of Glibc's abort() function that performs a HLT instruction -- and Gramine is not expecting the untrusted runtime's Glibc to execute such HLT instructions.

Anyway, experimenting with different stack sizes didn't help at all. This was my first guess: a stack buffer overflow, since we allocate a 128-byte buffer on a signal-handling stack. However, this is not the root cause.

Next theory was (and still is) that there is some incorrect locking/synchronization inside pal_describe_location(). But inspecting that function, I don't see anything wrong -- all sensitive operations are protected by locks. It does feel though like iterating through a GDB-specific list of loaded binaries has a bug somewhere, but I couldn't pinpoint where. This would explain both requirements for this bug: (1) the bug manifests only in debug build of Gramine, and (2) the bug is a memory corruption that seems to happen in the untrusted PAL.

This can probably be debugged in 2-3 days by a senior dev, but I am personally not taking any such requests anymore. And the time for the new Gramine release is approaching, so my recommendation would be to merge this PR, and properly debug later.

dimakuv

Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @mkow)

pal/src/host/linux/pal_exception.c line 94 at r1 (raw file):

                       " patching your application to use Gramine syscall API.");
        }
        log_trace("Emulating raw syscall instruction with number %d at address %p", info->si_syscall,

Please add above this line a comment so that we don't forget in the future:

/* FIXME: better to use `pal_describe_location()` (see example below), but it leads to a
 *        non-deterministic bug on some workloads */

pal/src/host/linux-sgx/pal_exception.c line 235 at r1 (raw file):

                       " patching your application to use Gramine syscall API.");
        }
        log_trace("Emulating raw syscall instruction with number %lu at address %p",

ditto (addd comment)

…l instruction Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls") added a debug print for every encountered raw syscall instruction. Each print describes what system call number is invoked and at which address in the binary. The address is fed to the helper function `pal_describe_location(uc->rip)` which tries to find the binary + function name and put them in human-readable form into the provided buffer. For some reason, the snippet used in that commit (allocating a 128-byte buffer on signal-handling stack and calling `pal_describe_location()`) leads to a non-deterministic memory corruption on some workloads. That bug is hardly reproducible, and it is not fixed by e.g. increasing the signal stack size inside Gramine. The `pal_describe_location()` control path also seems correct. As the true root cause for this bug is not yet found, this commit introduces a workaround: temporarily removing the stack-allocated buffer and the correspoding `pal_describe_location()` call, and instead printing the raw RIP value. Signed-off-by: Adarsh Anand <[email protected]>

adarshan-intel · 2024-10-07T08:33:49Z

This PR has been moved to #2017

adarshan-intel changed the title ~~[PAL/{Linux,Linux-SGX}] Removing buffer and describe location functions~~ [PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function Oct 4, 2024

mkow requested changes Oct 4, 2024

View reviewed changes

jkr0103 reviewed Oct 5, 2024

View reviewed changes

mkow requested changes Oct 5, 2024

View reviewed changes

dimakuv suggested changes Oct 7, 2024

View reviewed changes

dimakuv reviewed Oct 7, 2024

View reviewed changes

dimakuv suggested changes Oct 7, 2024

View reviewed changes

adarshan-intel force-pushed the adarsh/raw_syscall_fix branch from b239fbb to 671d944 Compare October 7, 2024 07:41

adarshan-intel force-pushed the adarsh/raw_syscall_fix branch from 105ff99 to 522b2b3 Compare October 7, 2024 07:53

adarshan-intel closed this Oct 7, 2024

adarshan-intel deleted the adarsh/raw_syscall_fix branch October 7, 2024 08:25

adarshan-intel restored the adarsh/raw_syscall_fix branch October 7, 2024 08:27

adarshan-intel deleted the adarsh/raw_syscall_fix branch October 7, 2024 08:34

mkow mentioned this pull request Oct 7, 2024

[PAL] Do not describe RIP location on syscall instruction #2017

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015

[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015

adarshan-intel commented Oct 4, 2024 •

edited by dimakuv

Loading

mkow left a comment

jkr0103 left a comment

mkow left a comment

dimakuv left a comment

dimakuv left a comment

dimakuv left a comment

adarshan-intel commented Oct 7, 2024

[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015

[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015

Conversation

adarshan-intel commented Oct 4, 2024 • edited by dimakuv Loading

Description of the Changes

mkow left a comment

Choose a reason for hiding this comment

jkr0103 left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

adarshan-intel commented Oct 7, 2024

adarshan-intel commented Oct 4, 2024 •

edited by dimakuv

Loading