Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function #2015

Conversation

adarshan-intel
Copy link
Contributor

@adarshan-intel adarshan-intel commented Oct 4, 2024

Description of the Changes

PR #1991 exposes a segmentation fault with the use of pal_describe_location as in below error log:

(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))

Using the above addresses we could find the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue did not reproduced with GDB even after multiple tries.

Issue is consistent with Pytorch workload using curated apps (with debug build) without GDB, on a Azure VM, rarely on our lab servers.

Also tested by increasing the ENCLAVE_SIG_STACK_SIZE and ALT_STACK_SIZE size to 256KB but that also did not help.

Likely some race in Gramine when using pal_describe_location which we are not hitting with GDB as GDB makes the execution slow.

We want to remove this function call in coming release and analyze this issue after the release.

Fixes gramineproject/contrib#87


This change is Reviewable

@adarshan-intel adarshan-intel changed the title [PAL/{Linux,Linux-SGX}] Removing buffer and describe location functions [PAL/{Linux,Linux-SGX}] Removing buffer and pal_describe_location function Oct 4, 2024
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel)

a discussion (no related file):
Why does this fix the issue? What was the reason for the segfault?


Copy link
Contributor

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @mkow)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

Why does this fix the issue? What was the reason for the segfault?

Below is the error log with SIGSEGV:

(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))

Using the above addresses we could get the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue is not reproducing with GDB, tried multiple times but always reproduces without GDB, seems like some race in Gramine when using pal_describe_location only with specific configuration.

Hence we plan to remove this function call in coming release and analyze this issue after the release.


Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @jkr0103)

a discussion (no related file):

Previously, jkr0103 (Jitender Kumar) wrote…

Below is the error log with SIGSEGV:

(host_exception.c:141:handle_sync_signal) error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL (libc-2.31.so+0x22941 (addr = 0x7f62e5f98941))

Using the above addresses we could get the functions abort->hlt from objdump of libc.

We tried to debug the issue using GDB to figure out the cause of abort call inside libc but issue is not reproducing with GDB, tried multiple times but always reproduces without GDB, seems like some race in Gramine when using pal_describe_location only with specific configuration.

Hence we plan to remove this function call in coming release and analyze this issue after the release.

Please debug this, it may be something serious. This PR sounds like randomly changing code until it stops crashing...


Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @jkr0103)


-- commits line 2 at r1:
This should be like this:

[PAL/{Linux,Linux-SGX}] Do not try to describe RIP location on syscall instruction

Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls")
added a debug print for every encountered raw syscall instruction. Each print
describes what system call number is invoked and at which address in the binary.
The address is fed to the helper function `pal_describe_location(uc->rip)` which
tries to find the binary + function name and put them in human-readable form
into the provided buffer.

For some reason, the snippet used in that commit (allocating a 128-byte buffer
on signal-handling stack and calling `pal_describe_location()`) leads to a
non-deterministic memory corruption on some workloads. That bug is hardly
reproducible, and it is not fixed by e.g. increasing the signal stack size inside
Gramine. The `pal_describe_location()` control path also seems correct.

As the true root cause for this bug is not yet found, this commit introduces
a workaround: temporarily removing the stack-allocated buffer and the
correspoding `pal_describe_location()` call, and instead printing the raw
RIP value.

Signed-off-by: Adarsh Anand <[email protected]>

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @mkow)

a discussion (no related file):

Please debug this, it may be something serious. This PR sounds like randomly changing code until it stops crashing...

We tried to debug this, but failed. The issue was (for some reason) reproducible only in a particular Ubuntu 20.04 Docker container on a particular machine.

The issue seems to be a memory corruption, as it is not deterministic. In fact, the message that Jitender and Anjali put here ( error: Unexpected segmentation fault (SIGSEGV) occurred inside untrusted PAL) seems to me misleading, as this error is just a manifestation of Glibc's abort() function that performs a HLT instruction -- and Gramine is not expecting the untrusted runtime's Glibc to execute such HLT instructions.

Anyway, experimenting with different stack sizes didn't help at all. This was my first guess: a stack buffer overflow, since we allocate a 128-byte buffer on a signal-handling stack. However, this is not the root cause.

Next theory was (and still is) that there is some incorrect locking/synchronization inside pal_describe_location(). But inspecting that function, I don't see anything wrong -- all sensitive operations are protected by locks. It does feel though like iterating through a GDB-specific list of loaded binaries has a bug somewhere, but I couldn't pinpoint where. This would explain both requirements for this bug: (1) the bug manifests only in debug build of Gramine, and (2) the bug is a memory corruption that seems to happen in the untrusted PAL.

This can probably be debugged in 2-3 days by a senior dev, but I am personally not taking any such requests anymore. And the time for the new Gramine release is approaching, so my recommendation would be to merge this PR, and properly debug later.


Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), one-liner is not in imperative mood (waiting on @adarshan-intel and @mkow)


pal/src/host/linux/pal_exception.c line 94 at r1 (raw file):

                       " patching your application to use Gramine syscall API.");
        }
        log_trace("Emulating raw syscall instruction with number %d at address %p", info->si_syscall,

Please add above this line a comment so that we don't forget in the future:

/* FIXME: better to use `pal_describe_location()` (see example below), but it leads to a
 *        non-deterministic bug on some workloads */

pal/src/host/linux-sgx/pal_exception.c line 235 at r1 (raw file):

                       " patching your application to use Gramine syscall API.");
        }
        log_trace("Emulating raw syscall instruction with number %lu at address %p",

ditto (addd comment)

@adarshan-intel adarshan-intel force-pushed the adarsh/raw_syscall_fix branch from b239fbb to 671d944 Compare October 7, 2024 07:41
…l instruction

Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls")
added a debug print for every encountered raw syscall instruction. Each print
describes what system call number is invoked and at which address in the binary.
The address is fed to the helper function `pal_describe_location(uc->rip)` which
tries to find the binary + function name and put them in human-readable form
into the provided buffer.

For some reason, the snippet used in that commit (allocating a 128-byte buffer
on signal-handling stack and calling `pal_describe_location()`) leads to a
non-deterministic memory corruption on some workloads. That bug is hardly
reproducible, and it is not fixed by e.g. increasing the signal stack size inside
Gramine. The `pal_describe_location()` control path also seems correct.

As the true root cause for this bug is not yet found, this commit introduces
a workaround: temporarily removing the stack-allocated buffer and the
correspoding `pal_describe_location()` call, and instead printing the raw
RIP value.

Signed-off-by: Adarsh Anand <[email protected]>
@adarshan-intel adarshan-intel force-pushed the adarsh/raw_syscall_fix branch from 105ff99 to 522b2b3 Compare October 7, 2024 07:53
…l instruction

Commit b6a2d79 ("[PAL/{Linux,Linux-SGX}] Add trace log for raw syscalls")
added a debug print for every encountered raw syscall instruction. Each print
describes what system call number is invoked and at which address in the binary.
The address is fed to the helper function `pal_describe_location(uc->rip)` which
tries to find the binary + function name and put them in human-readable form
into the provided buffer.

For some reason, the snippet used in that commit (allocating a 128-byte buffer
on signal-handling stack and calling `pal_describe_location()`) leads to a
non-deterministic memory corruption on some workloads. That bug is hardly
reproducible, and it is not fixed by e.g. increasing the signal stack size inside
Gramine. The `pal_describe_location()` control path also seems correct.

As the true root cause for this bug is not yet found, this commit introduces
a workaround: temporarily removing the stack-allocated buffer and the
correspoding `pal_describe_location()` call, and instead printing the raw
RIP value.

Signed-off-by: Adarsh Anand <[email protected]>
@adarshan-intel adarshan-intel deleted the adarsh/raw_syscall_fix branch October 7, 2024 08:25
@adarshan-intel adarshan-intel restored the adarsh/raw_syscall_fix branch October 7, 2024 08:27
@adarshan-intel
Copy link
Contributor Author

This PR has been moved to #2017

@adarshan-intel adarshan-intel deleted the adarsh/raw_syscall_fix branch October 7, 2024 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pytorch with debug flag is getting failed with latest gramine and gsc master with Segmentation fault
4 participants