[Strace] gVisor taking extremely longer time with LightGBM training

### Description

Hi team,

We have recently found an interesting issue with gVisor. Following is the Python script we ran:

```
import lightgbm as lgb
import sys
from numpy.random import seed
from numpy.random import randint

def lightgbm_method(num_jobs):
  count = 1000
  seed(1)
  data = []
  for _ in range(count):
    data.append(randint(0, 100, 5))
  labels = randint(0, 100, count)
  clf = lgb.LGBMClassifier(n_jobs=num_jobs)
  clf.fit(data, labels)
  return 0

lightgbm_method(int(sys.argv[1]))
```

For some reason the runtime largely depends on the `num_jobs` passed to the function. Following is the runtime of this script with the `num_jobs` passed:

Running on node with 8 physical cores (c6gd.2xlarge):
| num_jobs | Native Kernel (seconds) | gVisor on systrap (seconds) | gVisor on ptrace (seconds) |
| ------------- | ------------- | ------------- | ------------- |
| 1  | 4.42  | 9.50 | 44.09 |
| 2  | 4.16  | 13.14 | 68.04 |
| 4  | 4.07  | 12.82 | 60.33 |
| 7  | 4.71  | 15.28 | 51.40 |
| 8  | 5.62  | **361.35** | 56.54 |
| 9  | 31.25  | 35.37 | 164.59 |
| 10  | 34.31  | 34.99 | 178.73 |

Running on node with 16 physical cores (r7gd.4xlarge):
| num_jobs | Native Kernel (seconds) | gVisor on systrap (seconds) | gVisor on ptrace (seconds) |
| ------------- | ------------- | ------------- | ------------- |
| 1  | 3.59  | 8.42 | 33.54 |
| 2  | 3.49  | 11.84 | 51.37 |
| 4  | 3.34  | 11.60 | 48.88 |
| 8  | 4.66 | 13.58 | 49.61 |
| 15  | 26.38  | 189.51 | 170.27 |
| 16  | 75.84  | 272.24 | 220.72 |
| 17  | 76.74  | 67.99 | 248.44 |


Above numbers are very consistent in our environment.

### Observations
1. There seems to be a pattern that with this job, it would take significantly longer time (up to 70 times longer!) to finish when `num_jobs` is set to equal to the number of physical cores on host.
2. When num_jobs is not passed, `lgb.LGBMClassifier` takes in default value to be same as physical cores. This makes the worst case to be the default case
3. However, when setting `OMP_THREAD_LIMIT` env variable to 1, even num_jobs is equal to physical cores, the job takes very fast to complete.
4. With `ptrace` platform, it takes longer to complete in general. However, when `num_jobs` is close to physical cores, `ptrace` actually surpasses `systrap`. This might indicate some issues in `systrap`
5. There is a known issue on lightgbm with OpenMP that [multi-threading with lightgbm could be hanging](https://lightgbm.readthedocs.io/en/latest/FAQ.html#lightgbm-hangs-when-multithreading-openmp-and-using-forking-in-linux-at-the-same-time). We followed the step to set the `num_threads=1`, and the issue no longer exists. But it is still not clear if the performance degradation is caused by this issue, as we do not observe same level of degradation with native kernel.

Could you please help us understand the degradation we are seeing here, especially the case with # of physical cores is 8 and `num_jobs` is also set to 8? Why would gVisor suddenly takes ~70 times slower than native kernel?  

### Steps to reproduce

Python script to reproduce:

```
import lightgbm as lgb
import sys
from numpy.random import seed
from numpy.random import randint

def lightgbm_method(num_jobs):
  count = 1000
  seed(1)
  data = []
  for _ in range(count):
    data.append(randint(0, 100, 5))
  labels = randint(0, 100, count)
  clf = lgb.LGBMClassifier(n_jobs=num_jobs)
  clf.fit(data, labels)
  return 0

lightgbm_method(int(sys.argv[1]))
```

### runsc version

```shell
runsc version release-20241217.0-40-gfe855beceea5-dirty
spec: 1.1.0-rc.1
```

### docker version (if using docker)

```shell

```

### uname

Linux ws-uswest2-2-e20c 5.10.215-203.850.amzn2.aarch64 #1 SMP Tue Apr 23 20:32:21 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

### kubectl (if using Kubernetes)

```shell

```

### repo state (if built from source)

_No response_

### runsc debug logs (if available)

```shell

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Strace] gVisor taking extremely longer time with LightGBM training #11431

Description

Observations

Steps to reproduce

runsc version

docker version (if using docker)

uname

kubectl (if using Kubernetes)

repo state (if built from source)

runsc debug logs (if available)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

num_jobs	Native Kernel (seconds)	gVisor on systrap (seconds)	gVisor on ptrace (seconds)
1	4.42	9.50	44.09
2	4.16	13.14	68.04
4	4.07	12.82	60.33
7	4.71	15.28	51.40
8	5.62	361.35	56.54
9	31.25	35.37	164.59
10	34.31	34.99	178.73

num_jobs	Native Kernel (seconds)	gVisor on systrap (seconds)	gVisor on ptrace (seconds)
1	3.59	8.42	33.54
2	3.49	11.84	51.37
4	3.34	11.60	48.88
8	4.66	13.58	49.61
15	26.38	189.51	170.27
16	75.84	272.24	220.72
17	76.74	67.99	248.44

[Strace] gVisor taking extremely longer time with LightGBM training #11431

Description

Description

Observations

Steps to reproduce

runsc version

docker version (if using docker)

uname

kubectl (if using Kubernetes)

repo state (if built from source)

runsc debug logs (if available)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions