@mkroening mkroening commented Oct 15, 2025

This PR completely reworks and unifies our thread-local storage implementation:

  1. A memory leak on ARM was fixed.
  2. We now test TLS on all architectures (the test was mistakenly disabled in 711a85c).
  3. We now have only one unified Tls struct instead of three architecture-dependent TaskTLS structs. This should make future changes much easier.
  4. On x86-64 and RISC-V, we now have only one layer of allocation indirection for managing the structure. This was already the case on ARM.
  5. We now allocate and free the TLS structure with the correct alignment. Before, this was done incorrectly on all architectures, each in a different way. Closes #1677 (Large TLS alignments don't work).
  6. TLS allocations are now managed through a separate Allocation abstraction.
  7. We no longer allocate an (unused) dtv structure on ARM (not done on the other platforms anyway). We can still add proper dtv support once we have the need for multiple TLS data blocks per thread in the future.
  8. We can now add proper Hermit-specific TLS data for in-kernel use that is accessed via the thread pointer instead of going through several indirections via the core pointer as we do now. This will be done in a follow-up PR.
  9. The code should now be properly documented and cited.
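
To illustrate items 5 and 6, here is a minimal sketch of what an allocation abstraction that keeps allocation and deallocation alignment in sync could look like. It is not the code from this PR: the `Allocation` type below, its fields, and `new_zeroed` are hypothetical names chosen for illustration, and it uses `std::alloc` rather than the kernel's allocator. The idea is only that the `Layout` used to allocate is stored and reused on drop, so a block with a large alignment is freed with that same alignment.

```rust
use std::alloc::{self, Layout};
use std::ptr::NonNull;

/// Hypothetical owner of one raw, zero-initialized allocation.
/// (Illustrative only; not Hermit's actual `Allocation` type.)
pub struct Allocation {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl Allocation {
    /// Allocates `size` zeroed bytes aligned to `align`.
    ///
    /// Returns `None` if the size/alignment pair is invalid (for example,
    /// `align` is not a power of two), if `size` is zero, or if the
    /// allocator fails.
    pub fn new_zeroed(size: usize, align: usize) -> Option<Self> {
        let layout = Layout::from_size_align(size, align).ok()?;
        if layout.size() == 0 {
            return None;
        }
        // SAFETY: `layout` has a non-zero size, as checked above.
        let ptr = unsafe { alloc::alloc_zeroed(layout) };
        NonNull::new(ptr).map(|ptr| Self { ptr, layout })
    }

    /// Returns the start address of the block.
    pub fn as_ptr(&self) -> *mut u8 {
        self.ptr.as_ptr()
    }

    /// Returns the layout the block was allocated with.
    pub fn layout(&self) -> Layout {
        self.layout
    }
}

impl Drop for Allocation {
    fn drop(&mut self) {
        // SAFETY: `ptr` was returned by `alloc_zeroed` for exactly
        // `self.layout`, so size and alignment match on deallocation.
        unsafe { alloc::dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

With this sketch, `Allocation::new_zeroed(256, 4096)` would yield a block whose address is a multiple of 4096, and dropping the value frees it with the same 4096-byte alignment instead of a default one.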

@mkroening mkroening self-assigned this Oct 15, 2025
@github-actions github-actions bot left a comment

Benchmark Results

| Benchmark | Current: 87ffd83 | Previous: 376b966 | Performance Ratio |
|---|---|---|---|
| startup_benchmark Build Time | 113.87 s | 110.37 s | 1.03 |
| startup_benchmark File Size | 0.91 MB | 0.90 MB | 1.00 |
| Startup Time - 1 core | 0.91 s (±0.03 s) | 0.93 s (±0.03 s) | 0.99 |
| Startup Time - 2 cores | 0.93 s (±0.03 s) | 0.93 s (±0.03 s) | 0.99 |
| Startup Time - 4 cores | 0.91 s (±0.04 s) | 0.93 s (±0.03 s) | 0.98 |
| multithreaded_benchmark Build Time | 113.13 s | 114.11 s | 0.99 |
| multithreaded_benchmark File Size | 1.01 MB | 1.01 MB | 1.00 |
| Multithreaded Pi Efficiency - 2 Threads | 87.82 % (±9.31 %) | 87.54 % (±7.72 %) | 1.00 |
| Multithreaded Pi Efficiency - 4 Threads | 42.65 % (±3.09 %) | 43.85 % (±3.71 %) | 0.97 |
| Multithreaded Pi Efficiency - 8 Threads | 24.69 % (±1.84 %) | 25.11 % (±2.07 %) | 0.98 |
| micro_benchmarks Build Time | 106.88 s | 106.92 s | 1.00 |
| micro_benchmarks File Size | 1.01 MB | 1.01 MB | 1.00 |
| Scheduling time - 1 thread | 63.81 ticks (±4.16 ticks) | 67.03 ticks (±3.54 ticks) | 0.95 |
| Scheduling time - 2 threads | 35.64 ticks (±4.26 ticks) | 38.31 ticks (±5.49 ticks) | 0.93 |
| Micro - Time for syscall (getpid) | 3.33 ticks (±0.51 ticks) | 2.98 ticks (±0.45 ticks) | 1.12 |
| Memcpy speed - (built_in) block size 4096 | 79404.09 MByte/s (±54932.31 MByte/s) | 78287.67 MByte/s (±54207.29 MByte/s) | 1.01 |
| Memcpy speed - (built_in) block size 1048576 | 42840.65 MByte/s (±29650.14 MByte/s) | 42758.95 MByte/s (±29603.37 MByte/s) | 1.00 |
| Memcpy speed - (built_in) block size 16777216 | 28783.83 MByte/s (±23658.97 MByte/s) | 28796.76 MByte/s (±23654.74 MByte/s) | 1.00 |
| Memset speed - (built_in) block size 4096 | 79293.26 MByte/s (±54844.14 MByte/s) | 78337.85 MByte/s (±54246.52 MByte/s) | 1.01 |
| Memset speed - (built_in) block size 1048576 | 43067.81 MByte/s (±29806.83 MByte/s) | 43001.55 MByte/s (±29767.07 MByte/s) | 1.00 |
| Memset speed - (built_in) block size 16777216 | 29535.47 MByte/s (±24094.08 MByte/s) | 29553.13 MByte/s (±24095.41 MByte/s) | 1.00 |
| Memcpy speed - (rust) block size 4096 | 70444.41 MByte/s (±49435.58 MByte/s) | 70495.32 MByte/s (±49405.95 MByte/s) | 1.00 |
| Memcpy speed - (rust) block size 1048576 | 42628.03 MByte/s (±29535.56 MByte/s) | 42856.55 MByte/s (±29682.27 MByte/s) | 0.99 |
| Memcpy speed - (rust) block size 16777216 | 29035.32 MByte/s (±23864.77 MByte/s) | 28900.71 MByte/s (±23739.86 MByte/s) | 1.00 |
| Memset speed - (rust) block size 4096 | 71078.44 MByte/s (±49887.51 MByte/s) | 69943.60 MByte/s (±49128.48 MByte/s) | 1.02 |
| Memset speed - (rust) block size 1048576 | 42890.02 MByte/s (±29712.30 MByte/s) | 43095.53 MByte/s (±29843.91 MByte/s) | 1.00 |
| Memset speed - (rust) block size 16777216 | 29792.76 MByte/s (±24301.19 MByte/s) | 29647.15 MByte/s (±24171.90 MByte/s) | 1.00 |
| alloc_benchmarks Build Time | 104.78 s | 105.40 s | 0.99 |
| alloc_benchmarks File Size | 0.97 MB | 0.97 MB | 1.00 |
| Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Deallocation success | 69.98 % (±0.28 %) | 70.03 % (±0.30 %) | 1.00 |
| Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
| Allocations - Average Allocation time | 12284.97 Ticks (±215.57 Ticks) | 12298.03 Ticks (±230.53 Ticks) | 1.00 |
| Allocations - Average Allocation time (no fail) | 12284.97 Ticks (±215.57 Ticks) | 12298.03 Ticks (±230.53 Ticks) | 1.00 |
| Allocations - Average Deallocation time | 725.45 Ticks (±92.80 Ticks) | 722.17 Ticks (±103.75 Ticks) | 1.00 |
| mutex_benchmark Build Time | 104.50 s | 105.06 s | 0.99 |
| mutex_benchmark File Size | 1.01 MB | 1.02 MB | 1.00 |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 12.54 ns (±0.88 ns) | 12.74 ns (±0.74 ns) | 0.98 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 13.98 ns (±1.26 ns) | 14.80 ns (±1.47 ns) | 0.94 |

This comment was automatically generated by workflow using github-action-benchmark.

@mkroening mkroening changed the title fix(tls): use correct alignment fix(tls): rework TLS, use correct alignments Oct 15, 2025
@mkroening mkroening changed the title fix(tls): rework TLS, use correct alignments fix: rework TLS, use correct alignments Oct 15, 2025
@mkroening mkroening changed the title fix: rework TLS, use correct alignments fix: rework TLS, fix alignments Oct 16, 2025
@mkroening mkroening marked this pull request as ready for review October 16, 2025 10:27
@mkroening mkroening added this pull request to the merge queue Oct 20, 2025
Merged via the queue into main with commit 8e08755 Oct 20, 2025
17 checks passed
@mkroening mkroening deleted the align-tls branch October 20, 2025 08:30

Development

Successfully merging this pull request may close these issues.

Large TLS alignments don't work