Potential perf improvements #26


Open
matklad opened this issue Nov 19, 2021 · 4 comments

Comments

@matklad
Contributor

matklad commented Nov 19, 2021

  1. Look at what APIs jemalloc itself has for introspection
  2. Replace `pub static TID: RefCell<usize> = RefCell::new(0);` and the like with `pub static TID: Cell<usize> = Cell::new(0);`
  3. Replace with const thread locals once it is stable Tracking Issue for const-initialized thread locals rust-lang/rust#84223
  4. Replace with thread::current().id().as_u64() once that is stable
  5. In general, minimize usage of thread-locals, they are quite slow in today's Rust
  6. Spread out MEM_SIZE array more to avoid false sharing. Threads get sequential IDs, so may hammer the same cache line during allocation
  7. Replace SeqCst with relaxed, I think that should be enough for our use case.
  8. Remove TOTAL_MEMORY_USAGE as it's the sum of per-thread usages.
  9. `rand::thread_rng().gen_range(0, 100)` feels like something optimizable: thread_rng is crypto secure I think (or at least stronger than xorshift).
  10. I think https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html uses "counting memory usage by threads" as one of its running examples, so it might be worth reading (warning: deep rabbit hole).
  11. https://blog.mozilla.org/nnethercote/category/memory-allocation/ might also be an interesting read
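A minimal sketch of points 2 and 3 combined, assuming `TID` is only ever read and written as a whole value (names here follow the issue; the `const { ... }` initializer is now stable, having landed in Rust 1.59):

```rust
use std::cell::Cell;

thread_local! {
    // `Cell` avoids the borrow-flag bookkeeping `RefCell` performs on every
    // access; const initialization lets the compiler place the slot directly
    // in the thread's TLS block instead of going through lazy init checks.
    pub static TID: Cell<usize> = const { Cell::new(0) };
}

fn main() {
    TID.with(|tid| tid.set(42));
    assert_eq!(TID.with(|tid| tid.get()), 42);
}
```

This only works when no caller relies on holding a long-lived borrow of the value; `Cell` forces copy-in/copy-out access.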
@pmnoxx
Contributor

pmnoxx commented Dec 4, 2021

@matklad Thanks. I added a benchmark, which shows performance degradation: #27

@pmnoxx pmnoxx self-assigned this Dec 5, 2021
@pmnoxx
Contributor

pmnoxx commented Dec 16, 2021

7. Remove TOTAL_MEMORY_USAGE as it's the sum of per-thread usages.

@matklad I did some benchmarking; it saves around 0.4-0.6ns per run.
#52

Though this benchmark is single-threaded, this variable could affect multi-threaded applications more under high contention.
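A sketch of the idea behind dropping the global total, assuming per-thread counters indexed by sequential thread IDs (the array name and size here are hypothetical, not from the actual code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_THREADS: usize = 64;
const ZERO: AtomicUsize = AtomicUsize::new(0);

// One counter per thread; the global total is derived on demand instead of
// being maintained as a separate TOTAL_MEMORY_USAGE atomic, which removes
// one contended write from every allocation.
static MEM_USAGE: [AtomicUsize; MAX_THREADS] = [ZERO; MAX_THREADS];

fn add_usage(tid: usize, bytes: usize) {
    MEM_USAGE[tid].fetch_add(bytes, Ordering::Relaxed);
}

fn total_usage() -> usize {
    MEM_USAGE.iter().map(|c| c.load(Ordering::Relaxed)).sum()
}

fn main() {
    add_usage(0, 100);
    add_usage(1, 50);
    assert_eq!(total_usage(), 150);
}
```

The trade-off is that reading the total becomes O(threads) instead of O(1), which is usually fine since totals are read far less often than they are updated.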

@pmnoxx
Contributor

pmnoxx commented Dec 16, 2021

8. `rand::thread_rng().gen_range(0, 100)` feels like something optimizable: thread_rng is crypto secure I think (or at least stronger than xorshift).

Yes, that's a big issue. I reduced time from 38ns to 12ns, by removing usage of rand. #53
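One way to do this, sketched here as an assumption rather than what #53 actually implements: a per-thread xorshift64 state is enough for a sampling decision, which needs rough uniformity but no cryptographic strength.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread xorshift64 state; any nonzero seed works.
    static RNG_STATE: Cell<u64> = Cell::new(0x9E37_79B9_7F4A_7C15);
}

// Classic xorshift64 step: three shift-xor rounds per output.
fn xorshift64() -> u64 {
    RNG_STATE.with(|s| {
        let mut x = s.get();
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        s.set(x);
        x
    })
}

// Drop-in for rand::thread_rng().gen_range(0, 100); the slight modulo bias
// is irrelevant for a sampling-rate check.
fn sample_percent() -> u64 {
    xorshift64() % 100
}

fn main() {
    for _ in 0..1_000 {
        assert!(sample_percent() < 100);
    }
}
```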

@pmnoxx
Contributor

pmnoxx commented Dec 16, 2021

6. Replace SeqCst with relaxed, I think that should be enough for our use case.

I measured; there is no performance difference between using ::Relaxed and ::SeqCst.
See https://stackoverflow.com/questions/53805142/why-memory-order-relaxed-performance-is-the-same-as-memory-order-seq-cst/53805377
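For a standalone statistics counter, `Relaxed` is still correct regardless of whether it is faster: the increment needs atomicity but no ordering with respect to other memory operations. A minimal sketch:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    // Relaxed fetch_add is still an atomic read-modify-write, so no
    // increments are lost even with no ordering guarantees.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1_000 {
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4_000);
}
```

Whether the relaxation shows up in benchmarks is target-dependent; on x86, atomic loads compile to the same instruction under both orderings, which matches the Stack Overflow answer linked above.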

@pmnoxx pmnoxx removed their assignment Mar 7, 2022