If you can program it slowly, you can program it quickly!
- Algorithms for Modern Hardware: A guide to writing algorithms optimized for modern computers.
- Mike Acton - Data-oriented design and C++ (CppCon 2014): Classic talk explaining how popular software development methodologies lead to software that performs poorly.
- Andrew Kelley - Practical DOD: How to implement some of the insights from Mike Acton's talk.
- Why (Most) Sampling Java Profilers Are Terrible: Explains why many popular Java profilers suffer from safepoint bias.
- async-profiler Java profiler: A Java profiler that doesn't suffer from safepoint bias.
- Carl Cook - When a Microsecond Is an Eternity: High Performance Trading Systems in C++ (CppCon 2017): Famous CppCon talk on high-frequency trading (HFT).
- The Arm Manga Guide to the Mali GPU
- Refterm: Lectures on implementing a performant terminal. Worth a watch even if you aren't interested in implementing a terminal.
- Static search trees: 40x faster than binary search: Implementing an optimized static search tree.
- David Gross - Trading at light speed: designing low latency systems in C++ (Meeting C++ 2022): A more recent talk on C++ systems optimization.
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog: A very in-depth article about optimizing a matrix multiplication operation in CUDA.
- The NumPy array: a structure for efficient numerical computation: A paper detailing how the NumPy array works under the hood for efficient computation.