
Improve BVH Broadphase Performance #5669

Merged (9 commits, Nov 24, 2023)

Conversation

sturnclaw
Member

This PR refactors our CollisionSpace broadphase collision acceleration structure to be significantly more efficient. With no intersections (the usual case for the root frame of a system), the broadphase queries run approximately 3.5x to 4x faster.

I no longer have the performance numbers for this branch due to how many intervening PRs I've needed to merge, but the broadphase cost wasn't a significant performance sink to start with (on the order of 0.3-0.5ms / frame in the Sol system) so don't expect a huge FPS boost - this is mostly attacking low-hanging fruit which would have been a greater problem further down the line.

I've not yet tried to apply this optimization to our narrow-phase BVHTree implementation - I have a different idea which involves using single-precision floats and splitting large Geoms as part of the import step to fit within the precision limits of a regular float-based AABB.

The performance gain from this PR is mostly due to completely removing the need for a std::list<Geom *> during iteration, reducing the number of allocations needed per frame (the array memory can be reused if the number of geoms isn't significantly increasing and the topology isn't changing), and cramming everything needed for a node into a single cache line.

The (rather abysmal) prior cache inefficiency was the largest factor responsible for the time cost involved in rebuilding and traversing the broadphase BVH, though once bodies start overlapping the narrow-phase collision tri-edge-edge-tri test still significantly dominates the runtime cost.

CC @JonBooth78 to have a gander at the physics code and hopefully improve your familiarity with that area of the codebase. 😄

- No sense in maintaining a static table of collision space pointers to index into
- Any cache miss during CollideFrames() is irrelevant compared to cost of collisions

auto IsIn(Vec3 point) const { return point >= min && point <= max; }
auto Intersects(const AABB &rhs) const { return min < rhs.max && max > rhs.min; }
};
Contributor

This type confuses me a little. We have some things using min()/max() methods that take two vectors (inside IntersectsRay) and other things doing component-wise min and max.

Then our Invalid method seems to have determined that we're using doubles as the component type for our vectors, and yet the 0.f inside our IntersectsRay method suggests floats.

Is the intention to provide a SIMD-optimized variant of Vector3 that has correct alignment constraints and operator overload functionality?

Member Author

The intention for this specific class is to provide a templated, optimized version of the existing Axis-Aligned-Bounding-Box implementation (Aabb) which can operate semi-generically on double-precision, single-precision, or packed SIMD vectors, by changing the Vec3 type provided. E.g. AABB<vector3<xsimd::batch<float, AVX2>>> (not exact code) would be an AABB storing eight single-precision float elements "per component" with no extra code compared to the scalar version - it's the responsibility of the calling code to unpack individual elements back to scalar datatypes if necessary.

Then our Invalid method seems to have determined that we're using doubles as our component type for our vectors and yet 0.f inside our IntersectsRay method suggests floats.

This is a copy-paste error - it should be Number(0) instead.

We have some things using a min() max() method that takes two vectors (inside IntersectsRay) and other things doing component wise min and max.

The min/max on vectors is just a nicer interface compared to manually writing std::min(a.x, b.x) etc. everywhere the operation is performed. It uses the same semantics as packed SIMD min/max operations, so it operates per-element and per-component. The component-wise min/max you're seeing in IntersectsRay is a "horizontal min/max" (taking the minimum of all elements in a single vector) and that's the most optimized way to express it.

...that being said, I see now what you're talking about - I forgot to update the other component-wise calls when I added the min/max overload for our vector3 type. I also forgot to make the min/max overloads generic enough to use with a vectorized element_type. Will fix.

Contributor

Ahh, so there is a min/max that returns a scalar of the min/max of a single vector's components, and then component-wise min/max too.

I believe most SIMD implementations should accelerate at least the component-wise ones into single instructions; though if we write them out component by component, I'm not sure the compiler will figure that out for us.


// Remove the (normalized) SAH cost of the root node from the result
// The SAH metric doesn't include the cost of the root node
outSAH -= 1.f;
Contributor

Similar oddness around float/double dichotomy here as above.

I also find it a little odd that our keys and pivots are floating-point vectors when we're dealing in double space too, although it's probably just fine.

Member Author

The keys/pivots are actually completely intended to be in single-precision compared to the actual datatypes - SingleBVHTree::Build is designed to try to squeeze out as much precision as possible from the mantissa by normalizing all coordinates over the total bounding volume.

This decision was made because the precision of the sort key/pivot doesn't affect the correctness of computing intersections in the final tree (that's all done in double-precision), but the cache efficiency of the sort step is very important for tree re-build performance as it's an O(n log n) traversal of the whole list of entries in the tree.

Contributor

Yeah, I had a feeling this was totally deliberate, interesting.

Ahh yes, the whole "optimize the precision of the mantissa" trick - it's been ages since I thought about that. I kind of assumed that for doubles it wouldn't matter so much. (The last time I remember thinking about this was when doing 7e3 math for color channels for XBox360 HDR.)

@JonBooth78
Contributor

Thanks @Web-eWorks, interesting; happy to take a look. Overall it looks very sensible. There are a few odd places where we do float/double conversions (there were one or two more in CollisionSpace.cpp that I'm not sure about, in terms of correctness or whether it matters).

I can't see the problem I've hit before when predominantly working in float space and using double literals: e.g. float x = y * 2.0, where y is also a float, will actually promote y to a double, multiply it by 2, and then cast back to a float, massively slower than float x = y * 2.f, which is probably what was intended. Or at least this is how compilers used to behave.

@sturnclaw
Member Author

I can't see the problem I've hit before when predominantly working in float space and using double literals: e.g. float x = y * 2.0, where y is also a float, will actually promote y to a double, multiply it by 2, and then cast back to a float, massively slower than float x = y * 2.f, which is probably what was intended. Or at least this is how compilers used to behave.

I'm pretty sure this is how compilers still behave - automatic promotion to larger datatypes is still a thing. I believe the compiler will promote a 2.f to a double at compile time when it's used with other double-precision datatypes, so there's no speed loss there; but in single-precision code I believe "correctness" requires that speed loss. Will keep a sharper eye out for those kinds of issues, thanks!

- Provide component-all-of >/</>=/<= operators for vector3
- Add higher-performance Aabb update/construction operations
- Add overload for std::min / std::max in vector3's namespace
- Drops the (unneeded) radius element of the old AABB implementation
- Templated to support 2x double / 4x float SIMD element types for additional optimization
- Broadphase queries need fast access to position data in the Geom object
- Previous data layout needed to load 3x cache lines to get to the position data
- Broadphase query performance is still an issue when Geoms are scattered all over the heap
- Replaces BvhTree for broadphase physics performance
- 3-4.5x speedup due to significant improvement in cache occupancy
- Can trivially multithread new broadphase intersection code
- Add new TraceRay backend
- Remove all usage of std::list
- Significantly improves broadphase performance compared to old implementation
Sadly, we can't have nice things that C99 has had for 24 years.
@JonBooth78
Contributor

I once saw more than an extra 5% framerate (on, I think, a PS2) as a result of sprinkling around some f suffixes to make literals single-precision floats and avoid double math like that. As I recall there were a few right in the tight loops of the graphics and collision culling (so very similar areas to this). So I'm now super pedantic about it :-D

@sturnclaw sturnclaw merged commit f9de055 into pioneerspacesim:master Nov 24, 2023
@sturnclaw sturnclaw deleted the bvh-perf branch November 25, 2023 05:14