Implement batched query support #6161

Closed

238 changes: 238 additions & 0 deletions benches/benches/bevy_ecs/iteration/batched_compute.rs
@@ -0,0 +1,238 @@
use bevy_ecs::prelude::*;
use core::arch::x86_64::*;
use glam::*;
use rand::prelude::*;

use criterion::BenchmarkId;
use criterion::Criterion;

#[derive(Component, Copy, Clone, Default)]
struct Position(Vec3);

#[derive(Component, Copy, Clone, Default)]
#[repr(transparent)]
struct Health(f32);

// A hyperplane describing solid geometry: a normal n = (x, y, z) and a scalar d such that n·x + d = 0
#[derive(Component, Copy, Clone, Default)]
struct Wall(Vec3, f32);

struct Benchmark(World);

fn rnd_vec3(rng: &mut ThreadRng) -> Vec3 {
let x1 = rng.gen_range(-16.0..=16.0);
let x2 = rng.gen_range(-16.0..=16.0);
let x3 = rng.gen_range(-16.0..=16.0);

Vec3::new(x1, x2, x3)
}

fn rnd_wall(rng: &mut ThreadRng) -> Wall {
let d = rng.gen_range(-16.0..=16.0);

Wall(rnd_vec3(rng).normalize_or_zero(), d)
}

// AoS to SoA data layout conversion for x86 AVX.
// This code has been adapted from:
// https://www.intel.com/content/dam/develop/external/us/en/documents/normvec-181650.pdf
#[inline(always)]
// This example is written in a way that benefits from inlined data layout conversion.
fn aos_to_soa_83(aos_inner: &[Vec3; 8]) -> [__m256; 3] {
unsafe {
//# SAFETY: Vec3 is repr(C) for x86_64
let mx0 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(0));
let mx1 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(4));
let mx2 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(8));
let mx3 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(12));
let mx4 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(16));
let mx5 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(20));
Comment on lines +43 to +49

If this is glam::Vec3, this is unsound and repr(C) is not enough. Vec3, as opposed to Vec3A, has padding bytes. That is, its layout is:

f32 f32 f32 ?
init init init poison

so this is filling the 4th lane with data that should not be read. Vec3A is defined as __m128 already. You must use a full 16-byte type with zero padding, consistently.
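
One way to settle what the layouts actually are on a given target is to inspect them directly (a small standalone sketch, not part of the original thread, assuming glam as a dependency):

    use core::mem::{align_of, size_of};
    use glam::{Vec3, Vec3A};

    fn main() {
        // Print the size and alignment of both vector types so the padding
        // question can be checked empirically for the target being compiled.
        println!("Vec3:  size {}, align {}", size_of::<Vec3>(), align_of::<Vec3>());
        println!("Vec3A: size {}, align {}", size_of::<Vec3A>(), align_of::<Vec3A>());
    }

The output settles whether the fourth lane read by the loads above belongs to the Vec3 or to whatever follows it in memory.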

Author

Thanks for the review! Indeed, you're right: the C (and C++) standards don't provide any guarantee against arbitrary padding being added between elements or at the end of a struct, even when it isn't necessary to guarantee stride alignment in sequential array accesses. It's perfectly legitimate for a conforming C compiler to lay glam::Vec3 out in exactly the way you have suggested, including using a trap representation for the padding at the end of the struct. Even if glam::Vec3 were defined in terms of a single array [f32; 3], you still can't escape padding at the end as you have indicated, unless you made it repr(transparent) anyway.

In C++, the way we'd normally deal with this is with a static assert for sizeof(Vec3) == 3*sizeof(float). It's a hack, but it works to avoid this problem. Indeed, on all major x86 ABIs, this holds. Since static asserts don't really exist in Rust, I'm not sure how to accomplish something equivalent there in this example.

I'd like to mention, though, that real x86 C (and C++) compilers would never, ever add padding for the sake of adding it. Padding is something done on x86 to ensure alignment, and the alignment of Vec3 is not required to be anything higher than the alignment of f32. It could be, and that would be legal in a theoretical sense for C, but no implementation out there -- on x86 -- would ever do it. For one thing, that would contradict the psABI and break compatibility with other code on x86 systems. However, yes, one could construct a toy C compiler that would do just that, say to anticipate vectorization later on -- although I expect such a transform would be a pessimization a lot more often than not.

So, laying out Vec3 in 12 bytes with the same alignment as f32 meets the alignment requirements just fine -- and this is what real implementations -- on x86 -- do. Not that this excuses anything, mind you, because I agree: the code should be correct in a theoretical sense, or protected by a compile-time guard that wards off the UB parts (as one would do in C++).

With all this said, as someone who isn't especially up to date on how the current Rust landscape deals with these things, what's the standard practice here? In C++ I'd stick a static assert on it and be done with it. How do we make this work in Rust?

@workingjubilee Mar 8, 2023

Hmm, it was very late at night when I scanned through things here, and I both mislaid some mental math while reading this and flubbed what I meant to say. My apologies. Let me correct myself: the safety documentation here needs to explicate the actual safety precondition (that the layouts match, 8 * 3 / (6 * 4) == 1, and that these are unaligned loads) instead of a handwave about repr(C). If (n * x) / (m * y) == 1 doesn't hold, then you have the possibility of reading the poisoned value.

You needn't particularly worry about toy C compilers, and I wasn't trying to language lawyer you on those. Rust picks specific C compilers to be compatible with (though it gets more vague if e.g. two compilers emit "the same" ABI but actually disagree on fine details). Thus when targeting x86_64-pc-windows-msvc, it obviously emits code compatible with the Windows x64 ABI as emitted by MSVC, and so on down the line. There are problems that arise when e.g. gcc and clang disagree on the ABI.

You can generate a "static assert" in Rust using this pattern:

const _: () = {
    let precondition = {
        // some code that can run in const eval
        true // placeholder so the sketch compiles
    };
    if !precondition {
        panic!("a nicely-written message")
    };
};
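
A concrete instance of that pattern for the precondition discussed above might look like the following (a sketch, not code from this PR, assuming glam::Vec3 and the six-load layout of aos_to_soa_83):

    use core::mem::size_of;
    use glam::Vec3;

    // Compile-time guard: eight Vec3s must span exactly the 24 f32s (8 * 3 == 6 * 4)
    // that the six unaligned 128-bit loads in `aos_to_soa_83` read, i.e. Vec3 must
    // carry no tail padding. If that ever changes, this fails to compile rather than
    // letting the last load read one float past the array.
    const _: () = {
        if size_of::<[Vec3; 8]>() != 24 * size_of::<f32>() {
            panic!("expected glam::Vec3 to be three packed f32s with no padding");
        }
    };

A failure here is a compile error rather than a runtime panic, which is the property being asked for.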

Author

No problem, and thanks for the feedback. I wrote a good amount of this PR myself during some late nights 😅 It sounds like we're converging on a solution. Would it be acceptable to document the expected layout of Vec3 on x86 in the example, and use that pattern to statically assert it holds, for completeness? Maybe with an additional comment about how a repr(transparent) type would be preferable for code that needs to be portable?

@workingjubilee Mar 8, 2023

Discussing repr(transparent) isn't needed (it's a bit of a red herring), but otherwise yes, that sounds fine. My real concern is just about someone retouching this code later without the deep and horrifying knowledge we have of the details of the x86-64 ABI, what __m128 even is, and C layout, in a way that changes the numbers and causes the last _mm_loadu_ps to read one more float than actually exists.

Author

Sorry, just for my own understanding, can you clarify your statement about repr(transparent)? My understanding is that if Vec3 were defined as a repr(transparent) struct with a single [f32; 3] member, all padding issues would go away. Indeed, this is similar to the advice you'd see in the C++ space. Is there something I'm not seeing there?

Unfortunately I can't provide any input on the concerns about maintainability -- I think this kind of thing just comes with the gamedev (and simulation) territory. At least this is user-provided code, not library code in Bevy, so it's not like we have ECS internals doing this kind of thing.
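
For reference, the padding-free wrapper described here would look roughly like this (a hypothetical sketch, not part of the PR); repr(transparent) guarantees the wrapper has exactly the layout of [f32; 3], which has no padding:

    // Hypothetical newtype with the layout of [f32; 3]: size 12, align 4, no padding.
    #[repr(transparent)]
    #[derive(Copy, Clone)]
    struct PackedVec3([f32; 3]);

    // Holds by definition of repr(transparent) over a padding-free array.
    const _: () = assert!(core::mem::size_of::<PackedVec3>() == 3 * core::mem::size_of::<f32>());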


let mut m03 = _mm256_castps128_ps256(mx0); // load lower halves
let mut m14 = _mm256_castps128_ps256(mx1);
let mut m25 = _mm256_castps128_ps256(mx2);
m03 = _mm256_insertf128_ps(m03, mx3, 1); // load upper halves
m14 = _mm256_insertf128_ps(m14, mx4, 1);
m25 = _mm256_insertf128_ps(m25, mx5, 1);

let xy = _mm256_shuffle_ps::<0b10011110>(m14, m25); // upper x's and y's
let yz = _mm256_shuffle_ps::<0b01001001>(m03, m14); // lower y's and z's
let x = _mm256_shuffle_ps::<0b10001100>(m03, xy);
let y = _mm256_shuffle_ps::<0b11011000>(yz, xy);
let z = _mm256_shuffle_ps::<0b11001101>(yz, m25);
[x, y, z]
}
}

impl Benchmark {
fn new(size: i32) -> Benchmark {
let mut world = World::new();

let mut rng = rand::thread_rng();

world.spawn_batch((0..size).map(|_| (Position(rnd_vec3(&mut rng)), Health(100.0))));
world.spawn_batch((0..(2_i32.pow(12) - 1)).map(|_| (rnd_wall(&mut rng))));

Self(world)
}

fn scalar(mut pos_healths: Query<(&Position, &mut Health)>, walls: Query<&Wall>) {
pos_healths.for_each_mut(|(position, mut health)| {
// This forms the scalar path: it behaves just like `for_each_mut`.

// Optional: disable change detection for more performance.
let health = &mut health.bypass_change_detection().0;

// Test each (Position,Health) against each Wall.
walls.for_each(|wall| {
let plane = wall.0;

// Test which side of the wall we are on
let dotproj = plane.dot(position.0);

// Test against the Wall's displacement/discriminant value
if dotproj < wall.1 {
//Ouch! Take damage!
*health -= 1.0;
}
});
});
}

// Perform collision detection against a set of Walls, forming a convex polygon.
// Each entity has a Position and some Health (initialized to 100.0).
// If the position of an entity is found to be outside of a Wall, decrement its "health" by 1.0.
// The effect is cumulative based on the number of walls.
// An entity entirely inside the convex polygon will have its health remain unchanged.
fn batched_avx(mut pos_healths: Query<(&Position, &mut Health)>, walls: Query<&Wall>) {
// Conceptually, this system is executed using two loops: the outer "batched" loop receiving
// batches of 8 Positions and Health components at a time, and the inner loop iterating over
// the Walls.

// There's more than one way to vectorize this system -- this example may not be optimal.
pos_healths.for_each_mut_batched::<8>(
|(position, mut health)| {
// This forms the scalar path: it behaves just like `for_each_mut`.

// Optional: disable change detection for more performance.
let health = &mut health.bypass_change_detection().0;

// Test each (Position,Health) against each Wall.
walls.for_each(|wall| {
let plane = wall.0;

// Test which side of the wall we are on
let dotproj = plane.dot(position.0);

// Test against the Wall's displacement/discriminant value
if dotproj < wall.1 {
//Ouch! Take damage!
*health -= 1.0;
}
});
},
|(positions, mut healths)| {
// This forms the vector path: the closure receives a batch of
// 8 Positions and 8 Healths as arrays.

// Optional: disable change detection for more performance.
let healths = healths.bypass_change_detection();

// Treat the Health batch as a batch of 8 f32s.
unsafe {
// # SAFETY: Health is repr(transparent)!
let healths_raw = healths as *mut Health as *mut f32;
let mut healths = _mm256_loadu_ps(healths_raw);

// NOTE: array::map optimizes poorly -- it is recommended to unpack your arrays
// manually as shown to avoid spurious copies which will impact your performance.
let [p0, p1, p2, p3, p4, p5, p6, p7] = positions;

// Perform data layout conversion from AoS to SoA.
// ps_x will receive all of the X components of the positions,
// ps_y will receive all of the Y components
// and ps_z will receive all of the Z's.
let [ps_x, ps_y, ps_z] =
aos_to_soa_83(&[p0.0, p1.0, p2.0, p3.0, p4.0, p5.0, p6.0, p7.0]);

// Iterate over each wall without batching.
walls.for_each(|wall| {
// Test each wall against all 8 positions at once. The "broadcast" intrinsic
// helps us achieve this by duplicating the Wall's X coordinate over an entire
// vector register, e.g., [X X ... X]. The same goes for the Wall's Y and Z
// coordinates.

// This is the exact same formula as implemented in the scalar path, but
// modified to be calculated in parallel across each lane.

// Multiply all of the X coordinates of each Position against Wall's Normal X
let xs_dot = _mm256_mul_ps(ps_x, _mm256_broadcast_ss(&wall.0.x));
// Multiply all of the Y coordinates of each Position against Wall's Normal Y
let ys_dot = _mm256_mul_ps(ps_y, _mm256_broadcast_ss(&wall.0.y));
// Multiply all of the Z coordinates of each Position against Wall's Normal Z
let zs_dot = _mm256_mul_ps(ps_z, _mm256_broadcast_ss(&wall.0.z));

// Now add them together: the result is a vector register containing the dot
// product of each Position against the Wall's Normal vector.
let dotprojs = _mm256_add_ps(_mm256_add_ps(xs_dot, ys_dot), zs_dot);

// Take the Wall's discriminant/displacement value and broadcast it like before.
let wall_d = _mm256_broadcast_ss(&wall.1);

// Compare each dot product against the Wall's discriminant, using the
// "Less Than" relation as we did in the scalar code.
// The result will be either -1 or zero *as an integer*.
let cmp = _mm256_cmp_ps::<_CMP_LT_OS>(dotprojs, wall_d);

// Convert the integer values back to f32 values (-1.0 or 0.0).
// These form the damage values for each entity.
let damages = _mm256_cvtepi32_ps(_mm256_castps_si256(cmp)); //-1.0 or 0.0

// Update the healths of each entity being processed with the results of the
// collision detection.
healths = _mm256_add_ps(healths, damages);
});
// Now that all Walls have been processed, write the final updated Health values
// for this batch of entities back to main memory.
_mm256_storeu_ps(healths_raw, healths);
}
},
);
}
}

pub fn batched_compute(c: &mut Criterion) {
let mut group = c.benchmark_group("batched_compute");
group.warm_up_time(std::time::Duration::from_secs(1));
group.measurement_time(std::time::Duration::from_secs(9));

for exp in 14..17 {
let size = 2_i32.pow(exp) - 1; //Ensure scalar path gets run too (incomplete batch at end)

group.bench_with_input(
BenchmarkId::new("autovectorized", size),
&size,
|b, &size| {
let Benchmark(mut world) = Benchmark::new(size);

let mut system = IntoSystem::into_system(Benchmark::scalar);
system.initialize(&mut world);
system.update_archetype_component_access(&world);

b.iter(move || system.run((), &mut world));
},
);

group.bench_with_input(BenchmarkId::new("batched_avx", size), &size, |b, &size| {
let Benchmark(mut world) = Benchmark::new(size);

let mut system = IntoSystem::into_system(Benchmark::batched_avx);
system.initialize(&mut world);
system.update_archetype_component_access(&world);

b.iter(move || system.run((), &mut world));
});
}

group.finish();
}
16 changes: 16 additions & 0 deletions benches/benches/bevy_ecs/iteration/mod.rs
@@ -1,5 +1,8 @@
use criterion::*;

#[cfg(target_feature = "avx")]
mod batched_compute;

mod heavy_compute;
mod iter_frag;
mod iter_frag_foreach;
@@ -19,8 +22,21 @@ mod iter_simple_system;
mod iter_simple_wide;
mod iter_simple_wide_sparse_set;

#[cfg(target_feature = "avx")]
use batched_compute::batched_compute;

use heavy_compute::*;

#[cfg(target_feature = "avx")]
criterion_group!(
iterations_benches,
iter_frag,
iter_frag_sparse,
iter_simple,
heavy_compute,
batched_compute,
);
#[cfg(not(target_feature = "avx"))]
criterion_group!(
iterations_benches,
iter_frag,
1 change: 1 addition & 0 deletions crates/bevy_ecs/Cargo.toml
@@ -30,6 +30,7 @@ serde = { version = "1", features = ["derive"] }

[dev-dependencies]
rand = "0.8"
bevy_math = { path = "../bevy_math", version = "0.9.0-dev" }

[[example]]
name = "events"
9 changes: 6 additions & 3 deletions crates/bevy_ecs/src/archetype.rs
@@ -23,7 +23,10 @@ use crate::{
bundle::BundleId,
component::{ComponentId, StorageType},
entity::{Entity, EntityLocation},
storage::{ImmutableSparseSet, SparseArray, SparseSet, SparseSetIndex, TableId, TableRow},
storage::{
aligned_vec::SimdAlignedVec, ImmutableSparseSet, SparseArray, SparseSet, SparseSetIndex,
TableId, TableRow,
},
};
use std::{
collections::HashMap,
@@ -297,7 +300,7 @@ pub struct Archetype {
id: ArchetypeId,
table_id: TableId,
edges: Edges,
entities: Vec<ArchetypeEntity>,
entities: SimdAlignedVec<ArchetypeEntity>,
components: ImmutableSparseSet<ComponentId, ArchetypeComponentInfo>,
}

@@ -333,7 +336,7 @@ impl Archetype {
Self {
id,
table_id,
entities: Vec::new(),
entities: SimdAlignedVec::new(),
components: components.into_immutable(),
edges: Default::default(),
}