Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtraBuffers: revamping of the idea #802

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Conversation

albestro
Copy link
Collaborator

@albestro albestro commented Feb 17, 2023

This is a revamping of a concept we already tested in #386. Due to the important changes happened to the codebase in the meanwhile, together with the fact that it was just a feasibility test, we opted for the easier path of just re-implementing the idea.

ExtraBuffers is a utility that comes handy for the "reduce" pattern, i.e. multiple tasks wants to write on the same memory. ExtraBuffers gives you a set of "buffer" tiles that tasks can use to write in parallel, and at the end the reduction is done on request.

Note
Differently from the original one, this PR went for a simpler implementation, where "partial computations" are stored exclusively on ExtraBuffers tiles, without using the tile where the result will be reduced afterwards.

In this PR there is also a first usage example, in W2 computation inside ReductionToBand.

TODO

  • Evaluate if reduction output tile has to be resetted or not
  • Fix and improve the test
  • Fix GPU

@albestro albestro added this to the Optimizations milestone Feb 17, 2023
@albestro albestro self-assigned this Feb 17, 2023
@albestro albestro force-pushed the alby/new-extra-buffers branch from 343d087 to 4e52fcf Compare February 22, 2023 17:46
@albestro albestro force-pushed the alby/new-extra-buffers branch from 4e52fcf to f2d49df Compare February 22, 2023 17:51
@albestro albestro force-pushed the alby/new-extra-buffers branch from f2d49df to d9da590 Compare February 22, 2023 17:52
Comment on lines +475 to +476
ex::start_detached(tile::set0(dlaf::internal::Policy<B>(), w2.readwrite_sender(LocalTileIndex(0, 0))));
ex::start_detached(buffers.reduce(w2.readwrite_sender(LocalTileIndex(0, 0))));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to create a continuation with a single task?

[[nodiscard]] auto reduce(TileSender tile) {
namespace ex = pika::execution::experimental;

std::vector<ex::any_sender<pika::shared_future<matrix::Tile<const T, D>>>> buffers;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is so cumbersome, probably @msimberg is aware of this and is going to say that it is something we will manage when we will make the tile sender mechanism pika::future free.

I don't know if there is any workaround available, but AFAIK currently our Unwrapping facilities does not unwrap vector, so we just get it sent as it is even using our dlaf::internal::transform.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your complaint, but typedef exists 😉 I see that complex eventually being a helper typedef inside e.g. Matrix, or at least in the dlaf::matrix namespace and then you'll be able to write std::vector<ReadOnlyTileSender> or something like that. Would that partially or completely take care of your concerns here? This is what I'm planning on doing in #766 (because currently it's also dealing with long unwieldy types).

Comment on lines 48 to 70
return ex::when_all(std::move(tile), ex::when_all_vector(std::move(buffers))) |
dlaf::internal::transform(dlaf::internal::Policy<DefaultBackend_v<D>>(),
[](const matrix::Tile<T, D>& tile,
const std::vector<pika::shared_future<matrix::Tile<const T, D>>>&
buffers,
auto&&... ts) {
for (const auto& buffer : buffers) {
if constexpr (D == Device::CPU) {
static_assert(sizeof...(ts) == 0,
"Parameter pack should be empty for MC.");
dlaf::tile::internal::add(T(1), buffer.get(), tile);
}
#ifdef DLAF_WITH_GPU
else if constexpr (D == Device::GPU) {
dlaf::tile::internal::add(T(1), buffer.get(), tile, ts...);
}
#endif
else {
DLAF_STATIC_UNIMPLEMENTED(T);
}
}
});
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really not nice...15 lines for saying

for (const auto& buffer : buffers)
  dlaf::tile::internal::add(T(1), buffer.get(), <optional whip::stream_t>);

I know we can go with something like

dlaf::tile::internal::add(T(1), buffer.get(), ts...);

but we will loose a bit of static checking...

do we have any better alternative?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

2 participants