TL/MLX5: a2a various optimizations #1067

samnordmann · 2025-01-02T16:38:06Z

What

This PR contains various optimizations for TL/MLX5/a2a, leading to significant performance gain

Support rectangular blocks

this is a critical optimization that brings immediate performance benefits since it gives more flexibility in choosing the block dimension to better saturate the transpose unit. To complete this feature, we expose two (independent) options for determining the block dimensions h and w:

FORCE_WIDER, imposing h <= w
FORCE_LONGER, imposing h >= w

Reuse device memory chunks for several blocks

as long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QPE and 2) use the same device memory chunks. This allows to use one dm chunk to post (and offload to the NIC) the processing of a batch of blocks. This makes the algorithm wait less on free device memory chunks. This option is controlled by the option NBR_SERIALIZED_BATCHES.

batch the inter-node RDMA sends

we allow successive results of the transpose WQE to be batched before being sent to a remote peer. This allows to better saturate the network by aggregating the message. The batch size is controlled by SEND_BATCH_SIZE

Iterate across nodes before blocks when posting the WQEs

allows to better saturate the NW. This option is controlled by NBR_BATCHES_PER_PASSAGE which sets the number of batches to send to a remote peer before moving to the next one. The old behavior corresponds to large values of this parameter, i.e.,NBR_BATCHES_PER_PASSAGE >> 1

option to force regular case

through the TL/MLX5's env FORCE_REGULAR, forcing the chosen block dimensions to divide ppn. This option is useful 1) for debug purposes, and also 2) since the regular case is expected to perform better than the irregular case.

All those optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.

src/components/tl/mlx5/alltoall/alltoall_coll.c

src/components/tl/mlx5/tl_mlx5.c

janjust · 2025-02-12T16:18:14Z

src/components/tl/mlx5/alltoall/alltoall_coll.c

@@ -786,12 +835,15 @@ UCC_TL_MLX5_PROFILE_FUNC(ucc_status_t, ucc_tl_mlx5_alltoall_init,
                                                        == a2a->node.asr_rank);
    int                     n_tasks   = is_asr ? 5 : 3;
    int                     curr_task = 0;
+    int                       ppn       = tl_team->a2a->node.sbgp->group_size;


Maybe align assignments and variables?

clang format did that. It is aligned with the line below. Is it ok or do you suggest something else?

janjust

I left minor comments, but overall looks very solid - thanks!

src/components/tl/mlx5/tl_mlx5.c

Sergei-Lebedev · 2025-01-24T14:28:27Z

src/components/tl/mlx5/tl_mlx5.c

+    {"FORCE_LONGER", "y", "Force the blocks to have more height than width",
+     ucc_offsetof(ucc_tl_mlx5_lib_config_t, force_longer),
+     UCC_CONFIG_TYPE_BOOL},
+
+    {"FORCE_WIDER", "n", "Force the blocks to have more width than height",
+     ucc_offsetof(ucc_tl_mlx5_lib_config_t, force_wider), UCC_CONFIG_TYPE_BOOL},
+


probably it's better to have single enum that controls block shape. for example UCC_TL_MLX5_ALLTOALL_BLOCK_SHAPE=[wide,long,regular,auto]

all the three options are actually orthogonal.

regular/not regular is about dividing ppn,
and wider/longer is understood with closed inequalities, so for example both wider and longer needs to be activated to ensure square blocks.

Does it make sense ? Maybe there is something cleaner but I couldn't figure it out.

maybe regular is not a good name since we already have an option with the same name, but block_shape=[wide, long, square] should work

src/components/tl/mlx5/tl_mlx5.c

src/components/tl/mlx5/alltoall/alltoall_mkeys.c

src/components/tl/mlx5/alltoall/alltoall_coll.c

Sergei-Lebedev · 2025-01-24T14:46:11Z

src/components/tl/mlx5/alltoall/alltoall_coll.c

 {
-    return block_size * ucc_max(power2(block_size) * msgsize, MAX_MSG_SIZE) <=
-           MAX_TRANSPOSE_SIZE;
+    size_t t1    = power2(ucc_max(msgsize, 8));


should it be ucc_align_up_pow2 instead?

I don't think so. The goal of this function is to round to the closest power of 2 which is equal or greater than x.
But ucc_align_up_pow2 takes the power of two as argument, doesn't compute the closest greater power of 2.

But please correct me if I'm wrong

btw, there is a typo in the current implementation: we need to start from p=1, not p=2

right, pls move it to ucc_math.h

src/components/tl/mlx5/alltoall/alltoall_coll.c

Sergei-Lebedev · 2025-02-26T09:35:52Z

src/components/tl/mlx5/alltoall/alltoall_coll.c

 }

-static inline int get_block_size(ucc_tl_mlx5_schedule_t *task)
+static inline void get_block_dimensions(int ppn, int msgsize, int force_regular,


maybe do something like this. It will need less number of iteration to converge

static inline void get_block_dimensions(int ppn, int msgsize, int force_regular, int force_longer, int force_wider, int *block_height, int *block_width) { size_t max_area = MAX_TRANSPOSE_SIZE / msgsize; int h = (int)sqrt(max_area); int w = max_area / h; h = (h > 64) ? 64 : (h < 1) ? 1 : h; w = (w > 64) ? 64 : (w < 1) ? 1 : w; if (force_regular) { while (h > 1 && ppn % h != 0) h--; while (w > 1 && ppn % w != 0) w--; } if (force_longer && w > h) { int temp = h; h = w; w = temp; } if (force_wider && h > w) { int temp = h; h = w; w = temp; } while (h > 1 && !block_size_fits(msgsize, h, w)) h--; while (w > 1 && !block_size_fits(msgsize, h, w)) w--; *block_height = h; *block_width = w; }

I'm afraid your proposition is not able to compute the optimal (h,w).

I agree your function is cheaper as it is O(n) while mine is O(n**2), but since n=64 is fixed here, I wouldn't worry too much about the complexity (but I might be wrong).

I admit there might be a more clever way to achieve what I want, but unfortunately your proposition doesn't seem to work optimally, as you can see in the following csv comparing the results of the two versions (the current one is called "v1", yours "v2")
block_dimensions.csv

I generated it from the following python script:

import math import pandas as pd MAX_MSIZE = 8192 MAX_NCOLS = 64 MAX_NROWS = 64 MAX_MSGSIZE = 128 def round_to_next_power_of_2(n: int) -> int: return 2 ** math.ceil(math.log2(n)) def block_size_for_HW_limit(msgsize: int, nrows: int, ncols: int) -> int: return nrows * max(round_to_next_power_of_2(ncols) * round_to_next_power_of_2(max(msgsize, 8)), 128) def block_size_fits(msgsize: int, nrows: int, ncols: int) -> bool: return msgsize <= MAX_MSGSIZE and nrows <= MAX_NROWS and ncols <= MAX_NCOLS and block_size_for_HW_limit(msgsize, nrows, ncols) <= MAX_MSIZE def get_block_dimensions(ppn: int, msgsize: int, force_regular: bool, force_longer: bool, force_wider: bool) -> tuple[int, int]: h_best: int = 1 w_best: int = 1 for h in range(1, MAX_NCOLS + 1): if force_regular and ppn % h != 0: continue for w in range(1, MAX_NCOLS + 1): if (force_regular and ppn % w != 0) or (force_wider and w < h) or (force_longer and w > h): continue if block_size_fits(msgsize, h, w) and h * w >= h_best * w_best: if h * w > h_best * w_best or abs(h / w - 1) < abs(h_best / w_best - 1): h_best = h w_best = w return (h_best, w_best) def get_block_dimensions_v2(ppn: int, msgsize: int, force_regular: bool, force_longer: bool, force_wider: bool) -> tuple[int, int]: max_area = MAX_MSIZE / MAX_MSGSIZE h = int(math.sqrt(max_area)) w = int(max_area // h) h = max(1, min(h, MAX_NROWS)) w = max(1, min(w, MAX_NCOLS)) if force_regular: while h > 1 and ppn % h != 0: h -= 1 while w > 1 and ppn % w != 0: w -= 1 if force_longer and w > h: h, w = w, h if force_wider and h > w: h, w = w, h while h > 1 and not block_size_fits(msgsize, h, w): h -= 1 while w > 1 and not block_size_fits(msgsize, h, w): w -= 1 return (h, w) def compute_all_block_dimensions(): PPN = [2 ** i for i in range(1, 7)] MSGSIZE = [2 ** i for i in range(8)] FORCE_REGULAR=[False, True] FORCE_LONGER=[False, True] FORCE_WIDER=[False, True] VERSIONS = [("v1", get_block_dimensions), ("v2", get_block_dimensions_v2)] results = [] for ppn in PPN: for msgsize in MSGSIZE: for force_regular in FORCE_REGULAR: for force_longer in FORCE_LONGER: for force_wider in FORCE_WIDER: for v_name, f in VERSIONS: h, w = f(ppn, msgsize, force_regular, force_longer, force_wider) results.append({'version': v_name, 'ppn': ppn, 'msgsize': msgsize, 'force_regular': force_regular, 'force_longer': force_longer, 'force_wider': force_wider, 'h': h, 'w': w, 'h * w': h * w}) return pd.DataFrame(results) if __name__ == "__main__": results = compute_all_block_dimensions() print(results) results.to_csv('block_dimensions.csv', index=False)

ok, but it's possible to cut number of iterations depending on block shape. For instance

if regular then max block size is ppn

if force longer than max w is h and vice versa

also min w should be equal or greater than (h_best * w_best) / h

…us_optims

samnordmann added 2 commits January 2, 2025 18:35

TL/MLX5: fix nrows ncols in WQE

ffdcec9

TL/MLX5: various optimizations

5e60c92

samnordmann mentioned this pull request Jan 2, 2025

TL/MLX5: various optimizations #1012

Closed

TL/MLX5: tune default config

fd47eb5

samnordmann added the Ready-for-Review label Jan 2, 2025

samnordmann requested review from Sergei-Lebedev and janjust January 2, 2025 21:00

janjust added the Code-Review-Required label Jan 6, 2025

samnordmann commented Jan 22, 2025

View reviewed changes

src/components/tl/mlx5/alltoall/alltoall_coll.c Outdated Show resolved Hide resolved

samnordmann added Code-Review-Completed and removed Code-Review-Required labels Jan 27, 2025

janjust reviewed Feb 12, 2025

View reviewed changes

src/components/tl/mlx5/tl_mlx5.c Outdated Show resolved Hide resolved

janjust reviewed Feb 12, 2025

View reviewed changes

src/components/tl/mlx5/tl_mlx5.c Outdated Show resolved Hide resolved

janjust reviewed Feb 12, 2025

View reviewed changes

src/components/tl/mlx5/tl_mlx5.c Outdated Show resolved Hide resolved

janjust reviewed Feb 12, 2025

View reviewed changes

src/components/tl/mlx5/tl_mlx5.c Outdated Show resolved Hide resolved

janjust reviewed Feb 12, 2025

View reviewed changes

janjust approved these changes Feb 12, 2025

View reviewed changes

janjust requested a review from MamziB February 20, 2025 16:14

Sergei-Lebedev reviewed Feb 26, 2025

View reviewed changes

samnordmann added 3 commits February 27, 2025 15:27

minor comments

cbd6623

CODESTYLE: clang

0e922f8

Merge branch 'master' of github.com:openucx/ucc into tl_mlx5_pr_vario…

38d8d36

…us_optims

samnordmann requested a review from Sergei-Lebedev February 27, 2025 13:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TL/MLX5: a2a various optimizations #1067

TL/MLX5: a2a various optimizations #1067

samnordmann commented Jan 2, 2025 •

edited

Loading

janjust Feb 12, 2025

samnordmann Feb 26, 2025

janjust left a comment

Sergei-Lebedev Jan 24, 2025

samnordmann Feb 26, 2025

Sergei-Lebedev Feb 28, 2025

Sergei-Lebedev Jan 24, 2025

samnordmann Feb 26, 2025

samnordmann Feb 26, 2025

Sergei-Lebedev Feb 28, 2025

Sergei-Lebedev Feb 26, 2025

samnordmann Feb 27, 2025 •

edited

Loading

Sergei-Lebedev Feb 28, 2025

TL/MLX5: a2a various optimizations #1067

Are you sure you want to change the base?

TL/MLX5: a2a various optimizations #1067

Conversation

samnordmann commented Jan 2, 2025 • edited Loading

What

Support rectangular blocks

Reuse device memory chunks for several blocks

batch the inter-node RDMA sends

Iterate across nodes before blocks when posting the WQEs

option to force regular case

Choose a reason for hiding this comment

Choose a reason for hiding this comment

janjust left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samnordmann Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samnordmann commented Jan 2, 2025 •

edited

Loading

samnordmann Feb 27, 2025 •

edited

Loading