Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: a2a various optimizations #1067

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

samnordmann
Copy link
Collaborator

@samnordmann samnordmann commented Jan 2, 2025

What

This PR contains various optimizations for TL/MLX5/a2a, leading to significant performance gain
before_after

Support rectangular blocks

this is a critical optimization that brings immediate performance benefits since it gives more flexibility in choosing the block dimension to better saturate the transpose unit. To complete this feature, we expose two (independent) options for determining the block dimensions h and w:

  • FORCE_WIDER, imposing h <= w
  • FORCE_LONGER, imposing h >= w
    rect_blocks

Reuse device memory chunks for several blocks

as long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QPE and 2) use the same device memory chunks. This allows to use one dm chunk to post (and offload to the NIC) the processing of a batch of blocks. This makes the algorithm wait less on free device memory chunks. This option is controlled by the option NBR_SERIALIZED_BATCHES.

output5

batch the inter-node RDMA sends

we allow successive results of the transpose WQE to be batched before being sent to a remote peer. This allows to better saturate the network by aggregating the message. The batch size is controlled by SEND_BATCH_SIZE

batch_size

Iterate across nodes before blocks when posting the WQEs

allows to better saturate the NW. This option is controlled by NBR_BATCHES_PER_PASSAGE which sets the number of batches to send to a remote peer before moving to the next one. The old behavior corresponds to large values of this parameter, i.e.,NBR_BATCHES_PER_PASSAGE >> 1

batch_per_passage2

option to force regular case

through the TL/MLX5's env FORCE_REGULAR, forcing the chosen block dimensions to divide ppn. This option is useful 1) for debug purposes, and also 2) since the regular case is expected to perform better than the irregular case.

All those optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.

@@ -786,12 +835,15 @@ UCC_TL_MLX5_PROFILE_FUNC(ucc_status_t, ucc_tl_mlx5_alltoall_init,
== a2a->node.asr_rank);
int n_tasks = is_asr ? 5 : 3;
int curr_task = 0;
int ppn = tl_team->a2a->node.sbgp->group_size;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe align assignments and variables?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang format did that. It is aligned with the line below. Is it ok or do you suggest something else?

Copy link
Collaborator

@janjust janjust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left minor comments, but overall looks very solid - thanks!

@janjust janjust requested a review from MamziB February 20, 2025 16:14
Comment on lines 37 to 43
{"FORCE_LONGER", "y", "Force the blocks to have more height than width",
ucc_offsetof(ucc_tl_mlx5_lib_config_t, force_longer),
UCC_CONFIG_TYPE_BOOL},

{"FORCE_WIDER", "n", "Force the blocks to have more width than height",
ucc_offsetof(ucc_tl_mlx5_lib_config_t, force_wider), UCC_CONFIG_TYPE_BOOL},

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably it's better to have single enum that controls block shape. for example UCC_TL_MLX5_ALLTOALL_BLOCK_SHAPE=[wide,long,regular,auto]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the three options are actually orthogonal.

regular/not regular is about dividing ppn,
and wider/longer is understood with closed inequalities, so for example both wider and longer needs to be activated to ensure square blocks.

Does it make sense ? Maybe there is something cleaner but I couldn't figure it out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe regular is not a good name since we already have an option with the same name, but block_shape=[wide, long, square] should work

{
return block_size * ucc_max(power2(block_size) * msgsize, MAX_MSG_SIZE) <=
MAX_TRANSPOSE_SIZE;
size_t t1 = power2(ucc_max(msgsize, 8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be ucc_align_up_pow2 instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The goal of this function is to round to the closest power of 2 which is equal or greater than x.
But ucc_align_up_pow2 takes the power of two as argument, doesn't compute the closest greater power of 2.

But please correct me if I'm wrong

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, there is a typo in the current implementation: we need to start from p=1, not p=2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, pls move it to ucc_math.h

}

static inline int get_block_size(ucc_tl_mlx5_schedule_t *task)
static inline void get_block_dimensions(int ppn, int msgsize, int force_regular,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe do something like this. It will need less number of iteration to converge

static inline void get_block_dimensions(int ppn, int msgsize, int force_regular,
                                        int force_longer, int force_wider,
                                        int *block_height, int *block_width)
{
    size_t max_area = MAX_TRANSPOSE_SIZE / msgsize;
    int h = (int)sqrt(max_area); 
    int w = max_area / h;

    h = (h > 64) ? 64 : (h < 1) ? 1 : h;
    w = (w > 64) ? 64 : (w < 1) ? 1 : w;

    if (force_regular) {
        while (h > 1 && ppn % h != 0) h--;
        while (w > 1 && ppn % w != 0) w--;
    }
    if (force_longer && w > h) {
        int temp = h; h = w; w = temp;
    }
    if (force_wider && h > w) {
        int temp = h; h = w; w = temp;
    }

    while (h > 1 && !block_size_fits(msgsize, h, w)) h--;
    while (w > 1 && !block_size_fits(msgsize, h, w)) w--;

    *block_height = h;
    *block_width  = w;
}

Copy link
Collaborator Author

@samnordmann samnordmann Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid your proposition is not able to compute the optimal (h,w).

I agree your function is cheaper as it is O(n) while mine is O(n**2), but since n=64 is fixed here, I wouldn't worry too much about the complexity (but I might be wrong).

I admit there might be a more clever way to achieve what I want, but unfortunately your proposition doesn't seem to work optimally, as you can see in the following csv comparing the results of the two versions (the current one is called "v1", yours "v2")
block_dimensions.csv

I generated it from the following python script:

import math
import pandas as pd

MAX_MSIZE = 8192
MAX_NCOLS = 64
MAX_NROWS = 64
MAX_MSGSIZE = 128

def round_to_next_power_of_2(n: int) -> int:
    return 2 ** math.ceil(math.log2(n))

def block_size_for_HW_limit(msgsize: int, nrows: int, ncols: int) -> int:
    return nrows * max(round_to_next_power_of_2(ncols) * round_to_next_power_of_2(max(msgsize, 8)), 128)

def block_size_fits(msgsize: int, nrows: int, ncols: int) -> bool:
    return msgsize <= MAX_MSGSIZE and nrows <= MAX_NROWS and ncols <= MAX_NCOLS and block_size_for_HW_limit(msgsize, nrows, ncols) <= MAX_MSIZE

def get_block_dimensions(ppn: int, msgsize: int, force_regular: bool, force_longer: bool, force_wider: bool) -> tuple[int, int]:
    h_best: int = 1
    w_best: int = 1
    for h in range(1, MAX_NCOLS + 1):
        if force_regular and ppn % h != 0:
            continue
        for w in range(1, MAX_NCOLS + 1):
            if (force_regular and ppn % w != 0) or (force_wider and w < h) or (force_longer and w > h):
                continue
            if block_size_fits(msgsize, h, w) and h * w >= h_best * w_best:
                if h * w > h_best * w_best or abs(h / w - 1) < abs(h_best / w_best - 1):
                    h_best = h
                    w_best = w
    return (h_best, w_best)

def get_block_dimensions_v2(ppn: int, msgsize: int, force_regular: bool, force_longer: bool, force_wider: bool) -> tuple[int, int]:
    max_area = MAX_MSIZE / MAX_MSGSIZE
    h = int(math.sqrt(max_area))
    w = int(max_area // h)

    h = max(1, min(h, MAX_NROWS))
    w = max(1, min(w, MAX_NCOLS))

    if force_regular:
        while h > 1 and ppn % h != 0:
            h -= 1
        while w > 1 and ppn % w != 0:
            w -= 1
    if force_longer and w > h:
        h, w = w, h
    if force_wider and h > w:
        h, w = w, h

    while h > 1 and not block_size_fits(msgsize, h, w):
        h -= 1
    while w > 1 and not block_size_fits(msgsize, h, w):
        w -= 1

    return (h, w)

def compute_all_block_dimensions():
    PPN = [2 ** i for i in range(1, 7)]
    MSGSIZE = [2 ** i for i in range(8)]
    FORCE_REGULAR=[False, True]
    FORCE_LONGER=[False, True]
    FORCE_WIDER=[False, True]
    VERSIONS = [("v1", get_block_dimensions), ("v2", get_block_dimensions_v2)]
    results = []
    for ppn in PPN:
        for msgsize in MSGSIZE:
            for force_regular in FORCE_REGULAR:
                for force_longer in FORCE_LONGER:
                    for force_wider in FORCE_WIDER:
                        for v_name, f in VERSIONS:
                            h, w = f(ppn, msgsize, force_regular, force_longer, force_wider)
                            results.append({'version': v_name,
                                            'ppn': ppn,
                                            'msgsize': msgsize,
                                            'force_regular': force_regular,
                                            'force_longer': force_longer,
                                            'force_wider': force_wider,
                                            'h': h,
                                            'w': w,
                                            'h * w': h * w})
    return pd.DataFrame(results)

if __name__ == "__main__":
    results = compute_all_block_dimensions()
    print(results)
    results.to_csv('block_dimensions.csv', index=False)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but it's possible to cut number of iterations depending on block shape. For instance

  1. if regular then max block size is ppn
  2. if force longer than max w is h and vice versa
  3. also min w should be equal or greater than (h_best * w_best) / h

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants