
[RFC] Removing Hard-coded module paths for Parallelization #4

Closed
HanGuo97 opened this issue Jan 26, 2025 · 7 comments
Labels
enhancement New feature or request

Comments

@HanGuo97
Collaborator

Proposal

The current parallelization utilities have hard-coded methods to obtain specific types of modules (e.g., layers, embedding, norms). For instance, the following line assumes that the model has a .model.layers attribute.

for layer_id, block in enumerate(model.model.layers):

This is not necessarily true for all models in the FLA library (Mamba2 uses .backbone). The folder contains a few other such instances.
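
For illustration, the same loop needs a different attribute path for a Mamba2-style model (assuming, as a sketch, that its backbone also exposes its block list as .layers):

# Typical xxxForCausalLM layout
for layer_id, block in enumerate(model.model.layers):
    ...

# Mamba2-style layout, where the block stack lives under `.backbone`
for layer_id, block in enumerate(model.backbone.layers):
    ...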

I am considering adding a new file parallelisms/utils.py that allows for the registration of model classes and corresponding getters, as shown below:

# Register a model class
ModelRegistry.register(
    xxxForCausalLM,
    embeddings_path="model.embedding",  # Custom path if different from default
    norm_path="model.norm",
    lm_head_path="lm_head",
    layers_path="model.layers"
)

# Utilize the registry in parallelisms/parallelize_fla.py
model = xxxForCausalLM(...)
embeddings = get_embeddings(model)
norm = get_norm(model)
lm_head = get_lm_head(model)
layers = get_layers(model)
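
For concreteness, a minimal sketch of what such a registry could look like (ModelRegistry, the get_* helpers, and the default paths are all hypothetical names from this proposal, not existing fla APIs):

import operator

class ModelRegistry:
    """Maps model classes to the attribute paths of their key submodules."""
    _registry = {}
    _defaults = {
        "embeddings_path": "model.embeddings",
        "norm_path": "model.norm",
        "lm_head_path": "lm_head",
        "layers_path": "model.layers",
    }

    @classmethod
    def register(cls, model_cls, **paths):
        cls._registry[model_cls] = {**cls._defaults, **paths}

    @classmethod
    def get(cls, model, key):
        paths = cls._registry.get(type(model), cls._defaults)
        # Resolve a dotted path such as "model.layers" on the model instance.
        return operator.attrgetter(paths[key])(model)

def get_embeddings(model):
    return ModelRegistry.get(model, "embeddings_path")

def get_norm(model):
    return ModelRegistry.get(model, "norm_path")

def get_lm_head(model):
    return ModelRegistry.get(model, "lm_head_path")

def get_layers(model):
    return ModelRegistry.get(model, "layers_path")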

Any thoughts? I would be happy to make a PR for that, but I am not familiar enough with the FLA library to determine if this is over-engineering. If most models indeed follow the hard-coded patterns, it might be simpler to just add some if-else statements within parallelize_fla.py.


@HanGuo97 HanGuo97 added the enhancement New feature or request label Jan 26, 2025
@yzhangcs
Member

@HanGuo97 Nice suggestion! PRs are welcome lol. I'd be happy to collaborate on this in the coming days!

@yzhangcs
Member

yzhangcs commented Jan 27, 2025

@HanGuo97 It feels like there are much simpler solutions.
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mamba/modeling_mamba.py#L385

We only need to revise a few things in fla and flame.

Update:

Fixed by fla-org/flash-linear-attention@7f9f83c

For now we can retrieve layers via

for i, layer in enumerate(getattr(model, model.base_model_prefix).layers):
    ...
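
For illustration (a hedged note, not from the original comment): base_model_prefix is a class attribute defined by each HF Transformers model family, so the same line resolves to the correct backbone regardless of the layout:

# base_model_prefix is set per model family in HF Transformers, e.g.
#   LlamaForCausalLM.base_model_prefix  == "model"     -> model.model.layers
#   Mamba2ForCausalLM.base_model_prefix == "backbone"  -> model.backbone.layers
backbone = getattr(model, model.base_model_prefix)
for i, layer in enumerate(backbone.layers):
    ...  # apply per-layer sharding/parallelization here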

@rakkit
Contributor

rakkit commented Jan 27, 2025

Generally speaking, in the context of FSDP we don't need a complex patch, as long as all blocks can be accessed via model.model.layers, which typically follows HF Transformers' xxxForCausalLM design. In principle, we should only use FSDP to shard the blocks (the head/embeddings should go to TP if needed).
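
A minimal sketch of that idea, assuming PyTorch's composable fully_shard API and the base_model_prefix accessor shown above (illustrative only, not the actual flame code):

from torch.distributed._composable.fsdp import fully_shard

def apply_fsdp(model, dp_mesh):
    # Shard each block individually ...
    for layer in getattr(model, model.base_model_prefix).layers:
        fully_shard(layer, mesh=dp_mesh)
    # ... then a root-level call groups the remaining parameters
    # (embeddings/norm/head, which could instead be handled by TP).
    fully_shard(model, mesh=dp_mesh)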

The actual problem comes from TP/CP; a possible way is to add the rules for each model individually in a separate file.

@yzhangcs
Member

yzhangcs commented Jan 27, 2025

@rakkit

the actual problem comes from TP/CP, a more realistic way is to add the rules for each model individually in a separate file.

Yeeeesssss! I found it's quite hard to define rules in this repo. I'll consider handling TP/CP in fla model by model in the near future.

@rakkit
Contributor

rakkit commented Jan 27, 2025

yip. maybe it's not that bad.

For TP, we can just add a function at the block, MLP, and attention level in fla that returns the tp_plan.
Here we can just call tp_plan = block.get_tp_plan(), and at the block level in the fla lib we can recursively query the MLP's/attention's TP plans.

This should give us enough flexibility to automatically handle any combination of blocks in fla.
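
A rough sketch of this recursive tp_plan idea, assuming PyTorch's tensor-parallel styles; get_tp_plan and the projection names are hypothetical, nothing like this exists in fla yet:

import torch.nn as nn
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class Attention(nn.Module):
    # ... existing fla attention definition ...
    def get_tp_plan(self):
        # Keys are sub-module paths relative to this module.
        return {
            "q_proj": ColwiseParallel(),
            "k_proj": ColwiseParallel(),
            "v_proj": ColwiseParallel(),
            "o_proj": RowwiseParallel(),
        }

class MLP(nn.Module):
    # ... existing fla MLP definition ...
    def get_tp_plan(self):
        return {
            "gate_proj": ColwiseParallel(),
            "up_proj": ColwiseParallel(),
            "down_proj": RowwiseParallel(),
        }

class Block(nn.Module):
    # ... existing fla block definition (self.attn, self.mlp) ...
    def get_tp_plan(self):
        # Recursively query sub-modules and re-prefix their plan keys.
        plan = {f"attn.{k}": v for k, v in self.attn.get_tp_plan().items()}
        plan.update({f"mlp.{k}": v for k, v in self.mlp.get_tp_plan().items()})
        return plan

def apply_tp(model, tp_mesh):
    # parallelize_fla.py can then handle any combination of blocks uniformly.
    for block in getattr(model, model.base_model_prefix).layers:
        parallelize_module(block, tp_mesh, block.get_tp_plan())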

The embedding, output norm, and head are easier to deal with, but I'm not sure whether TP & FusedCrossEntropyLoss can work out of the box.

@HanGuo97
Collaborator Author

Ah, thanks for making this easy upstream!

FYI, I made a PR for the FSDP part: #5. The TP part seems to be a bit more annoying, so I'm going to leave it out for now.

@yzhangcs
Member

Closing this issue as the original hard-coded path questions have been solved. Further discussion of 4D parallelism can be found in fla-org/flash-linear-attention#148.
