[feat] fsdp2 memory_efficient_init #117

Merged

kevssim merged 37 commits into modelscope:main from kevssim:optimize_fsdp_init on Mar 25, 2026
Conversation


@kevssim (Collaborator) commented Mar 18, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Add memory_efficient_init support for FSDP (including Accelerate Strategy and Native FSDP Strategy) to reduce peak memory and VRAM usage during the model initialization phase.

Core idea: Before FSDP wrapping, only rank 0 holds the full parameters, while other ranks move the model to the meta device. After wrapping is completed, parameters are broadcast and sharded across ranks via broadcast + distribute_tensor, avoiding each rank loading the full model weights.
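The core idea can be sketched in a few lines of PyTorch. This is a single-process illustration with a toy `nn.Linear`: in the real path, `fully_shard` runs on the meta-device model and rank 0 broadcasts shards via `broadcast` + `distribute_tensor`, which are stood in for here by passing the rank-0 state dict directly. `memory_efficient_init` is a hypothetical helper, not the PR's actual code.

```python
import torch
import torch.nn as nn

def memory_efficient_init(rank, model_cls, full_state_dict=None):
    # Rank 0 builds the model with real weights; every other rank
    # builds it on the meta device, which allocates no storage.
    if rank == 0:
        return model_cls()
    with torch.device('meta'):
        model = model_cls()
    # ... fully_shard(model) would run here, then rank 0 would
    # broadcast shards via dist.broadcast / distribute_tensor ...
    model = model.to_empty(device='cpu')      # allocate empty storage
    model.load_state_dict(full_state_dict)    # fill with received weights
    return model

make_model = lambda: nn.Linear(4, 4)
reference = memory_efficient_init(0, make_model)
replica = memory_efficient_init(1, make_model, reference.state_dict())
assert torch.equal(replica.weight, reference.weight)
```

The key property is that non-zero ranks never hold the full unsharded weights: the meta-device construction records only shapes and dtypes, and real storage is allocated only for the shards each rank receives.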

Main changes:

  • NativeFSDPStrategy.wrap_model: Added a meta-device initialization flow — save state_dict → to('meta') → fully_shard → broadcast sharded parameters → restore non-persistent buffers
  • AccelerateStrategy: Achieves the same effect via the cpu_ram_efficient_loading configuration option and environment variable context manager
  • Added load_context.py: Provides fsdp_pretrained_load_context, which temporarily sets the ACCELERATE_USE_FSDP / FSDP_CPU_RAM_EFFICIENT_LOADING environment variables during from_pretrained

Note: This optimization currently applies only to transformers <= 4.57.x (i.e. < 4.58); with transformers >= 5.0.x it may degrade performance.
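The `fsdp_pretrained_load_context` described above presumably amounts to a small `contextlib` pattern around the two environment variables. A stdlib-only sketch (the save/restore behavior on exit is an assumption, not copied from the PR):

```python
import os
from contextlib import contextmanager

@contextmanager
def fsdp_pretrained_load_context(enabled=True):
    """Temporarily set the env vars that steer transformers'
    from_pretrained toward Accelerate's RAM-efficient FSDP loading
    path, restoring any previous values on exit."""
    if not enabled:
        yield
        return
    overrides = {
        'ACCELERATE_USE_FSDP': 'true',
        'FSDP_CPU_RAM_EFFICIENT_LOADING': 'true',
    }
    saved = {k: os.environ.get(k) for k in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old

before = os.environ.get('ACCELERATE_USE_FSDP')
with fsdp_pretrained_load_context():
    inside = os.environ['ACCELERATE_USE_FSDP']   # set while loading
after = os.environ.get('ACCELERATE_USE_FSDP')    # restored afterwards
assert inside == 'true' and after == before
```

Wrapping only the `from_pretrained` call keeps the variables from leaking into the rest of the process, which matters because other Accelerate code paths also read `ACCELERATE_USE_FSDP`.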

Experiment results

ENV: 4xH800, transformers==4.57.6, Qwen3-8B

CPU and GPU memory were sampled throughout the transformers.from_pretrained + optimizer creation + wrap_model sequence.

Accelerate: memory_efficient=True vs. memory_efficient=False

(three memory-sampling plots)

Accelerate vs. Native FSDP

(three memory-sampling plots)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the memory footprint during the initialization of FSDP models, particularly for large Transformer architectures. By leveraging meta-device initialization and distributed state dict handling, it allows models to be loaded and sharded more efficiently across multiple GPUs, reducing the peak memory consumption on both CPU and GPU. The changes ensure that both native FSDP and Accelerate-based FSDP strategies can benefit from these memory improvements, making it feasible to train larger models or use more aggressive sharding configurations.

Highlights

  • Memory-Efficient FSDP Initialization: Introduced a new memory_efficient_init parameter in TransformersModel to enable optimized FSDP model loading, significantly reducing CPU and GPU memory usage during initialization for large models.
  • Native FSDP Strategy Enhancements: Implemented a meta-device flow within the NativeFSDPStrategy's wrap_model method. This involves moving the model to a meta-device, performing fully_shard, and then broadcasting the sharded state dictionary from rank 0 to materialize parameters on the correct devices.
  • Accelerate Strategy Integration: Ensured compatibility with the Accelerate FSDP strategy by conditionally setting ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING environment variables around from_pretrained calls, allowing Accelerate's built-in memory-efficient loading path to be utilized.
  • Non-Persistent Buffer Handling: Added utility functions (_get_non_persistent_buffers, _restore_non_persistent_buffers) to correctly save and restore non-persistent model buffers when moving models to and from meta-devices, preventing data loss.
  • Comprehensive Testing: Included new unit and integration tests to validate the memory-efficient FSDP initialization process, covering state dict broadcasting, buffer handling, and end-to-end model training with the new optimizations.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant optimization for FSDP model initialization by implementing a memory-efficient path. The changes are well-architected, with clear logic for both the accelerate and native_fsdp strategies. The use of a meta-device initialization flow, along with robust handling of environment variables for integration with the transformers library, is commendable. The addition of a comprehensive test suite, including unit, integration, and regression tests, greatly enhances the reliability of this new feature. I have one suggestion to further improve the test coverage for the environment variable manipulation logic.

@kevssim kevssim closed this Mar 23, 2026
@kevssim kevssim reopened this Mar 24, 2026
@kevssim kevssim changed the title from "[WIP] optimize memory usage of fsdp2 init model" to "[feat] fsdp memory_efficient_init" Mar 25, 2026
@kevssim kevssim marked this pull request as ready for review March 25, 2026 03:20

kevssim commented Mar 25, 2026

/gemini review

@kevssim kevssim changed the title from "[feat] fsdp memory_efficient_init" to "[feat] fsdp2 memory_efficient_init" Mar 25, 2026
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a memory_efficient_init parameter to enable memory-efficient FSDP model initialization, where only rank 0 loads full weights and broadcasts sharded parameters. This feature is integrated into AccelerateStrategy and NativeFSDPStrategy, and new utility functions are added to manage state dicts and buffers during this process. The documentation has been updated to reflect this new parameter. The reviewer suggested tightening the transformers version specifier in the documentation from "4.57.x" to "<4.58" in both the English and Chinese versions.


kevssim commented Mar 25, 2026

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request introduces a memory-efficient initialization mechanism for FSDP (Fully Sharded Data Parallel) models, applicable to both native PyTorch FSDP and Accelerate strategies. The core idea is to reduce peak memory and VRAM usage during model loading and initialization by having only rank 0 load full model weights, while other ranks use a meta-device and then receive sharded parameters via broadcast. This optimization allows for more efficient handling of large models, particularly with transformers versions up to 4.57.x.

Highlights

  • Memory-Efficient FSDP Initialization: Introduced a new memory_efficient_init parameter across the TransformersModel and its strategies to enable optimized FSDP model loading, significantly reducing peak memory usage during initialization.
  • Native FSDP Strategy Enhancement: Implemented a meta-device initialization flow within NativeFSDPStrategy.wrap_model. This involves moving the model to a meta-device, performing fully_shard, broadcasting sharded parameters from rank 0, and restoring non-persistent buffers to reduce memory footprint.
  • Accelerate Strategy Integration: Integrated memory-efficient initialization for Accelerate by introducing a context manager that conditionally sets ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING environment variables during from_pretrained calls.
  • Dedicated Load Context Module: Added a new module load_context.py providing fsdp_pretrained_load_context, a context manager to temporarily set environment variables for FSDP-aware loading in transformers.
  • Documentation Updates: Updated both English and Chinese documentation for TransformersModel to include the new memory_efficient_init parameter and its usage details.


Changelog
  • docs/source_en/Components/Model/TransformersModel.md
    • Added memory_efficient_init parameter to the TransformersModel constructor.
    • Documented the purpose and behavior of the memory_efficient_init parameter, including its default value and version compatibility notes.
  • docs/source_zh/组件/模型/TransformersModel.md
    • Added memory_efficient_init parameter to the TransformersModel constructor (Chinese).
    • Documented the purpose and behavior of the memory_efficient_init parameter in Chinese, including its default value and version compatibility notes.
  • src/twinkle/model/transformers/strategy/accelerate.py
    • Imported fsdp_pretrained_load_context for managing environment variables.
    • Added memory_efficient_init parameter to the AccelerateStrategy constructor and stored its value.
    • Introduced a pretrained_load_context method to return the FSDP-aware loading context.
    • Modified _fsdp_config_from_device_mesh to accept and utilize the memory_efficient flag for cpu_ram_efficient_loading.
  • src/twinkle/model/transformers/strategy/load_context.py
    • Added a new file defining fsdp_pretrained_load_context, a context manager for temporarily setting FSDP-related environment variables.
  • src/twinkle/model/transformers/strategy/native_fsdp.py
    • Imported fsdp_pretrained_load_context for managing environment variables.
    • Added memory_efficient_init parameter to the NativeFSDPStrategy constructor and stored its value.
    • Introduced a pretrained_load_context method to return the FSDP-aware loading context.
    • Implemented meta-device initialization logic within wrap_model, including saving non-persistent buffers, moving the model to a meta-device, unbinding optimizer parameters, broadcasting sharded state dicts, and restoring buffers.
  • src/twinkle/model/transformers/transformers.py
    • Added memory_efficient_init parameter to the TransformersModel constructor and stored its value.
    • Wrapped the model_cls.from_pretrained call with self.strategy.pretrained_load_context() to enable FSDP-aware loading.
    • Passed the memory_efficient_init flag to both NativeFSDPStrategy and AccelerateStrategy initializations.
Activity
  • gemini-code-assist[bot] provided an initial summary of changes.
  • gemini-code-assist[bot] suggested improving a test case to verify environment variable manipulation during from_pretrained.
  • gemini-code-assist[bot] recommended clarifying the version specifier 4.57.x to <4.58 in both English and Chinese documentation for better clarity.
  • kevssim requested a review.
  • kevssim requested a summary.

@kevssim kevssim merged commit 3ea0e88 into modelscope:main Mar 25, 2026
1 of 3 checks passed
