[feat] fsdp2 memory_efficient_init #117
Conversation
…uffers, _restore_non_persistent_buffers helpers
…ss memory_efficient through _lazy_wrap_model
…istent buffer detection
… into optimize_fsdp_init
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on optimizing the memory footprint during the initialization of FSDP models, particularly for large Transformer architectures. By leveraging meta-device initialization and distributed state dict handling, it allows models to be loaded and sharded more efficiently across multiple GPUs, reducing the peak memory consumption on both CPU and GPU. The changes ensure that both native FSDP and Accelerate-based FSDP strategies can benefit from these memory improvements, making it feasible to train larger models or use more aggressive sharding configurations.

Highlights
Code Review
This pull request introduces a significant optimization for FSDP model initialization by implementing a memory-efficient path. The changes are well-architected, with clear logic for both the accelerate and native_fsdp strategies. The use of a meta-device initialization flow, along with robust handling of environment variables for integration with the transformers library, is commendable. The addition of a comprehensive test suite, including unit, integration, and regression tests, greatly enhances the reliability of this new feature. I have one suggestion to further improve the test coverage for the environment variable manipulation logic.
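The environment-variable handling the review refers to can be pictured as a small context manager that sets the relevant flags around `from_pretrained` and restores the previous values afterwards. This is an illustrative re-implementation, not the PR's actual code; the variable names come from the PR description, everything else is an assumption.

```python
import os
from contextlib import contextmanager


@contextmanager
def fsdp_pretrained_load_context():
    """Illustrative sketch: temporarily set the env vars that steer
    transformers/Accelerate toward CPU-RAM-efficient FSDP loading.
    The real helper lives in the PR's load_context.py."""
    keys = ("ACCELERATE_USE_FSDP", "FSDP_CPU_RAM_EFFICIENT_LOADING")
    saved = {k: os.environ.get(k) for k in keys}
    try:
        for k in keys:
            os.environ[k] = "true"
        yield
    finally:
        # Restore prior values (or unset) even if loading raised.
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
```

The try/finally restore is what makes the manipulation "robust": a failed `from_pretrained` cannot leak FSDP flags into later, unrelated model loads.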
…st_sharded_state_dict
/gemini review
Code Review
This pull request introduces a memory_efficient_init parameter to enable memory-efficient FSDP model initialization, where only rank 0 loads full weights and broadcasts sharded parameters. This feature is integrated into AccelerateStrategy and NativeFSDPStrategy, and new utility functions are added to manage state dicts and buffers during this process. The documentation has been updated to reflect the new parameter. The reviewer suggested clarifying the transformers version specifier in the documentation, changing 4.57.x to <4.58, in both the English and Chinese versions.
/gemini summary
Summary of Changes

This pull request introduces a memory-efficient initialization mechanism for FSDP (Fully Sharded Data Parallel) models, applicable to both native PyTorch FSDP and Accelerate strategies. The core idea is to reduce peak memory and VRAM usage during model loading and initialization by having only rank 0 load full model weights, while other ranks use a meta-device and then receive sharded parameters via broadcast. This optimization allows for more efficient handling of large models, particularly with

Highlights
Changelog
Activity
PR type
PR information
Add `memory_efficient_init` support for FSDP (including Accelerate Strategy and Native FSDP Strategy) to reduce peak memory and VRAM usage during the model initialization phase.

Core idea: Before FSDP wrapping, only rank 0 holds the full parameters, while other ranks move the model to the meta device. After wrapping is completed, parameters are broadcast and sharded across ranks via `broadcast` + `distribute_tensor`, avoiding each rank loading the full model weights.

Main changes:

- `NativeFSDPStrategy.wrap_model`: Added a meta-device initialization process: save state_dict → `to('meta')` → `fully_shard` → broadcast sharded parameters → restore non-persistent buffers
- `AccelerateStrategy`: Achieves the same effect via the `cpu_ram_efficient_loading` configuration option and an environment variable context manager
- `load_context.py`: Provides `fsdp_pretrained_load_context`, which temporarily sets the `ACCELERATE_USE_FSDP`/`FSDP_CPU_RAM_EFFICIENT_LOADING` environment variables during `from_pretrained`

Experiment results
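The meta-device step of that flow can be demonstrated in a single process. The sketch below is a simplified, hypothetical illustration: it keeps the save → `to('meta')` → re-materialize sequence, but replaces the multi-rank `fully_shard` + broadcast/`distribute_tensor` stage (which needs an initialized process group) with a plain `load_state_dict`, as noted in the comments.

```python
import torch
from torch import nn


def meta_init_round_trip(model: nn.Module) -> nn.Module:
    """Single-process illustration of the meta-device initialization step.

    In the real multi-rank path, only rank 0 keeps `full_sd`; the other
    ranks go straight to meta, and after fully_shard the weights arrive
    via broadcast + distribute_tensor instead of load_state_dict below.
    """
    # 1. Save the full weights (on rank 0 only, in the distributed setting).
    full_sd = {k: v.detach().clone() for k, v in model.state_dict().items()}
    # 2. Drop the storage: parameters become shape-only meta tensors.
    model = model.to("meta")
    # 3. Allocate uninitialized storage back on a real device.
    model = model.to_empty(device="cpu")
    # 4. Fill it from the saved state dict (stand-in for broadcast/sharding).
    model.load_state_dict(full_sd)
    return model
```

Between steps 2 and 3 the model holds no parameter data at all, which is exactly where the peak-memory saving comes from: non-zero ranks never allocate the full weights.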
ENV: 4xH800, transformers==4.57.6, Qwen3-8B

Sampling CPU memory and GPU memory during the `transformers.from_pretrained` + optimizer creation + `wrap_model` process.

Accelerate: memory_efficient=True vs. memory_efficient=False
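The PR does not show its measurement code, but a memory sample of this kind can be taken with a small probe like the following (an assumed helper, using the stdlib `resource` module on Unix plus PyTorch's CUDA allocator counters when a GPU is present).

```python
import resource

import torch


def sample_peak_memory() -> dict:
    """Rough peak-memory probe: call after from_pretrained +
    optimizer creation + wrap_model. Illustrative only."""
    stats = {
        # ru_maxrss is reported in KiB on Linux (bytes on macOS).
        "cpu_peak_rss_kib": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
    }
    if torch.cuda.is_available():
        # Peak bytes ever allocated by the caching allocator on this rank.
        stats["gpu_peak_alloc_bytes"] = torch.cuda.max_memory_allocated()
    return stats
```

In a multi-rank run, each rank would log its own sample; the per-rank GPU peak is what the sharded initialization path is expected to lower.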
Accelerate vs. Native FSDP