Conversation

@RissyRan RissyRan commented Oct 2, 2025

Description

  • Migrate Gemma3 text layers (Gemma3DecoderLayer and Gemma3ScannableBlock) to NNX (see the illustrative sketch below).
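For context, a minimal sketch of the Linen-to-NNX migration pattern used for these layers (illustrative only; the class, attribute, and dimension names below are hypothetical and not the actual MaxText Gemma3DecoderLayer code):

from flax import nnx
import jax.numpy as jnp

class TinyDecoderLayer(nnx.Module):
  """Toy decoder-style block; NNX builds submodules eagerly in __init__."""

  def __init__(self, features: int, mlp_dim: int, *, rngs: nnx.Rngs):
    # In Linen this would live in setup() or an @nn.compact __call__;
    # in NNX the layer definition is explicit and stateful.
    self.pre_norm = nnx.LayerNorm(features, rngs=rngs)
    self.mlp_in = nnx.Linear(features, mlp_dim, rngs=rngs)
    self.mlp_out = nnx.Linear(mlp_dim, features, rngs=rngs)

  def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
    # Execution is separated from definition: plain Python calls, no bind/apply.
    h = self.pre_norm(x)
    h = nnx.relu(self.mlp_in(h))
    return x + self.mlp_out(h)

# Usage: construct with explicit RNGs, then call like a regular object.
layer = TinyDecoderLayer(features=16, mlp_dim=64, rngs=nnx.Rngs(0))
y = layer(jnp.ones((2, 8, 16)))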

Tests

Training tests

export BASE_OUTPUT_PATH=gs://runner-maxtext-logs/$(date +%Y-%m-%d)
export DATASET_PATH=gs://maxtext-dataset

python3 -m MaxText.train MaxText/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_PATH} dataset_path=${DATASET_PATH} \
  run_name=gemma3-4b-after per_device_batch_size=4 enable_checkpointing=false \
  model_name=gemma3-4b ici_fsdp_parallelism=4 steps=10 max_target_length=4096 \
  async_checkpointing=false dataset_type=synthetic dtype=bfloat16 \
  weight_dtype=bfloat16 scan_layers=True attention=flash
# Before

Total memory size: 30.7 GB, Output size: 5.4 GB, Temp size: 25.3 GB, Argument size: 5.4 GB, Host temp size: 0.0 GB.
Memstats: After params initialized:
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_0(process=0,(0,0,0,0))
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_1(process=0,(1,0,0,0))
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_2(process=0,(0,1,0,0))
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_3(process=0,(1,1,0,0))

completed step: 8, seconds: 1.600, TFLOP/s/device: 244.680, Tokens/s/device: 10239.808, total_weights: 65536, loss: 12.574


# After

Total memory size: 30.7 GB, Output size: 5.4 GB, Temp size: 25.3 GB, Argument size: 5.4 GB, Host temp size: 0.0 GB.
Memstats: After params initialized:
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_0(process=0,(0,0,0,0))
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_1(process=0,(1,0,0,0))
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_2(process=0,(0,1,0,0))
	Using (GB) 5.69 / 95.74 (5.943179%) on TPU_3(process=0,(1,1,0,0))

completed step: 8, seconds: 1.600, TFLOP/s/device: 244.663, Tokens/s/device: 10239.130, total_weights: 65536, loss: 12.619

Decoding tests

Noticed roughly 10 GB lower host memory usage during decoding, per the RAMstats below (32.51 GB before vs. 22.7 GB after).
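For reference, the numbers below came from a MaxText decode run along the following lines (illustrative invocation; the checkpoint path is a placeholder and the exact flags used for this test may differ):

python3 -m MaxText.decode MaxText/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_PATH} run_name=gemma3-4b-decode \
  model_name=gemma3-4b per_device_batch_size=1 \
  max_prefill_predict_length=16 max_target_length=64 \
  load_parameters_path=gs://<your-checkpoint-path>/checkpoints/0/items \
  prompt="I love to" scan_layers=True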


# Before

Memstats: After load_params:
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_0(process=0,(0,0,0,0))
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_1(process=0,(1,0,0,0))
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_2(process=0,(0,1,0,0))
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_3(process=0,(1,1,0,0))

RAMstats: After load_params:
	Using (GB) 32.51 / 440.83 (7.374725%) -->  Available:405.7

Input `I love to` -> ` cook and I love to eat. I'm always looking for new recipes and ways to make my meals more interesting. I'm also a big fan of healthy eating, so I try to incorporate lots of fruits, vegetables, and lean protein into my diet.`

# After

Memstats: After load_params:
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_0(process=0,(0,0,0,0))
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_1(process=0,(1,0,0,0))
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_2(process=0,(0,1,0,0))
	Using (GB) 3.61 / 95.74 (3.770629%) on TPU_3(process=0,(1,1,0,0))

RAMstats: After load_params:
	Using (GB) 22.7 / 440.83 (5.149377%) -->  Available:415.51

Input `I love to` -> ` cook and I love to eat. I'm always looking for new recipes and ways to make my meals more interesting. I'm also a big fan of healthy eating, so I try to incorporate lots of fruits, vegetables, and lean protein into my diet.`

Updated:

JetStream tests

Before:

Memstats: After load_params:
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_0(process=0,(0,0,0,0))
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_1(process=0,(1,0,0,0))
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_2(process=0,(0,1,0,0))
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_3(process=0,(1,1,0,0))

RAMstats: After load_params:
	Using (GB) 22.76 / 440.83 (5.162988%) -->  Available:415.28

After:

Memstats: After load_params:
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_0(process=0,(0,0,0,0))
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_1(process=0,(1,0,0,0))
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_2(process=0,(0,1,0,0))
	Using (GB) 1.81 / 95.74 (1.890537%) on TPU_3(process=0,(1,1,0,0))

RAMstats: After load_params:
	Using (GB) 22.7 / 440.83 (5.149377%) -->  Available:415.34

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.


github-actions bot commented Oct 2, 2025

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions bot left a comment


📋 Review Summary

This pull request migrates Gemma3DecoderLayer and Gemma3ScannableBlock from Flax Linen to NNX. The changes are well-structured and follow the imperative style of NNX. The core logic appears to be preserved correctly during the refactoring.

🔍 General Feedback

  • The migration to NNX is clean and improves the code's structure by separating layer definition from execution.
  • Good job on updating the necessary wrappers and configurations in decoders.py to accommodate the new NNX-based layers.
  • I've added a couple of minor suggestions to clean up unused imports and improve docstring consistency.

@RissyRan RissyRan force-pushed the gemma3_text_nnx branch 3 times, most recently from f5ae73f to c7a12b6 on October 2, 2025 at 23:48
@RissyRan RissyRan changed the title from "[WIP] Migrate Gemma3DecoderLayer and Gemma3ScannableBlock to NNX" to "Migrate Gemma3DecoderLayer and Gemma3ScannableBlock to NNX" on Oct 3, 2025

shuningjin commented Oct 3, 2025

The before/after JetStream accuracy differs slightly for gemma3: https://diff.googleplex.com/#key=1I4o3eENf3aa


@shuningjin shuningjin left a comment


Thanks for the new test results! The JetStream accuracy diff could be due to a missing enable_dropout=False in the config, which is out of scope for this PR. I see you increased the profiling time from 200 ms to 1000 ms; that might have contributed to the HLO matching.


@shuningjin shuningjin left a comment


LGTM!

@copybara-service copybara-service bot merged commit d6193a5 into main Oct 13, 2025
33 of 34 checks passed
@copybara-service copybara-service bot deleted the gemma3_text_nnx branch October 13, 2025 18:44