Add Finegrained FP8 #11647

Open
wants to merge 13 commits into
base: main

MekkCyber

What does this PR do?

Adds fine-grained (block-wise) FP8 weight quantization.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member

Just for bookkeeping, relaying stuff from our DM.

I had to make the following changes to make this PR work:

diff --git a/src/diffusers/models/modeling_utils.py b/src/diffusers/models/modeling_utils.py
index 638c5fbfb..737525143 100644
--- a/src/diffusers/models/modeling_utils.py
+++ b/src/diffusers/models/modeling_utils.py
@@ -1238,8 +1238,8 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
         }
 
         # Dispatch model with hooks on all devices if necessary
-        print(model.transformer_blocks[0].attn.to_q.weight)
-        print(model.transformer_blocks[0].attn.to_q.weight_scale_inv)
+        # print(model.transformer_blocks[0].attn.to_q.weight)
+        # print(model.transformer_blocks[0].attn.to_q.weight_scale_inv)
         if device_map is not None:
             device_map_kwargs = {
                 "device_map": device_map,
diff --git a/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py b/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py
index 5dec8b0b8..7212befcd 100644
--- a/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py
+++ b/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py
@@ -90,9 +90,9 @@ class FinegrainedFP8Quantizer(DiffusersQuantizer):
         Quantizes weights to FP8 format using Block-wise quantization
         """
         # print("############ create quantized param ########")
-        from accelerate.utils import set_module_tensor_to_device
+        # from accelerate.utils import set_module_tensor_to_device
 
-        set_module_tensor_to_device(model, param_name, target_device, param_value)
+        # set_module_tensor_to_device(model, param_name, target_device, param_value)
 
         module, tensor_name = get_module_from_name(model, param_name)
 
@@ -131,8 +131,8 @@ class FinegrainedFP8Quantizer(DiffusersQuantizer):
         scale = scale.reshape(scale_orig_shape).squeeze().reciprocal()
 
         # Load into the model
-        module._buffers[tensor_name] = quantized_param.to(target_device)
-        module._buffers["weight_scale_inv"] = scale.to(target_device)
+        module._parameters[tensor_name] = quantized_param.to(target_device)
+        module._parameters["weight_scale_inv"] = scale.to(target_device)
         # print("_buffers[0]", module._buffers["weight_scale_inv"])
 
     def check_if_quantized_param(
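
For readers following the diff: weight_scale_inv appears to hold one reciprocal scale per weight block (the hunk above ends the scale computation with .reciprocal()). Below is a minimal pure-Python sketch of that idea, assuming absmax scaling to the e4m3fn maximum of 448. The helper name blockwise_scale_inv and the 2x2 block size are illustrative and not taken from this PR, and the actual FP8 cast is left out.

```python
# Illustrative sketch only; this is not the code from the PR.
FP8_E4M3FN_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def blockwise_scale_inv(weight, block_rows, block_cols):
    """Return one inverse scale per (block_rows x block_cols) tile of `weight`.

    `weight` is a list of equal-length rows. Both dimensions are assumed to be
    exactly divisible by the block size.
    """
    n_rows, n_cols = len(weight), len(weight[0])
    assert n_rows % block_rows == 0 and n_cols % block_cols == 0

    scale_inv = []
    for r0 in range(0, n_rows, block_rows):
        row_of_scales = []
        for c0 in range(0, n_cols, block_cols):
            # Largest magnitude inside this tile.
            amax = max(
                abs(weight[r][c])
                for r in range(r0, r0 + block_rows)
                for c in range(c0, c0 + block_cols)
            )
            amax = max(amax, 1e-12)  # guard against all-zero tiles
            scale = FP8_E4M3FN_MAX / amax  # maps the tile onto the FP8 range
            row_of_scales.append(1.0 / scale)  # stored reciprocal, used to dequantize
        scale_inv.append(row_of_scales)
    return scale_inv


weight = [
    [0.5, -1.0, 4.0, 2.0],
    [0.25, 0.75, -8.0, 1.0],
]
scale_inv = blockwise_scale_inv(weight, block_rows=2, block_cols=2)
print(len(scale_inv), len(scale_inv[0]))  # grid is 1 x 2 for this toy weight
```

Multiplying a quantized tile by its entry in scale_inv recovers values in the original range, which is why the reciprocal is the quantity that gets stored.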

Inference code:

import torch
from diffusers import FluxPipeline, AutoModel, FinegrainedFP8Config
from diffusers.quantizers.finegrained_fp8.utils import FP8Linear

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = FinegrainedFP8Config(
    modules_to_not_convert=["norm", "proj_out", "x_embedder"], # weight_block_size=(32, 32)
)
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
    device_map="cuda"
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

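# Sanity check: list any FP8Linear modules whose scale collapsed to 1-D,
# i.e. modules that would violate the scale.ndim == 2 constraint.
for name, module in pipe.transformer.named_modules():
    if isinstance(module, FP8Linear) and getattr(module, "weight_scale_inv", None) is not None:
        if module.weight_scale_inv.ndim == 1:
            print(name, module.weight_scale_inv.shape)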


print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

modules_to_not_convert includes proj_out and x_embedder because quantizing them would violate the shape constraint on the scale (scale.ndim == 2).
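
One way this constraint can be violated, given the .squeeze() call in the quantizer hunk above: if a layer spans only a single block along one dimension, the 2-D scale grid has a size-1 axis and squeezing drops it. The sketch below only does the shape arithmetic. The (128, 128) block size and the layer shapes are assumptions for illustration, not values confirmed in this thread.

```python
def scale_grid_shape(out_features, in_features, block=(128, 128)):
    """Shape of the per-block scale grid, assuming exact divisibility."""
    return (out_features // block[0], in_features // block[1])


def squeeze(shape):
    """Mimic Tensor.squeeze() on a shape: drop every dimension of size 1."""
    return tuple(dim for dim in shape if dim != 1)


# Hypothetical layer shapes, chosen only to illustrate the effect.
for out_features, in_features in [(3072, 3072), (128, 3072)]:
    grid = scale_grid_shape(out_features, in_features)
    print((out_features, in_features), grid, "->", squeeze(grid))
```

The second shape ends up with a 1-D scale, which is exactly what the loop in the inference script looks for.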

@sayakpaul sayakpaul requested review from sayakpaul and DN6 and removed request for sayakpaul June 16, 2025 08:12
Member

@sayakpaul sayakpaul left a comment

Thanks for starting this! Would be nice to also have some benchmarks:

  • With and without finegrained FP8 quant (with visual outputs)
  • With and without torch.compile

class FinegrainedFP8Quantizer(DiffusersQuantizer):
    """
    FP8 quantization implementation supporting both standard and MoE models.
    Supports both e4m3fn formats based on platform.

Member

Can we expand on this a bit? What are both e4m3fn formats? How does that vary depending on the platform?
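
As background for this question: the docstring does not say which formats it means, so the following only illustrates the e4m3fn bit layout itself (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN pattern per sign). PyTorch also exposes a torch.float8_e4m3fnuz variant with a different bias and a smaller maximum; whether that is the second format the docstring refers to is a guess. The function name decode_e4m3fn is illustrative.

```python
def decode_e4m3fn(byte):
    """Decode one float8 e4m3fn bit pattern (0..255) into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only NaN pattern; the format has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Every pattern except the two NaN encodings (0x7F and 0xFF) is finite.
finite = [decode_e4m3fn(b) for b in range(256) if (b & 0x7F) != 0x7F]
largest = max(finite)
smallest_positive = min(v for v in finite if v > 0)
print(largest, smallest_positive)
```

The largest finite value is the constant a quantizer would scale each block's absolute maximum against.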

Comment on lines +133 to +135
# Load into the model
module._parameters[tensor_name] = quantized_param.to(target_device)
module._parameters["weight_scale_inv"] = scale.to(target_device)
Member

Do we have to tackle buffers as well?

Author

What do you mean by that? Aren’t weights usually just parameters?

Comment on lines +484 to +498
    def _check_serialization_expected_slice(self, expected_slice, device):
        quantized_model = self.get_dummy_model(device)

        with tempfile.TemporaryDirectory() as tmp_dir:
            quantized_model.save_pretrained(tmp_dir, safe_serialization=False)
            loaded_quantized_model = FluxTransformer2DModel.from_pretrained(
                tmp_dir, torch_dtype=torch.bfloat16, use_safetensors=False
            ).to(device=torch_device)

        inputs = self.get_dummy_tensor_inputs(torch_device)
        output = loaded_quantized_model(**inputs)[0]

        output_slice = output.flatten()[-9:].detach().float().cpu().numpy()

        self.assertTrue(numpy_cosine_similarity_distance(output_slice, expected_slice) < 1e-3)

Member

I think instead of delegating certain calls to other methods, we can have all of the implementations under this one. This way, everything remains self-contained. Furthermore, since this test class doesn't have other tests, we don't have to modularize too much.

WDYT?
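
For context on the assertion in the snippet above, here is a small stdlib-only sketch of a cosine similarity distance, assuming the usual definition of one minus the cosine similarity of the two flattened vectors. The function below is a stand-in written for illustration, not the numpy_cosine_similarity_distance helper from the test utilities, and the sample values are made up.

```python
import math


def cosine_similarity_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


expected = [0.1, -0.2, 0.3, 0.4]      # made-up reference slice
close = [0.1001, -0.1999, 0.3002, 0.4]  # small per-element deviations
scaled = [2.0 * x for x in expected]   # same direction, twice the magnitude

print(cosine_similarity_distance(expected, close) < 1e-3)
print(cosine_similarity_distance(expected, scaled))
```

Note that the metric ignores magnitude: the scaled vector has a distance of essentially zero, so a check like this would not flag an output that is off by a constant factor.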

Comment on lines +542 to +557
        text_encoder = CLIPTextModel.from_pretrained(
            model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16, cache_dir=cache_dir
        )
        text_encoder_2 = T5EncoderModel.from_pretrained(
            model_id, subfolder="text_encoder_2", torch_dtype=torch.bfloat16, cache_dir=cache_dir
        )
        tokenizer = CLIPTokenizer.from_pretrained(
            model_id, subfolder="tokenizer", cache_dir=cache_dir
        )
        tokenizer_2 = AutoTokenizer.from_pretrained(
            model_id, subfolder="tokenizer_2", cache_dir=cache_dir
        )
        vae = AutoencoderKL.from_pretrained(
            model_id, subfolder="vae", torch_dtype=torch.bfloat16, cache_dir=cache_dir
        )
        scheduler = FlowMatchEulerDiscreteScheduler()

Member

I don't think we need to initialize these components like this.

For example, if we do:

transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map=torch_device,
)
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16).to("cuda")

It should work. It's simpler and I would prefer this method.

Author

Sure, will change that.

# A difference of 0.06 in normalized pixel space (-1 to 1), corresponds to a difference of
# 0.06 / 2 * 255 = 7.65 in pixel space (0 to 255). On our CI runners, the difference is about 0.04,
# on DGX it is 0.06, and on audace it is 0.037. So, we are using a tolerance of 0.06 here.
self.assertTrue(np.allclose(output, loaded_output, atol=0.06))
Member

Can we reduce this tolerance?
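
To make the numbers in that code comment concrete, the conversion from a difference in normalized pixel space (-1 to 1) to 8-bit pixel space (0 to 255) can be checked with a few lines. The function name normalized_to_pixel_diff is illustrative; the three differences are the ones quoted in the comment.

```python
def normalized_to_pixel_diff(diff, lo=-1.0, hi=1.0, levels=255):
    """Convert a difference in [lo, hi] space to a difference in 0..levels space."""
    return diff / (hi - lo) * levels


# Observed differences quoted in the code comment under review.
for label, diff in [("CI runners", 0.04), ("DGX", 0.06), ("audace", 0.037)]:
    print(f"{label}: {normalized_to_pixel_diff(diff):.2f} pixel levels")
```

Two of the three quoted machines come in at roughly 5 pixel levels or fewer, well under the 7.65 that the current tolerance of 0.06 allows.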

@sayakpaul sayakpaul requested a review from SunMarc June 16, 2025 12:24
3 participants