Added a few features & Quantized Flux Kontext in FP4 #6
Hi,
I have improved the replace function and added layer analytics to FP-Quant. Along the way I encountered a bug, which I have fixed. I also demonstrate how to quickly quantize the Flux Kontext model on the fly. Please take a look at the code below. I'm willing to integrate whichever parts you approve.
Bug Fix:
Replace
Layer Analytics
Quantizing Flux Kontext
Readme.md:
Here are timing measurements for running the quantized model on my RTX 5090, along with a quick example of using FP-Quant to quantize models on the fly.
Runtime per step (bf16): ~790ms*

Runtime per step (partially quantized, layer filter: ["quantized_layer_time"]/layer_analytics_list[key]["nn_layer_time"] > 0.95]): ~410ms (~353ms when measured without CUDA sync and CUDA events)
Runtime per step (fully quantized): ~403ms
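The partial-quantization filter above compares the ratio of `quantized_layer_time` to `nn_layer_time` against 0.95. A minimal sketch of that kind of split is shown here. The layer names and timings are made up, `split_by_time_ratio` is a hypothetical helper rather than part of FP-Quant, and the sketch only separates layers by the ratio; it does not assume which group ends up quantized.

```python
# Hypothetical layer-analytics records. The key names mirror the expression
# quoted above; the layer names and timings are invented for illustration.
layer_analytics_list = {
    "blocks.0.attn.to_q": {"quantized_layer_time": 0.40, "nn_layer_time": 1.00},
    "blocks.0.norm": {"quantized_layer_time": 0.98, "nn_layer_time": 1.00},
    "blocks.1.mlp.fc1": {"quantized_layer_time": 0.55, "nn_layer_time": 1.00},
}


def split_by_time_ratio(analytics, threshold=0.95):
    """Split layer names by quantized/original time ratio against a threshold."""
    above, at_or_below = [], []
    for key, record in analytics.items():
        ratio = record["quantized_layer_time"] / record["nn_layer_time"]
        # A ratio near 1.0 means quantizing the layer saved almost no time.
        (above if ratio > threshold else at_or_below).append(key)
    return above, at_or_below


above, at_or_below = split_by_time_ratio(layer_analytics_list)
print("ratio > 0.95:", above)
print("ratio <= 0.95:", at_or_below)
```

Keeping both groups makes it easy to inspect which layers gain little from quantization before deciding how to treat them.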
Here is NVIDIA's TensorRT FP4 result:
This is not a fair comparison without knowing the details of the machine, the GPU clocks, and so on. Still, I think this is an acceptable result, considering that TensorRT is a highly optimized engine.
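The gap between the synchronized (~410ms) and unsynchronized (~353ms) figures for the partially quantized model shows how much the measurement method affects comparisons like this one. The following is a minimal sketch of a per-step timer with an optional synchronization hook. It is not the code used for the numbers in this PR; `time_per_step` and `fake_step` are hypothetical names, and the step function is a CPU-only stand-in so the sketch runs anywhere.

```python
import time


def time_per_step(step_fn, n_steps=10, warmup=2, sync=None):
    """Return the mean wall-clock time per call of step_fn, in milliseconds."""
    for _ in range(warmup):
        step_fn()  # warm-up calls are excluded from the measurement
    if sync is not None:
        sync()  # drain any queued work before starting the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    if sync is not None:
        sync()  # wait for the last step to actually finish
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / n_steps


def fake_step():
    # Stand-in for one denoising step (assumption: the real step runs on GPU).
    time.sleep(0.002)


ms = time_per_step(fake_step, n_steps=5, warmup=1)
print(f"mean time per step: {ms:.2f} ms")
```

With PyTorch on a GPU, passing `sync=torch.cuda.synchronize` makes the timer wait for queued kernels; without it, the clock can stop before the asynchronous work has finished.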