
[NPU] FIX fused_linear_jsd ub overflow and OOM on NPU #1043

Open
MAYUNHUI666 wants to merge 3 commits into linkedin:main from MAYUNHUI666:main

Conversation

@MAYUNHUI666

Summary

Distinguish the memory limits and the maximum supported shapes across different hardware scenarios, so fused_linear_jsd no longer overflows the unified buffer (ub) or runs out of memory on NPU.
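For context, the benchmark diff below reads the device's total memory via a helper named get_total_gpu_memory. A minimal sketch of what such a helper could look like, assuming a CUDA-visible device and GiB units (the PR's actual implementation may differ):

import torch

def get_total_gpu_memory() -> float:
    # Sketch only: total memory of device 0 in GiB, so that benchmark
    # shapes can be scaled per hardware; 0.0 when no accelerator is visible.
    if torch.cuda.is_available():
        return torch.cuda.get_device_properties(0).total_memory / 1024**3
    return 0.0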

Testing Done

  • Hardware Type: Ascend NPU 910B2
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@Tcc0403
Collaborator

Tcc0403 commented Jan 29, 2026

Given the device-specific shapes currently scattered across the codebase, I opened #1051 to discuss this issue and a possible path toward standardization. Feedback is very welcome!

  # However, setting limit as 65536 as in LayerNorm tutorial is faster because of less register spilling
  # The optimal maximum block size depends on your hardware, your kernel, and your dtype
- MAX_FUSED_SIZE = 4096 if infer_device() == "xpu" else 65536 // 2
+ MAX_FUSED_SIZE = 4096 if infer_device() == "npu" else 65536 // 2

append instead of replace

Suggested change
- MAX_FUSED_SIZE = 4096 if infer_device() == "npu" else 65536 // 2
+ MAX_FUSED_SIZE = 4096 if infer_device() in ["npu", "xpu"] else 65536 // 2
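For context, a cap like MAX_FUSED_SIZE is typically consumed when picking a kernel block size. A minimal sketch, where pick_block_size is a hypothetical helper and next_power_of_2 is written out instead of using triton's utility:

def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

def pick_block_size(n_cols: int, max_fused_size: int) -> int:
    # Hypothetical helper: round the row width up to a power of two,
    # but never past the device cap; exceeding the cap is what can
    # overflow the NPU's on-chip buffer (the "ub overflow" in the title).
    return min(max_fused_size, next_power_of_2(n_cols))

Capping per device is why appending "npu" alongside "xpu", rather than replacing it, matters: both devices keep their smaller 4096 limit while CUDA keeps 32768.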

Comment on lines +237 to +249
+ gpu_memory_gbs = get_total_gpu_memory()
+ if gpu_memory_gbs >= 69:
+     vocab_size = 128256
+ else:
+     vocab_size = 65536

  common_configs = {
      "kernel_name": "fused_linear_jsd",
      "x_name": "BT",
      "x_label": "B x T",
      "x_values": [2**i for i in range(10, 14)],
      "kernel_providers": ["liger", "torch"],
-     "extra_benchmark_configs": [{"H": 4096, "V": 128256, "mode": "forward", "dtype": torch.bfloat16}],
+     "extra_benchmark_configs": [{"H": 4096, "V": vocab_size, "mode": "forward", "dtype": torch.bfloat16}],

Let's lower the upper bound of x_values instead of vocab_size for now.

We can discuss which configs should be scalable when there is a memory constraint; see #1051.
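A minimal sketch of that alternative, reusing get_total_gpu_memory from the diff above and its assumed 69 GB threshold: keep V fixed at 128256 and drop the largest B x T point on smaller devices.

# Sketch only: shrink the sweep's upper bound instead of the vocab size.
max_exp = 14 if get_total_gpu_memory() >= 69 else 13
x_values = [2**i for i in range(10, max_exp)]  # up to 2**13, or 2**12 on small devices
extra_benchmark_configs = [{"H": 4096, "V": 128256, "mode": "forward", "dtype": torch.bfloat16}]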

