feat(nvidia): add ntops rms norm backend #616

Draft
voltjia wants to merge 1 commit into master from feat/nvidia-ntops-rms-norm

@voltjia voltjia commented May 20, 2026

Summary

  • Add a NineToothed/ntops generated RmsNorm implementation registered on backend slot 9.
  • Add scripts/generate_ninetoothed_ops.py as the generic build entrypoint; it discovers operator codegen modules under src/ninetoothed/ops/*/codegen.py.
  • Add the RmsNorm bridge in src/ninetoothed/ops/rms_norm/rms_norm.h and the common NineToothed tensor adapter in src/ninetoothed/tensor.h.

Motivation

This starts the InfiniOps integration path for kernels generated by InfiniTensor/ntops. RmsNorm is the first operator so the build, dispatch, generated-header, and Python-wrapper paths can be reviewed with a small concrete surface.

Closes #N/A

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | Pass | Remote nvidia, infiniops-ci/nvidia:latest; built with WITH_NVIDIA=ON, WITH_NINETOOTHED=ON, GENERATE_PYTHON_BINDINGS=ON; smoke-tested implementation_index=9 for shapes (13, 4) and (2, 3, 17). |
| Iluvatar | N/A | N/A | This PR only wires the NineToothed RmsNorm path into the NVIDIA CUDA caller. |
| MetaX | N/A | N/A | This PR only wires the NineToothed RmsNorm path into the NVIDIA CUDA caller. |
| Cambricon | N/A | N/A | This PR only wires the NineToothed RmsNorm path into the NVIDIA CUDA caller. |
| Moore | N/A | N/A | This PR only wires the NineToothed RmsNorm path into the NVIDIA CUDA caller. |
| Ascend | N/A | N/A | This PR only wires the NineToothed RmsNorm path into the NVIDIA CUDA caller. |

Validation output
ruff format --check scripts/generate_ninetoothed_ops.py src/ninetoothed/ops/rms_norm/codegen.py scripts/generate_wrappers.py tests/test_generate_ninetoothed_ops.py
4 files already formatted

ruff check scripts/generate_ninetoothed_ops.py src/ninetoothed/ops/rms_norm/codegen.py scripts/generate_wrappers.py tests/test_generate_ninetoothed_ops.py
All checks passed!

pytest -q tests/test_generate_ninetoothed_ops.py
1 passed in 0.02s

clang-format --dry-run --Werror src/ninetoothed/tensor.h src/ninetoothed/ops/rms_norm/rms_norm.h

cmake -S . -B build/ninetoothed-review-20260525 -G Ninja -DWITH_NVIDIA=ON -DWITH_NINETOOTHED=ON -DGENERATE_PYTHON_BINDINGS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build/ninetoothed-review-20260525 --target ops -j 8
[81/81] Linking CUDA shared module src/ops.cpython-310-x86_64-linux-gnu.so

PYTHONPATH=/workspace/build/ninetoothed-review-20260525/src python3 /tmp/ntops_rms_smoke.py
impls [0, 9]
(13, 4) 2.1457672119140625e-06 True
(2, 3, 17) 3.0994415283203125e-06 True
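
The smoke script itself (/tmp/ntops_rms_smoke.py) is not shown on this page. As a rough sketch of what such a check could compare against, the following pure-Python reference computes RmsNorm over the last dimension together with a maximum absolute error. The function names, the eps default, and the list-of-rows layout are assumptions made for illustration, not the PR's actual test code.

```python
import math


def rms_norm_reference(rows, weight, eps=1e-6):
    """Scale each row by 1/sqrt(mean(x^2) + eps), then multiply by weight."""
    out = []
    for row in rows:
        mean_sq = sum(v * v for v in row) / len(row)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out


def max_abs_error(actual, expected):
    """Largest element-wise absolute difference between two row lists."""
    return max(
        abs(a - e)
        for row_a, row_e in zip(actual, expected)
        for a, e in zip(row_a, row_e)
    )
```

A backend result flattened to rows of the last dimension could then be compared with max_abs_error against this reference, using a tolerance appropriate to the dtype under test.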

Benchmark / Performance Impact

N/A. This PR adds the integration path and correctness coverage for the generated RmsNorm backend; performance benchmarking can follow after the interface is accepted.

Notes for Reviewers

  • WITH_NINETOOTHED controls generation and wrapper discovery only. CMake does not carry concrete operator, dtype, or rank configuration.
  • scripts/generate_ninetoothed_ops.py is intentionally generic. Operator-specific ntops build configuration lives beside the operator under src/ninetoothed/ops/rms_norm/codegen.py.
  • The common tensor adapter remains independent from generated op headers by using a templated conversion operator; this avoids making src/ninetoothed/tensor.h depend on each generated NineToothedTensor definition.

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzz.
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit.
  • No stray merge commits from master; the branch is rebased on current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal and scoped to the NineToothed RmsNorm integration.
  • No dead code, debug prints, or ownerless TODOs.
  • No unrelated formatting churn.
  • Public API changes are intentional and covered by tests/build validation.

General Code Hygiene

  • Code is self-explanatory; comments are only used where needed.
  • Modified files end with a single trailing newline.
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Comments and messages are in English.

C++ Specific

  • Code follows the repository style.
  • clang-format was run with --dry-run --Werror on modified headers.
  • No exceptions are thrown.
  • No raw new/delete was introduced.

Python Specific

  • ruff format --check passes.
  • ruff check passes.
  • Type hints/comments are kept consistent with surrounding code.

Testing

  • Remote pytest for the generator test passes.
  • Remote NVIDIA build with WITH_NINETOOTHED=ON passes.
  • Remote CUDA smoke explicitly calls implementation_index=9.
  • N/A. This is not a bug fix, so no regression test against master is required.

Build, CI, and Tooling

  • The affected NVIDIA build path compiles from a fresh build directory.
  • Existing GPU backend mutual exclusion remains unchanged.
  • No new runtime dependency was added to the normal runtime path; ninetoothed/ntops are build-time codegen dependencies when WITH_NINETOOTHED=ON.

Documentation

  • PR notes document the current design and scope.
  • No user-visible breaking change is introduced.

Security and Safety

  • No secrets or internal data committed.
  • Third-party code is used as a build-time dependency rather than vendored.
  • No unsafe pointer arithmetic beyond the existing tensor adapter boundary was added.

@voltjia voltjia force-pushed the feat/nvidia-ntops-rms-norm branch from fa89de9 to 0ad2354 Compare May 20, 2026 11:23
@voltjia voltjia force-pushed the feat/nvidia-ntops-rms-norm branch 2 times, most recently from eff11f2 to ea231e8 Compare May 25, 2026 08:28
Comment thread src/ninetoothed/ops/rms_norm/codegen.py Outdated
    weight_dtype,
    output_dtype,
):
    import ntops

Why not import this directly at the top of the file?

Fixed. ntops is now imported at the top of src/ninetoothed/ops/rms_norm/codegen.py.

Comment thread scripts/generate_ninetoothed_ops.py Outdated

_PROJECT_DIR = pathlib.Path(__file__).resolve().parents[1]
_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")
_DEFAULT_RMS_NORM_NDIMS = (2, 3)

Why are operator-specific settings like these still in this file?

Fixed. The top-level script no longer contains RmsNorm defaults or dtype/rank settings. It now discovers src/ninetoothed/ops/*/codegen.py modules and delegates generation to the operator-local module.
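
For readers following this thread, here is a minimal sketch of how a generic entrypoint might discover and delegate to operator-local codegen modules. Only the src/ninetoothed/ops/*/codegen.py layout comes from the PR; the function names and the per-module generate(output_dir) hook are hypothetical, and the real script may be structured differently.

```python
import importlib.util
import pathlib


def discover_codegen_modules(ops_dir):
    """Map operator name -> path of its codegen.py under ops_dir/*/codegen.py."""
    ops_dir = pathlib.Path(ops_dir)
    return {path.parent.name: path for path in sorted(ops_dir.glob("*/codegen.py"))}


def load_module(name, path):
    """Import a codegen module from an explicit file path."""
    spec = importlib.util.spec_from_file_location(f"codegen_{name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def generate_all(ops_dir, output_dir):
    """Call each operator's generate(output_dir) hook (assumed name)."""
    generated = []
    for name, path in discover_codegen_modules(ops_dir).items():
        module = load_module(name, path)
        # Hypothetical hook: the operator module owns its dtype/rank defaults.
        module.generate(pathlib.Path(output_dir) / name)
        generated.append(name)
    return generated
```

With this shape, adding an operator means adding a directory with its own codegen.py; the top-level script needs no edit.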

Comment thread scripts/generate_ninetoothed_ops.py Outdated
unknown_ops = tuple(op for op in ops if op not in _OP_MODULES)

if unknown_ops:
    raise ValueError(f"unsupported ninetoothed ops: {', '.join(unknown_ops)}")

In messages, use "NineToothed" or ninetoothed; the same applies to the later ones.

Fixed. User-facing/error text in this script now uses NineToothed; package references stay as ninetoothed.
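
A small sketch of what the adjusted check could look like, with the product name "NineToothed" in the user-facing message. The helper name check_ops and its available argument are hypothetical; only the message wording follows the reply above.

```python
def check_ops(requested, available):
    """Reject operator names that have no discovered codegen module."""
    unknown = tuple(op for op in requested if op not in available)

    if unknown:
        # User-facing text uses the product name, not the package name.
        raise ValueError(f"unsupported NineToothed ops: {', '.join(unknown)}")

    return tuple(requested)
```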

_DEFAULT_NDIMS = (2, 3)


def _premake(

Is this function really necessary? Couldn't something like functools.partial be used at the call site instead?

Partially kept by design after verification. I tried the functools.partial form remotely; NineToothed then exposed block_size in the generated launcher ABI, which forced the C++ bridge to pass it. The small _premake wrapper keeps block_size internal to RmsNorm codegen and keeps the launcher ABI smaller.
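
One plausible illustration of the difference, using only the standard library: functools.partial with a keyword binding leaves the bound parameter visible to signature inspection, while a small wrapper hides it. Whether NineToothed derives its launcher ABI this way is an assumption; premake here is a stand-in, and the block_size value is made up.

```python
import functools
import inspect


def premake(ndim, dtype, block_size):
    """Stand-in for an ntops-style premake; it only echoes its arguments."""
    return (ndim, dtype, block_size)


# Form 1: functools.partial binds block_size but keeps it in the
# introspected signature as a keyword-only parameter with a default.
partial_form = functools.partial(premake, block_size=1024)


# Form 2: a small wrapper fixes block_size internally, so callers and
# signature-based tooling never see it.
def wrapped_form(ndim, dtype):
    return premake(ndim, dtype, block_size=1024)


print(inspect.signature(partial_form))
print(inspect.signature(wrapped_form))
```

Both forms return the same result for the same ndim and dtype; they differ only in what a tool that inspects the callable's parameters would observe.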

Comment thread src/ninetoothed/ops/rms_norm/codegen.py Outdated
)


def _normalize_ndims(values):

What is this function for? Is it really needed?

Fixed. _normalize_ndims was removed; supported dims now stay as operator-local defaults in the RmsNorm codegen module.
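
As a sketch of how operator-local defaults could drive generation without any normalization helper, the snippet below enumerates one build configuration per (ndim, dtype) pair. The default tuples mirror values quoted earlier in this review; keeping both in the operator module, using the same dtype for input, weight, and output, and the build_configs name are assumptions.

```python
import itertools

# Operator-local defaults (values as quoted in the review threads).
_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")
_DEFAULT_NDIMS = (2, 3)


def build_configs(dtypes=_DEFAULT_DTYPES, ndims=_DEFAULT_NDIMS):
    """Yield one build configuration per (ndim, dtype) pair."""
    for ndim, dtype in itertools.product(ndims, dtypes):
        yield {
            "ndim": ndim,
            "input_dtype": dtype,
            "weight_dtype": dtype,  # assumption: same dtype for all tensors
            "output_dtype": dtype,
        }
```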

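This file should not be named ninetoothed.h; call it rms_norm.h for now. That said, could this file be generated? In essence it only prepares arguments, doesn't it? Could it be generated the way the neighboring PyTorch bridge is?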

Fixed the immediate naming issue: the bridge header is now src/ninetoothed/ops/rms_norm/rms_norm.h. On generation: this file is currently the InfiniOps runtime bridge that validates descriptors, adapts weight/scalar tensors, and calls the generated launcher. It could be generated later like the PyTorch bridge, but that would need a separate bridge generator; this PR keeps the first operator explicit and small.

Comment thread CMakeLists.txt Outdated
set(NINETOOTHED_PYTHON_EXECUTABLE "" CACHE FILEPATH "Python executable used to run ninetoothed code generation")
set(INFINIOPS_NINETOOTHED_OPS "rms_norm" CACHE STRING "Semicolon- or comma-separated NineToothed ops to generate")
set(INFINIOPS_NINETOOTHED_DTYPES "float32;float16;bfloat16" CACHE STRING "Semicolon- or comma-separated NineToothed dtypes to generate")
set(INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS "2;3" CACHE STRING "Semicolon- or comma-separated RmsNorm input ranks to generate with NineToothed")

The build system should not contain anything specific to a concrete operator, and ideally nothing like dtypes either; otherwise it will not scale.

Fixed. The top-level CMake no longer contains concrete operator, dtype, or rank cache entries. It only keeps NINETOOTHED_PYTHON_EXECUTABLE for selecting the Python interpreter used for codegen.

Comment thread src/ninetoothed/tensor.h
}

private:
void* data_;

Please check CONTRIBUTING.md again; a blank line is required between two lines in a place like this.

Fixed. Added the required blank lines between the private data members in src/ninetoothed/tensor.h.


auto result = launch_infiniops_ninetoothed_rms_norm(
static_cast<NineToothedStream>(stream_), ninetoothed::Tensor(input),
ninetoothed::Tensor(const_cast<void*>(weight.data()),

Can't these be implicitly converted? Is spelling out ninetoothed::Tensor really required?

Adjusted. The launcher call now passes local ninetoothed::Tensor adapters and relies on their implicit conversion to the generated NineToothedTensor parameter type. We still need explicit adapter construction at the InfiniOps boundary because raw Tensor, expanded weight shape/stride, and scalar arguments are not themselves generated tensor objects.

Comment thread src/CMakeLists.txt Outdated

string(REPLACE "," ";" _ninetoothed_ops "${INFINIOPS_NINETOOTHED_OPS}")
string(REPLACE "," ";" _ninetoothed_dtypes "${INFINIOPS_NINETOOTHED_DTYPES}")
string(REPLACE "," ";" _ninetoothed_rms_norm_ndims "${INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS}")

The build system should not mention concrete operators, or even things like dtypes.

Fixed. src/CMakeLists.txt no longer expands concrete op, dtype, or RmsNorm rank arguments. It only invokes the generic NineToothed generator with an output directory; operator defaults live in operator-local codegen.
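
To make the resulting interface concrete, here is a hedged sketch of a generator command line that accepts only an output directory, so the build system has nothing operator-specific to pass. The --output-dir flag name and the parse_args helper are assumptions; the actual script's arguments are not shown in this thread.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse the generator CLI: an output directory and nothing else."""
    parser = argparse.ArgumentParser(
        description="Generate NineToothed operator sources."
    )
    # Hypothetical flag name; no op, dtype, or rank options on purpose.
    parser.add_argument(
        "--output-dir",
        type=pathlib.Path,
        required=True,
        help="Directory that receives the generated sources.",
    )
    return parser.parse_args(argv)
```

With a parser like this, an attempt to pass an operator or dtype option from CMake fails at argument parsing, which keeps such configuration in the operator-local modules.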

@voltjia voltjia force-pushed the feat/nvidia-ntops-rms-norm branch from ea231e8 to b664fe4 Compare May 25, 2026 14:26