feat(nvidia): add ntops rms norm backend #616
fa89de9 to 0ad2354
eff11f2 to ea231e8
    weight_dtype,
    output_dtype,
):
    import ntops
Fixed. ntops is now imported at the top of src/ninetoothed/ops/rms_norm/codegen.py.
_PROJECT_DIR = pathlib.Path(__file__).resolve().parents[1]
_DEFAULT_DTYPES = ("float32", "float16", "bfloat16")
_DEFAULT_RMS_NORM_NDIMS = (2, 3)
Why are operator-specific things like this still in this file?
Fixed. The top-level script no longer contains RmsNorm defaults or dtype/rank settings. It now discovers src/ninetoothed/ops/*/codegen.py modules and delegates generation to the operator-local module.
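A minimal sketch of the discovery-and-delegate pattern described in this reply, for readers who want to picture it. The helper names and the `generate(output_dir)` hook are hypothetical; the actual script in this PR may structure this differently.

```python
import importlib.util
import pathlib


def discover_codegen_modules(ops_dir):
    """Map each operator name to its operator-local codegen.py path."""
    return {
        path.parent.name: path
        for path in sorted(pathlib.Path(ops_dir).glob("*/codegen.py"))
    }


def load_module(name, path):
    """Import a codegen.py file by path, without requiring a package layout."""
    spec = importlib.util.spec_from_file_location(f"{name}_codegen", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def generate_all(ops_dir, output_dir):
    """Delegate generation to every discovered operator module."""
    generated = {}
    for name, path in discover_codegen_modules(ops_dir).items():
        module = load_module(name, path)
        # `generate` is an assumed hook name, not taken from the PR.
        generated[name] = module.generate(output_dir)
    return generated
```

With this shape, adding an operator only means adding a new `<op>/codegen.py`; the top-level script needs no operator-specific defaults.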
unknown_ops = tuple(op for op in ops if op not in _OP_MODULES)

if unknown_ops:
    raise ValueError(f"unsupported ninetoothed ops: {', '.join(unknown_ops)}")
In messages, use "NineToothed" or ninetoothed; the same applies to the later occurrences.
Fixed. User-facing/error text in this script now uses NineToothed; package references stay as ninetoothed.
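For illustration, the check from the snippet under review could look like this after the wording change. This is a sketch only: `check_ops` and its `available` argument are hypothetical, and the exact message text in the PR may differ.

```python
def check_ops(requested, available):
    """Reject operator names that have no codegen module."""
    unknown = tuple(op for op in requested if op not in available)

    if unknown:
        # The user-facing text uses the product name "NineToothed";
        # the Python package itself is still spelled `ninetoothed`.
        raise ValueError(f"unsupported NineToothed ops: {', '.join(unknown)}")

    return tuple(requested)
```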
_DEFAULT_NDIMS = (2, 3)


def _premake(
Is this function really necessary? Couldn't something like functools.partial be used at the call site instead?
Partially kept by design after verification. I tried the functools.partial form remotely; NineToothed then exposed block_size in the generated launcher ABI, which forced the C++ bridge to pass it. The small _premake wrapper keeps block_size internal to RmsNorm codegen and keeps the launcher ABI smaller.
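The following sketch shows, in plain Python, one generic way a bound argument can stay visible after `functools.partial` while a wrapper hides it. It is not a claim about how NineToothed inspects premake functions; the `premake` signature and the value `1024` are made up for the example.

```python
import functools
import inspect


def premake(ndim, dtype, block_size):
    """Stand-in for an operator premake function (hypothetical signature)."""
    return (ndim, dtype, block_size)


# Option A: functools.partial binds block_size by keyword, but the parameter
# stays in the signature as a keyword-only argument with a default value.
partial_premake = functools.partial(premake, block_size=1024)


# Option B: a small wrapper removes block_size from the signature entirely.
def wrapped_premake(ndim, dtype):
    return premake(ndim, dtype, block_size=1024)


partial_params = list(inspect.signature(partial_premake).parameters)
wrapped_params = list(inspect.signature(wrapped_premake).parameters)

print(partial_params)  # ['ndim', 'dtype', 'block_size']
print(wrapped_params)  # ['ndim', 'dtype']
```

Both callables return the same result for the same `ndim` and `dtype`; only the visible parameter list differs, which is the property the wrapper is kept for.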
)


def _normalize_ndims(values):
Fixed. _normalize_ndims was removed; supported dims now stay as operator-local defaults in the RmsNorm codegen module.
This file should not be called ninetoothed.h; call it rms_norm.h for now. By the way, could this file be generated? After all, isn't its essence just preparing arguments? Could it be generated the way the neighboring PyTorch one is?
Fixed the immediate naming issue: the bridge header is now src/ninetoothed/ops/rms_norm/rms_norm.h. On generation: this file is currently the InfiniOps runtime bridge that validates descriptors, adapts weight/scalar tensors, and calls the generated launcher. It could be generated later like the PyTorch bridge, but that would need a separate bridge generator; this PR keeps the first operator explicit and small.
set(NINETOOTHED_PYTHON_EXECUTABLE "" CACHE FILEPATH "Python executable used to run ninetoothed code generation")
set(INFINIOPS_NINETOOTHED_OPS "rms_norm" CACHE STRING "Semicolon- or comma-separated NineToothed ops to generate")
set(INFINIOPS_NINETOOTHED_DTYPES "float32;float16;bfloat16" CACHE STRING "Semicolon- or comma-separated NineToothed dtypes to generate")
set(INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS "2;3" CACHE STRING "Semicolon- or comma-separated RmsNorm input ranks to generate with NineToothed")
The build system should not contain operator-specific content, and ideally not even things like dtype; otherwise it will not scale.
Fixed. The top-level CMake no longer contains concrete operator, dtype, or rank cache entries. It only keeps NINETOOTHED_PYTHON_EXECUTABLE for selecting the Python interpreter used for codegen.
  }

 private:
  void* data_;
Please check CONTRIBUTING.md again; a blank line is required between two lines in places like this.
Fixed. Added the required blank lines between the private data members in src/ninetoothed/tensor.h.
auto result = launch_infiniops_ninetoothed_rms_norm(
    static_cast<NineToothedStream>(stream_), ninetoothed::Tensor(input),
    ninetoothed::Tensor(const_cast<void*>(weight.data()),
Can't these be implicitly converted? Is it really necessary to write ninetoothed::Tensor explicitly?
Adjusted. The launcher call now passes local ninetoothed::Tensor adapters and relies on their implicit conversion to the generated NineToothedTensor parameter type. We still need explicit adapter construction at the InfiniOps boundary because raw Tensor, expanded weight shape/stride, and scalar arguments are not themselves generated tensor objects.
string(REPLACE "," ";" _ninetoothed_ops "${INFINIOPS_NINETOOTHED_OPS}")
string(REPLACE "," ";" _ninetoothed_dtypes "${INFINIOPS_NINETOOTHED_DTYPES}")
string(REPLACE "," ";" _ninetoothed_rms_norm_ndims "${INFINIOPS_NINETOOTHED_RMS_NORM_NDIMS}")
The build system should not contain specific operators, or even things like dtype.
Fixed. src/CMakeLists.txt no longer expands concrete op, dtype, or RmsNorm rank arguments. It only invokes the generic NineToothed generator with an output directory; operator defaults live in operator-local codegen.
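As a rough picture of what "only an output directory" means on the Python side, a generic entrypoint could accept a single argument like the sketch below. The `--output-dir` flag name is an assumption made for this example; the real script's command line is not shown in this conversation.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse the only build-system input: where generated files should go."""
    parser = argparse.ArgumentParser(
        description="Generate NineToothed operator sources."
    )
    # Hypothetical flag name. There are deliberately no op, dtype, or rank
    # options here; those defaults live in each operator's codegen module.
    parser.add_argument("--output-dir", type=pathlib.Path, required=True)

    return parser.parse_args(argv)
```

Keeping operator, dtype, and rank choices out of this interface is what lets CMake stay unchanged when new operators are added.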
ea231e8 to b664fe4
Summary
- [...] 9.
- [...] scripts/generate_ninetoothed_ops.py as the generic build entrypoint; it discovers operator codegen modules under src/ninetoothed/ops/*/codegen.py.
- [...] src/ninetoothed/ops/rms_norm/rms_norm.h and the common NineToothed tensor adapter in src/ninetoothed/tensor.h.
Motivation
This starts the InfiniOps integration path for kernels generated by InfiniTensor/ntops. RmsNorm is the first operator so the build, dispatch, generated-header, and Python-wrapper paths can be reviewed with a small concrete surface.
Closes #N/A
Type of Change
- feat: new feature / new operator / new platform
- fix: bug fix
- perf: performance improvement (no behavioral change)
- refactor: code restructuring without behavior change
- test: adding or fixing tests only
- docs: documentation only
- build / ci: build system or CI configuration
- chore: tooling, formatting, or other non-code changes
Platforms Affected
- WITH_CPU
- WITH_NVIDIA
- WITH_ILUVATAR
- WITH_METAX
- WITH_CAMBRICON
- WITH_MOORE
- WITH_ASCEND
- WITH_TORCH
Test Results on Supported Platforms
pytest Result
nvidia, infiniops-ci/nvidia:latest; built with WITH_NVIDIA=ON, WITH_NINETOOTHED=ON, GENERATE_PYTHON_BINDINGS=ON; smoke-tested implementation_index=9 for shapes (13, 4) and (2, 3, 17).
Validation output
Benchmark / Performance Impact
N/A. This PR adds the integration path and correctness coverage for the generated RmsNorm backend; performance benchmarking can follow after the interface is accepted.
Notes for Reviewers
- WITH_NINETOOTHED controls generation and wrapper discovery only. CMake does not carry concrete operator, dtype, or rank configuration.
- scripts/generate_ninetoothed_ops.py is intentionally generic. Operator-specific ntops build configuration lives beside the operator under src/ninetoothed/ops/rms_norm/codegen.py.
- [...] src/ninetoothed/tensor.h depend on each generated NineToothedTensor definition.
Checklist
Title, Branch, and Commits
- [...] <type>/xxx-yyyy-zzzz.
- [...] master; the branch is rebased on current master.
- [...] fixup! / squash! / wip commits remain.
Scope and Design
General Code Hygiene
C++ Specific
- clang-format was run with --dry-run --Werror on modified headers.
- [...] new / delete was introduced.
Python Specific
- ruff format --check passes.
- ruff check passes.
Testing
- [...] WITH_NINETOOTHED=ON passes.
- [...] implementation_index=9.
- [...] master is required.
Build, CI, and Tooling
- ninetoothed / ntops are build-time codegen dependencies when WITH_NINETOOTHED=ON.
Documentation
Security and Safety