FlagTensor, part of FlagOS, is a high-performance tensor-primitive library oriented toward multiple hardware backends. It is implemented in Triton, the GPU programming language introduced by OpenAI. It provides implementations of common tensor primitives (for example, unary, binary, and contraction operations) and supports correctness and performance comparisons against cuTENSOR baselines.
This repository provides two GitHub Actions workflows under .github/workflows:
- flagtensor-ci: split into correctness and perf jobs for smoke-style automated validation.
- flagtensor-weekly: runs the weekly correctness and benchmark pipeline from an operator list.
The authoritative operator list lives in conf/operators.yaml.
It is used to track:
- operator category
- implementation path
- correctness / benchmark entry points
- supported benchmark modes
- blocked operators and skip reasons
By default, the local CI and weekly runners discover operators from this registry.
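As a rough illustration of how a runner might use such a registry, the sketch below filters an already-parsed registry into runnable and skipped operators for a given benchmark mode. The schema of conf/operators.yaml is not shown in this README, so every key (`operators`, `name`, `category`, `modes`, `blocked`, `skip_reason`), the function `discover_operators`, and the entries other than `relu` are assumptions made for this example, not the project's actual format or API.

```python
# Hypothetical registry contents, as they might look after parsing the YAML file.
# All keys and entries are illustrative assumptions, not the real schema.
REGISTRY = {
    "operators": [
        {"name": "relu", "category": "unary", "modes": ["kernel", "operator"]},
        {"name": "add", "category": "binary", "modes": ["kernel"]},
        {
            "name": "example_blocked_op",
            "category": "contraction",
            "modes": ["kernel", "operator"],
            "blocked": True,
            "skip_reason": "example skip reason",
        },
    ]
}


def discover_operators(registry, mode):
    """Split registry entries into runnable names and (name, reason) skips."""
    runnable, skipped = [], []
    for entry in registry.get("operators", []):
        name = entry["name"]
        if entry.get("blocked", False):
            # Blocked operators are reported together with their skip reason.
            skipped.append((name, entry.get("skip_reason", "blocked")))
        elif mode not in entry.get("modes", []):
            skipped.append((name, f"mode '{mode}' not supported"))
        else:
            runnable.append(name)
    return runnable, skipped


if __name__ == "__main__":
    runnable, skipped = discover_operators(REGISTRY, "operator")
    print("run:", runnable)
    for name, reason in skipped:
        print(f"skip: {name} ({reason})")
```

Keeping the skip reason next to the operator entry means a report can explain why an operator was not exercised instead of silently omitting it.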
Install and enable pre-commit locally:
pip install pre-commit
pre-commit install
The repository ships a .pre-commit-config.yaml with YAML, formatting, import ordering, lint, and C/C++ formatting hooks.
Both workflows support the benchmark mode input:
- kernel
- operator
The default mode is kernel.
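A mode option with a fixed set of values and a default can be expressed with `argparse` as in the minimal sketch below. It mirrors the `--mode` and `--results-dir` flags that appear in the local commands in this README, but it is not the actual parser of the repository's tools; `build_parser`, `BENCHMARK_MODES`, and the `results` default directory are assumptions for illustration.

```python
import argparse

# The two benchmark modes named in this README.
BENCHMARK_MODES = ("kernel", "operator")


def build_parser():
    """Build an illustrative CLI parser (not the real runner's parser)."""
    parser = argparse.ArgumentParser(description="Illustrative benchmark runner CLI")
    parser.add_argument(
        "--mode",
        choices=BENCHMARK_MODES,
        default="kernel",  # matches the documented default mode
        help="benchmark mode to run",
    )
    # The default directory name here is an assumption for this sketch.
    parser.add_argument("--results-dir", default="results")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--mode", "operator"])
    print(args.mode, args.results_dir)
```

Restricting the flag with `choices` rejects a mistyped mode at argument parsing time, before any benchmark work starts.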
- Trigger flagtensor-ci from workflow_dispatch when you want a quick automated check of the currently covered operators.
- Trigger flagtensor-weekly from workflow_dispatch when you want to run the weekly-style multi-operator pipeline.
- For flagtensor-weekly, you can optionally provide a custom operator list file; otherwise the workflow generates one from the discovered tests.
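The "custom list, otherwise generated" behavior described above could be sketched as follows. The on-disk format of the operator list file is not specified in this README, so the one-name-per-line format with `#` comments, and the helper name `load_operator_list`, are assumptions made for this example only.

```python
from pathlib import Path


def load_operator_list(op_list_path, discovered):
    """Return operator names from a list file, or fall back to discovered ones.

    Assumed file format (hypothetical): one operator name per line,
    with '#' starting a comment and blank lines ignored.
    """
    if op_list_path is None:
        # No explicit list: use the discovered operators in a stable order.
        return sorted(discovered)
    names = []
    for line in Path(op_list_path).read_text(encoding="utf-8").splitlines():
        line = line.split("#", 1)[0].strip()
        if line and line not in names:  # drop blanks and duplicates
            names.append(line)
    return names


if __name__ == "__main__":
    # Without a file, the discovered set is used.
    print(load_operator_list(None, {"relu", "add"}))
```

Sorting the fallback list keeps runs reproducible when the discovered operators come from an unordered collection.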
Run CI correctness locally:
python tools/run_flagtensor_ci.py --smoke --run-correctness --exclude-op tensor_contraction_trinary --mode kernel --results-dir ci_results_correctness
Run CI perf locally in kernel mode:
python tools/run_flagtensor_ci.py --smoke --run-perf --exclude-op tensor_contraction_trinary --mode kernel --results-dir ci_results_perf
Run CI perf locally in operator mode:
python tools/run_flagtensor_ci.py --smoke --run-perf --exclude-op tensor_contraction_trinary --mode operator --results-dir ci_results_perf_operator
Run weekly locally in kernel mode:
python tools/run_flagtensor_weekly.py --project-root . --gpus 0 --mode kernel --results-dir weekly_results_ci
Run weekly locally in operator mode:
python tools/run_flagtensor_weekly.py --project-root . --gpus 0 --mode operator --results-dir weekly_results_ci_operator
Run weekly with an explicit operator list (optional; generated from registry if omitted):
python tools/run_flagtensor_weekly.py --project-root . --op-list my_ops.txt --gpus 0 --mode kernel --results-dir weekly_results_ci
Key features:
- Tensor primitives have undergone performance tuning
- Triton kernel call optimization
- Flexible multi-backend support mechanism
- Support for common tensor primitives
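Install the dependencies:
pip install -U pip setuptools wheel
pip install torch triton pytest pyyaml matplotlib openpyxl
Clone the repository and install FlagTensor in editable mode:
git clone https://github.com/flagos-ai/FlagTensor.git
cd FlagTensor
pip install -e .
Quick start:
import torch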
import flagtensor
# Create a tensor
x = torch.randn(1024, device="cuda", dtype=torch.float32)
# Apply ReLU operator
y = flagtensor.relu(x)
This project is licensed under the Apache License, Version 2.0.