Description
Problem Description
A few remarks on issues I encountered while building TransformerEngine:
- Not sure of the state of the `release_v2.1_rocm` branch, but it doesn't build with one of the latest bleeding-edge internal ROCm builds (# 16274), due to issues with clang-20 (not sure this is a suitable place for the details; I can provide more in private). I had to fall back to the previous `release_v1.4_rocm` branch, which doesn't seem to produce errors, at least not immediately.
(Everything below applies to the `release_v1.4_rocm` branch or older.)
- Controlling the setup process via env vars may not do what a user expects.
  - Some tests for env vars are invalid: `if os.getenv('NVTE_USE_HIPBLASLT') is not None:` only checks whether the variable is defined, so when it's defined as `NVTE_USE_HIPBLASLT=0` it won't do what a user expects it to do.
  - Something like `if int(os.getenv('NVTE_USE_HIPBLASLT', '0')) != 0:` would be much better.
- Default package build settings can spawn too many parallel compilation processes, causing the machine to run out of available physical memory and hang (the default Linux OOM handler does a poor job here). This has happened to me at least twice, on 2 different servers. The post-mortem of the first case is https://ontrack-internal.amd.com/browse/DCCS-2615 (this was ROCm 6.*). The second machine runs Ubuntu 22.04 on a CPU with 224 logical cores, 1 TB of memory, ROCm 7.0.0 # 16274. Suggestions:
  - `build_tools.utils.get_max_jobs_for_parallel_build()` supports the non-standard `NVTE_BUILD_MAX_JOBS` and `MAX_JOBS` environment variables, but it should also support the standard `CMAKE_BUILD_PARALLEL_LEVEL`. Why? Because as soon as a user sees that the build is CMake based, that's their go-to choice for controlling the parallelism level.
  - There should be a reasonable default value if the user specified none, assuming that each build process could take up to, say, 10-20 GiB of physical memory. Otherwise default build settings could still cause crashes.
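To make both suggestions concrete, here is a minimal sketch of how the env var handling could look. This is not the current TransformerEngine code: `env_flag`, `resolve_max_jobs`, the set of "false" strings, the precedence order of the three variables, and the 16 GiB per-job budget are all assumptions made for illustration.

```python
import os

# Values treated as "disabled" (an assumption, not TransformerEngine's convention).
_FALSE_VALUES = {"", "0", "false", "off", "no"}


def env_flag(name, default=False, environ=None):
    """Hypothetical helper: True only if the variable is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return default
    return value.strip().lower() not in _FALSE_VALUES


def _total_memory_gib():
    """Best-effort physical memory size in GiB, or None if it cannot be queried."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30
    except (AttributeError, ValueError, OSError):
        return None


def resolve_max_jobs(environ=None, cpu_count=None, total_mem_gib=None,
                     mem_per_job_gib=16):
    """Hypothetical resolver for the number of parallel build jobs.

    An explicit setting wins; the precedence order below is an assumption.
    Without one, the job count is capped by both CPU count and memory.
    """
    environ = os.environ if environ is None else environ
    for name in ("NVTE_BUILD_MAX_JOBS", "MAX_JOBS", "CMAKE_BUILD_PARALLEL_LEVEL"):
        value = environ.get(name, "").strip()
        if value:
            return max(1, int(value))

    cpus = cpu_count or os.cpu_count() or 1
    mem = total_mem_gib if total_mem_gib is not None else _total_memory_gib()
    if mem is None:
        # Memory size unknown: fall back to the CPU count alone.
        return cpus
    return max(1, min(cpus, int(mem // mem_per_job_gib)))
```

With this sketch, `NVTE_USE_HIPBLASLT=0` is treated as disabled. On a machine like the second one above (224 logical cores, memory taken as 1024 GiB), the default would be 64 jobs instead of 224.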
Operating System
Ubuntu 22.04, but likely affects all Linuxes at least.
CPU
See the description above.
GPU
Doesn't matter.
ROCm Version
Several. Current was 7.0.0 # 16274
ROCm Component
No response
Steps to Reproduce
Run pip install . in the repository root.
See the description above.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response