
[Issue]: Build system flaws report #215

@Arech8

Description

Problem Description

A few remarks on issues I encountered while building TransformerEngine:

  1. I'm not sure what state the release_v2.1_rocm branch is in, but it doesn't build with one of the latest bleeding-edge internal ROCm builds (# 16274) because of issues with clang-20 (this may not be a suitable place for the details; I can provide more in private). I had to fall back to the previous release_v1.4_rocm branch, which does not appear to produce errors, at least not immediately.

(Everything below applies to the release_v1.4_rocm branch or older.)

  1. Controlling the setup process via env vars may not do what a user expects.
    Some env var checks are invalid: if os.getenv('NVTE_USE_HIPBLASLT') is not None: only checks whether the variable is defined, so NVTE_USE_HIPBLASLT=0 is treated the same as NVTE_USE_HIPBLASLT=1, which is not what a user would expect.

    • Something like if int(os.getenv('NVTE_USE_HIPBLASLT','0')) != 0: would be much better.
  2. Default package build settings can spawn too many parallel compilation processes, causing the machine to run out of physical memory and hang (the default Linux OOM handler copes poorly with this). This happened to me at least twice, on 2 different servers; a post-mortem of the first case is at https://ontrack-internal.amd.com/browse/DCCS-2615 (that was on ROCm 6.*). The second machine runs Ubuntu 22.04 with 224 logical CPU cores, 1 TB of memory, and ROCm 7.0.0 # 16274. Suggestions:

    1. build_tools.utils.get_max_jobs_for_parallel_build() supports the non-standard NVTE_BUILD_MAX_JOBS and MAX_JOBS environment variables, but it should also support the standard CMAKE_BUILD_PARALLEL_LEVEL. Why? Because as soon as a user sees that the build is CMake-based, that is their go-to way to control the parallelism level.
    2. There should be a reasonable default if the user specifies none, assuming that each build process could take up to, say, 10-20 GiB of physical memory. Otherwise the default build settings could still cause crashes.
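
To make the suggestions above concrete, here is a minimal sketch in Python. It is not the existing TransformerEngine code: env_flag() and max_parallel_jobs() are hypothetical helper names, the set of accepted falsy strings is an assumption, and the 16 GiB per-job budget is just a midpoint of the 10-20 GiB estimate above. The memory probe relies on os.sysconf, which is only available on Unix-like systems.

```python
import os

# Assumption: values a user would reasonably expect to mean "disabled".
_FALSY = {"", "0", "false", "off", "no"}


def env_flag(name: str, default: bool = False) -> bool:
    """Return True only if the variable is set to a truthy value.

    Hypothetical helper: unlike `os.getenv(name) is not None`,
    NAME=0 is treated as disabled rather than enabled.
    """
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSY


def max_parallel_jobs(mem_per_job_gib: float = 16.0) -> int:
    """Pick a job count for the parallel build (hypothetical helper).

    An explicit user setting wins. Otherwise the count is capped by both
    the CPU count and the physical memory divided by an assumed
    per-job budget.
    """
    # Project-specific variables first, then the standard CMake one.
    for var in ("NVTE_BUILD_MAX_JOBS", "MAX_JOBS", "CMAKE_BUILD_PARALLEL_LEVEL"):
        raw = os.getenv(var, "").strip()
        if raw.isdigit() and int(raw) > 0:
            return int(raw)

    cpus = os.cpu_count() or 1
    try:
        # Total physical memory in GiB (Unix-only query).
        page_count = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
        total_gib = page_count * page_size / 2**30
    except (AttributeError, ValueError, OSError):
        # Memory size unknown: fall back to the CPU count alone.
        return cpus

    return max(1, min(cpus, int(total_gib // mem_per_job_gib)))
```

With helpers like these, the check in the setup script could read if env_flag('NVTE_USE_HIPBLASLT'):. Compared with the int(...) != 0 variant suggested above, this also accepts values such as "false" instead of raising ValueError on them. On the 224-core, 1 TB machine described above, a 16 GiB budget would cap the default at roughly 60 jobs instead of 224.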

Operating System

Ubuntu 22.04, but this likely affects at least all Linux distributions.

CPU

in the text

GPU

doesn't matter

ROCm Version

Several. Current was 7.0.0 # 16274

ROCm Component

No response

Steps to Reproduce

pip install .
read the description.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
