Open
87 commits
1259663
use v6 VM SKU
anson627 Aug 27, 2025
8631661
increase os disk
anson627 Aug 27, 2025
e3c0026
Add Doca/Mofed driver to GB200 image
anson627 Aug 26, 2025
16f319d
Add option to install DCGM, dcgm-exporter and the NVIDIA drivers for …
keith-ms Aug 13, 2025
8adb303
Fix nvidia list file
keith-ms Aug 13, 2025
e775baf
Fix nvidia-driver package name and version
keith-ms Aug 13, 2025
cd5a832
Fix incomplete systemd command
keith-ms Aug 13, 2025
f9f4ace
Pin to specific versions of NVIDIA packages correctly
keith-ms Aug 14, 2025
8eb34e7
Remove specific versions due to package installation failures
keith-ms Aug 14, 2025
3013a95
Fix NVIDIA package names
keith-ms Aug 14, 2025
e5480dc
Increase systemd log level to debug to troubleshoot boot errors
keith-ms Aug 18, 2025
b3f64ad
Revert systemd debugging kernel parameter
keith-ms Aug 19, 2025
c3c4c3f
Add label limit check to prevent kubelet from crashing
keith-ms Aug 20, 2025
076e3c1
Add additional NVIDIA dependencies to the GB200 image
keith-ms Aug 20, 2025
8013bf1
Add nvidia-container-toolkit configuration to containerd
keith-ms Aug 20, 2025
1407679
Fix CUDA package name
keith-ms Aug 20, 2025
d5761cc
Move label value truncation logic to types.go
keith-ms Aug 21, 2025
d881f97
Revert validation logic in Go code, add validation before KUBELET_NOD…
keith-ms Aug 22, 2025
4d70ddc
Revert validation, shorten artifact name
keith-ms Aug 22, 2025
b50159a
Revert changes, add truncation logic to cse_config.sh before node lab…
keith-ms Aug 22, 2025
f5cba9a
Add ExecStartPre script to trim KUBELET_NODE_LABELS, revert previous …
keith-ms Aug 22, 2025
c27135d
Remove changes related to truncating node label, specifying version n…
keith-ms Aug 22, 2025
61f8534
Revert changes made for truncating node label
keith-ms Aug 22, 2025
1fc3f68
Add k8s-device-plugin installation and enable with systemd
keith-ms Aug 27, 2025
20ff30b
Add oneshot systemd service to write nvidia-specific containerd file
keith-ms Aug 28, 2025
f4d9f95
Modify script and service file to fix dependency problem
keith-ms Aug 28, 2025
b71d82b
Require containerd service to run before replacing file
keith-ms Aug 28, 2025
af4fed4
Change dependency for containerd-nvidia-config service
keith-ms Aug 29, 2025
1cdc941
feat: add blacklist nouveau drivers, udev char rules, some comments (…
abenn135 Aug 29, 2025
e97d076
Move oneshot script to later in the boot process
keith-ms Aug 29, 2025
43f9c13
Remove DefaultDependencies=no, restore After=aks-log-collector.service
keith-ms Sep 2, 2025
17ffe0d
Make sure the nvidia config is written before the device plugin starts
keith-ms Sep 2, 2025
7c72693
Attempt to write containerd configuration directly after making modif…
keith-ms Sep 3, 2025
47c1001
Modify nvidia-device-plugin service to run after kubelet
keith-ms Sep 3, 2025
2532107
Remove reference to deleted files
keith-ms Sep 3, 2025
8bc843f
Add ExecStartPre to make sure the device-plugins directory exists
keith-ms Sep 4, 2025
79f1f54
Remove OFED driver installation to troubleshoot VHD build failure aro…
keith-ms Sep 4, 2025
d4f6af8
Fix formatting for package installation
keith-ms Sep 4, 2025
f708694
Break up apt commands to install kernel-related packages separately
keith-ms Sep 4, 2025
e8b0048
Enable openidb service
keith-ms Sep 4, 2025
952d0d8
Replace MLNX_OFED with DOCA-OFED
keith-ms Sep 4, 2025
084256b
Keith ms/add nvidia module parameters (#7071)
keith-ms Sep 22, 2025
ec33d3d
Remove reference to ARM64 check in GB200 image, only set 60GB volume …
keith-ms Oct 7, 2025
3da9641
Split out GB200 build into new packer file
keith-ms Oct 7, 2025
781dc5a
Remove specific package version specification for the driver, bump to…
keith-ms Oct 8, 2025
1a910a5
Add the ability to specify a custom local repository to download (#7186)
keith-ms Oct 16, 2025
0e4851f
tidy up messy merge -- remove bad mellanox deb repo reference
abenn135 Oct 24, 2025
44faf9b
fix shell test failure. Glob matching is tidier in bash
abenn135 Oct 24, 2025
a6fb228
add recent additions to vhd/packer/vhd-image-builder-arm64-gen2.json
abenn135 Oct 27, 2025
d9cf9e4
Address some nits: re-enable CIS reports, follow doca latest, add new…
abenn135 Oct 28, 2025
720979a
bug: add DCGM CUDA13-compatible packages to GB200 image. Also add mul…
abenn135 Oct 31, 2025
40e86cb
chore: revert bad aks-node-controller changes (#7336)
cameronmeissner Nov 7, 2025
dce944d
Revert "chore: revert bad aks-node-controller changes (#7336)" (#7337)
cameronmeissner Nov 8, 2025
c39f4f8
chore: final fix for gb200 release branch (#7338)
cameronmeissner Nov 8, 2025
38f7c17
Revert "chore: final fix for gb200 release branch (#7338)"
abenn135 Nov 17, 2025
6cf8d84
Remove leftover Ubuntu 1604 logic carried over into gb200 packer temp…
abenn135 Nov 17, 2025
34f7697
feat: exact BOM versions (#7393)
abenn135 Nov 19, 2025
6b78d98
Add more explicit deps to the GB200 BOM.
abenn135 Dec 3, 2025
c1a8898
fix: add libnvidia-container1 to bom
abenn135 Dec 3, 2025
416b2f4
fix: allow downgrades. I don't know why we're installing ib packages …
abenn135 Dec 3, 2025
06c7c53
Merge changes from `main` into `release-gb200` (#7491)
keith-ms Dec 10, 2025
7c1dc52
Merge main changes into release-gb200 branch rolling up through Jan 2…
abenn135 Jan 28, 2026
61731d0
Abenn135/gb300 bom (#7748)
abenn135 Jan 28, 2026
321c28e
fix: copy over image builder files and configuration from main arm64 …
abenn135 Jan 28, 2026
4679e08
Downgrade GBx00 kernel to 6.14.0-1003. (#7769)
abenn135 Feb 4, 2026
5322115
feat: disable NPD for MAI VHD. (#7843)
abenn135 Feb 10, 2026
047b0e6
fix: make node-problem-detector.d directory before trying to touch th…
abenn135 Feb 11, 2026
80e86ff
Bond local NVMe drives together via software RAID. Run kubelet root t…
abenn135 Mar 18, 2026
27e79dc
Reassemble /dev/md0 correctly (#8130)
keith-ms Mar 20, 2026
9b12560
Restore incorrectly deleted file during cherry-pick
keith-ms May 14, 2026
3936e6e
Remove poorly merged stage and Windows configuration
keith-ms May 14, 2026
28cab47
Remove malformed pipeline definition
keith-ms May 14, 2026
7cdea46
Fix wrongly merged e2e file
keith-ms May 14, 2026
286f532
Remove extra apt-clean
keith-ms May 14, 2026
db3c66f
Restore incorrectly merged file
keith-ms May 14, 2026
b0de98d
Remove global APT setting
keith-ms May 14, 2026
8ed1305
Reorder file movement in packer_source file for GB200 feature flag
keith-ms May 14, 2026
f28b9e4
Fix kernel installation logic for GB* platform
keith-ms May 14, 2026
021b8b0
Remove reference to teleportd service file
keith-ms May 14, 2026
a76e7ec
Add missing files referenced in packer_source.sh
keith-ms May 14, 2026
352087a
Update package version for libcap2-bin
keith-ms May 14, 2026
dfc1db0
Increase disk size for GB200 image
keith-ms May 15, 2026
3e0d69b
Fix waagent script invocation
keith-ms May 15, 2026
8e9352f
Fix containerd overwrite problem
keith-ms May 15, 2026
30dedb6
fix: address GB200 review feedback in dependency install logic
Copilot May 15, 2026
332acf0
fix: restore GB200 NPD skip marker behavior
Copilot May 15, 2026
eeb56b8
chore: clarify NPD skip marker comment wording
Copilot May 15, 2026
2 changes: 1 addition & 1 deletion .pipelines/.vsts-vhd-builder-release.yaml
Original file line number Diff line number Diff line change
@@ -975,7 +975,7 @@ stages:
echo '##vso[task.setvariable variable=IMG_SKU]server-arm64'
echo '##vso[task.setvariable variable=IMG_VERSION]latest'
echo '##vso[task.setvariable variable=HYPERV_GENERATION]V2'
-echo '##vso[task.setvariable variable=AZURE_VM_SIZE]Standard_D16pds_v5'
+echo '##vso[task.setvariable variable=AZURE_VM_SIZE]Standard_D32pds_v5'
echo '##vso[task.setvariable variable=FEATURE_FLAGS]GB200'
echo '##vso[task.setvariable variable=ARCHITECTURE]ARM64'
echo '##vso[task.setvariable variable=ENABLE_FIPS]False'
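The `echo '##vso[task.setvariable ...]'` lines in this hunk are Azure Pipelines logging commands that set job variables. A minimal sketch of a helper that emits the same command shape follows; `set_pipeline_var` is a hypothetical name used for illustration and is not part of this PR.

```bash
# Hypothetical helper (not in the PR): print an Azure Pipelines logging
# command that sets a variable for later steps in the same job.
set_pipeline_var() {
  local name="$1" value="$2"
  printf '##vso[task.setvariable variable=%s]%s\n' "$name" "$value"
}

# Same shape as the AZURE_VM_SIZE lines in this hunk.
set_pipeline_var AZURE_VM_SIZE Standard_D32pds_v5
```

Wrapping the string in a function mainly guards against typos in the `##vso[...]` prefix when many variables are set in a row.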
4 changes: 3 additions & 1 deletion .pipelines/templates/.builder-release-template.yaml
@@ -106,6 +106,8 @@ steps:
BUILD_ID: $(Build.BuildId)
BUILD_DEFINITION_NAME: $(Build.DefinitionName)
UA_TOKEN: $(ua-token)
+LOCAL_DOCA_REPO_URL: $(LOCAL_DOCA_REPO_URL)
+CONTINUE_ON_LOCAL_REPO_DOWNLOAD_ERROR: $(CONTINUE_ON_LOCAL_REPO_DOWNLOAD_ERROR)

- task: AzureCLI@2
inputs:
@@ -350,7 +352,7 @@ steps:
TargetFolder: '$(Build.ArtifactStagingDirectory)'

- task: CopyFiles@2
-condition: and(eq(variables.OS_SKU, 'Ubuntu'), in(variables.OS_VERSION, '22.04', '24.04'), in(variables.FEATURE_FLAGS, 'None', 'cvm'))
+condition: and(eq(variables.OS_SKU, 'Ubuntu'), in(variables.OS_VERSION, '22.04', '24.04'), in(variables.FEATURE_FLAGS, 'None', 'cvm', 'GB200'))
displayName: Copy CIS Reports
inputs:
SourceFolder: '$(System.DefaultWorkingDirectory)'
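The updated `condition:` adds `GB200` to the feature-flag values for which CIS reports are copied. As a rough shell restatement of that predicate, assuming plain case-sensitive string comparison (the pipeline expression language may compare differently), one could write the following; `should_copy_cis_reports` is a hypothetical name.

```bash
# Hypothetical restatement of the "Copy CIS Reports" condition.
# Assumption: exact, case-sensitive matches on all three inputs.
should_copy_cis_reports() {
  local os_sku="$1" os_version="$2" feature_flags="$3"
  [ "$os_sku" = "Ubuntu" ] || return 1
  case "$os_version" in
    22.04|24.04) ;;
    *) return 1 ;;
  esac
  # The whole FEATURE_FLAGS value must equal one of the listed strings.
  case "$feature_flags" in
    None|cvm|GB200) return 0 ;;
    *) return 1 ;;
  esac
}
```

In this sketch a combined value such as `GB200,fips` does not match, because the comparison is against the whole string, not a substring.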
5 changes: 5 additions & 0 deletions packer.mk
@@ -12,8 +12,13 @@ build-packer: setup-golang generate-prefetch-scripts build-image-fetcher build-a
ifeq (${ARCHITECTURE},ARM64)
@echo "${MODE}: Building with Hyper-v generation 2 ARM64 VM"
ifeq (${OS_SKU},Ubuntu)
ifeq ($(findstring GB200,$(FEATURE_FLAGS)),GB200)
Comment thread
keith-ms marked this conversation as resolved.
@echo "Using packer template file vhd-image-builder-arm64-gb200.json"
@packer build -timestamp-ui -var-file=vhdbuilder/packer/settings.json vhdbuilder/packer/vhd-image-builder-arm64-gb200.json
else
@echo "Using packer template file vhd-image-builder-arm64-gen2.json"
@packer build -timestamp-ui -var-file=vhdbuilder/packer/settings.json vhdbuilder/packer/vhd-image-builder-arm64-gen2.json
endif
else ifeq (${OS_SKU},CBLMariner)
@echo "Using packer template file vhd-image-builder-mariner-arm64.json"
@packer build -timestamp-ui -var-file=vhdbuilder/packer/settings.json vhdbuilder/packer/vhd-image-builder-mariner-arm64.json
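The `findstring` branch selects the GB200 packer template whenever `FEATURE_FLAGS` contains the substring `GB200`. A sketch of the same selection in shell, covering only the ARM64 cases visible in this hunk, might look like this; `select_packer_template` is a hypothetical name, and returning an error for every other combination is an assumption of the sketch.

```bash
# Hypothetical restatement of the ARM64 template selection in packer.mk.
# Only the branches shown in the hunk are covered; anything else returns 1.
select_packer_template() {
  local architecture="$1" os_sku="$2" feature_flags="$3"
  local dir="vhdbuilder/packer"
  [ "$architecture" = "ARM64" ] || return 1
  case "$os_sku" in
    Ubuntu)
      case "$feature_flags" in
        # Substring match, like $(findstring GB200,$(FEATURE_FLAGS)).
        *GB200*) echo "${dir}/vhd-image-builder-arm64-gb200.json" ;;
        *)       echo "${dir}/vhd-image-builder-arm64-gen2.json" ;;
      esac
      ;;
    CBLMariner)
      echo "${dir}/vhd-image-builder-mariner-arm64.json"
      ;;
    *)
      return 1
      ;;
  esac
}
```

Because the match is a substring test, a flag list that merely contains `GB200` alongside other flags also selects the GB200 template.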
69 changes: 56 additions & 13 deletions parts/linux/cloud-init/artifacts/cse_config.sh
@@ -345,22 +345,27 @@ LimitNOFILE=1048576
EOF

mkdir -p /etc/containerd
# Remove in case this is an existing symlink
rm -f /etc/containerd/config.toml
if [ "${GPU_NODE}" = "true" ]; then
# Check VM tag directly to determine if GPU drivers should be skipped
export -f should_skip_nvidia_drivers
should_skip=$(should_skip_nvidia_drivers)
if [ "$?" -eq 0 ] && [ "${should_skip}" = "true" ]; then
echo "Generating non-GPU containerd config for GPU node due to VM tags"
echo "${CONTAINERD_CONFIG_NO_GPU_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT

if grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' /etc/containerd/config.toml 2>/dev/null; then
Contributor

Should there be an explicit GB200/GB300 check here as well?

Author

This happens at runtime, not build time, and I don't believe there is a reliable way to check for GB200/GB300 at that point. The feature flag isn't present at runtime. The SKU could perhaps be read from IMDS, but that risks failures whenever IMDS is unreachable.

Contributor

Fair. You could also consider writing a sentinel file at build time, such as /etc/aks/gpu-config-baked.marker, and checking for it here. Just an optional suggestion.

echo "NVIDIA containerd config already exists at /etc/containerd/config.toml, skipping generation"
else
# Remove in case this is an existing symlink or non-NVIDIA config
rm -f /etc/containerd/config.toml
if [ "${GPU_NODE}" = "true" ]; then
# Check VM tag directly to determine if GPU drivers should be skipped
export -f should_skip_nvidia_drivers
should_skip=$(should_skip_nvidia_drivers)
if [ "$?" -eq 0 ] && [ "${should_skip}" = "true" ]; then
echo "Generating non-GPU containerd config for GPU node due to VM tags"
echo "${CONTAINERD_CONFIG_NO_GPU_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT
else
echo "Generating GPU containerd config..."
echo "${CONTAINERD_CONFIG_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT
fi
else
echo "Generating GPU containerd config..."
echo "Generating containerd config..."
echo "${CONTAINERD_CONFIG_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT
fi
else
echo "Generating containerd config..."
echo "${CONTAINERD_CONFIG_CONTENT}" | base64 -d > /etc/containerd/config.toml || exit $ERR_FILE_WATCH_TIMEOUT
fi

export -f should_e2e_mock_azure_china_cloud
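The branch structure in this hunk can be restated as a small decision function. This is a sketch only: `decide_containerd_config` and its result strings are hypothetical, and it takes the skip decision as an argument instead of calling `should_skip_nvidia_drivers`.

```bash
# Hypothetical decision helper mirroring the branches in the hunk above.
# Prints one of: keep-existing, no-gpu, default.
decide_containerd_config() {
  local config_path="$1" gpu_node="$2" skip_drivers="$3"
  # A config that already points at the NVIDIA runtime is left alone.
  if grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' "$config_path" 2>/dev/null; then
    echo "keep-existing"
  elif [ "$gpu_node" = "true" ] && [ "$skip_drivers" = "true" ]; then
    # GPU node whose VM tags ask to skip the NVIDIA drivers.
    echo "no-gpu"
  else
    # Both remaining branches in the diff write CONTAINERD_CONFIG_CONTENT.
    echo "default"
  fi
}
```

Separating the decision from the file writes makes each branch checkable without touching `/etc/containerd`.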
@@ -634,6 +639,44 @@ ensurePodInfraContainerImage() {
rm -f ${POD_INFRA_CONTAINER_IMAGE_TAR}
}

validateKubeletNodeLabels() {
Contributor

Where is this used?

Contributor

It isn't currently used anywhere in the call path. While working on this update (30dedb6), I confirmed there are no references to validateKubeletNodeLabels in-tree.

Author

This isn't used. I can remove it, but it should probably be integrated, because a label over 63 characters will cause kubelet to fail to start.

    local labels="$1"
    local validated_labels=""
    local delimiter=""

    # Return empty if no labels provided
    if [ -z "$labels" ]; then
        echo "No labels found in KUBELET_NODE_LABELS"
        return 0
    fi

    # Split labels by comma and process each
    IFS=',' read -ra LABEL_ARRAY <<< "$labels"
    for label in "${LABEL_ARRAY[@]}"; do
        # Split each label into key and value
        # shellcheck disable=SC3010
        if [[ "$label" == *"="* ]]; then
            key="${label%%=*}"
            value="${label#*=}"

            # Check if key length exceeds 63 characters
            if [ ${#key} -gt 63 ]; then
                echo "Warning: Label key '$key' exceeds 63 characters, truncating to 63 characters" >&2
                key="${key:0:63}"
            fi

            # Rebuild the label with potentially truncated key
            validated_labels="${validated_labels}${delimiter}${key}=${value}"
        fi

        # Set delimiter for subsequent labels
        delimiter=","
    done

    # Update the global variable with validated labels
    KUBELET_NODE_LABELS="$validated_labels"
}

ensureKubelet() {
KUBELET_DEFAULT_FILE=/etc/default/kubelet
mkdir -p /etc/default
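Kubernetes limits a label value to 63 characters and the name segment of a label key to 63 characters, while an optional prefix before `/` may be up to 253 characters. `validateKubeletNodeLabels` in this diff measures the whole key, prefix included, and does not look at the value. If the function is integrated as discussed in the thread, a check along the following lines might be closer to those limits. It is a sketch under stated assumptions: `check_node_labels` is a hypothetical name, and it reports violations instead of truncating, since a truncated key or value is a different label.

```bash
# Hypothetical validator: returns 1 if any label in a comma-separated
# key=value list has a name segment or a value longer than 63 characters.
check_node_labels() {
  local rest="$1" label key value name rc=0
  while [ -n "$rest" ]; do
    # Peel off the next comma-separated label.
    label="${rest%%,*}"
    if [ "$label" = "$rest" ]; then
      rest=""
    else
      rest="${rest#*,}"
    fi
    [ -n "$label" ] || continue
    key="${label%%=*}"
    case "$label" in
      *=*) value="${label#*=}" ;;
      *)   value="" ;;
    esac
    # Only the part after the last "/" counts toward the 63-character limit.
    name="${key##*/}"
    if [ "${#name}" -gt 63 ]; then
      echo "label name '${name}' is longer than 63 characters" >&2
      rc=1
    fi
    if [ "${#value}" -gt 63 ]; then
      echo "value of label '${key}' is longer than 63 characters" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A caller could run `check_node_labels "${KUBELET_NODE_LABELS}"` before kubelet starts and decide whether to fail fast or only log.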
39 changes: 39 additions & 0 deletions parts/linux/cloud-init/artifacts/ubuntu/containerd-nvidia.toml
@@ -0,0 +1,39 @@
oom_score = -999
version = 2

[metrics]
address = "0.0.0.0:10257"

[plugins]

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "mcr.microsoft.com/oss/kubernetes/pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/bin/runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.untrusted]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.untrusted.options]
BinaryName = "/usr/bin/runc"

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".registry.headers]
X-Meta-Source-Client = ["azure/aks"]
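Since `cse_config.sh` decides whether to regenerate the config by inspecting the file on disk, a quick way to confirm which runtime a given config selects can be handy when debugging a node. The following is a minimal sketch, not part of the PR; `containerd_default_runtime` is a hypothetical name, and the pattern assumes the `default_runtime_name = "..."` form used in this file.

```bash
# Hypothetical helper: print the first default_runtime_name value found
# in a containerd config file (empty output if none is set).
containerd_default_runtime() {
  local config_path="$1"
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$config_path" | head -n 1
}
```

On a node built from this image, `containerd_default_runtime /etc/containerd/config.toml` would be expected to print `nvidia` if the file above was installed unchanged.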
2 changes: 2 additions & 0 deletions parts/linux/cloud-init/artifacts/ubuntu/doca.list
Contributor

Not going to block on this, though I'd prefer that we move the artifacts that are only uploaded to GB200/GB300 VHDs into a subfolder, maybe called graceblackwell or something.

@@ -0,0 +1,2 @@
deb [arch=arm64 signed-by=/etc/apt/keyrings/doca-net.pub] https://linux.mellanox.com/public/repo/doca/latest/ubuntu24.04/arm64-sbsa/ ./
deb [arch=amd64 signed-by=/etc/apt/keyrings/doca-net.pub] https://linux.mellanox.com/public/repo/doca/latest/ubuntu24.04/x86_64/ ./
Comment thread
keith-ms marked this conversation as resolved.
81 changes: 81 additions & 0 deletions parts/linux/cloud-init/artifacts/ubuntu/doca.pub
@@ -0,0 +1,81 @@
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v2.0.14 (GNU/Linux)

mQGiBFMEmE0RBACsz1qcFsYOs0LHy/pBR2ip0gnHYbZgLy00R2i7cELxmqGcESzp
6IfzIdwOX9oVsPI6NT/yvftp+BxALuD8UC52MLjdMJZ+1sXBZM4J5xnDmQMhIp0G
wCse8usM8Zad1WTKq+P0ip8Gd17WEpfwMQPKXg3npcF69zaz/ceeDavqjwCgofU0
rb8ui7cZs+c+7U+5mrXxmcMD/R/tV8tEykQFW7PKuZ9NvvRX2XFuQD9LZRW7v+Rg
ebC0GAM1ZSqgI7uNUL3ZLAMgxaURLZViqKPgiw8373uoayfrnccttoZ2prHdtB5O
ZPo9vp8wJYUd+Wug2c1nuzXQtTrs/wfeJDn/PfvlEIGlXYPphsBXGQd7MbMLtW7g
u6h/A/9lmSP1fFQflTRlO5j3jXrlFkW05lMlWVZD3H75obQxHlM7eGCgnUPABBMt
aoZDZDf5P9I3xinu9qhDi7Vbz7QOkWOGr2dHLUOMqIgoKz7zRcFtbAl65AcOuEKu
KpLE/R3mRjZ7vrCPud6euEKGpvMbdevDF7GeMG3fcvVlK1ivy7RVTWVsbGFub3gg
VGVjaG5vbG9naWVzIChNZWxsYW5veCBUZWNobm9sb2dpZXMgLSBTaWduaW5nIEtl
eSB2MikgPHN1cHBvcnRAbWVsbGFub3guY29tPohiBBMRAgAiBQJTBJhNAhsDBgsJ
CAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAKCRDF7YPiYiTAUFcAAJ49FBA3hy0P0gsZ
q/ZkAMrgXZaG9wCcDjMtZZETG5NEaIVg3GYqJcvI4AW5AQ0EUwSYTRAEANmBQ0WP
O3VsOrDH0VX+fa1nuKpTqyPFmrROtiI0Ux1dEsU/hpFJnFHtv+CW8ppUlMmjhw6U
olS3dqvO+fWxe1FMLVpp1BQLI6udM5j/P1IEDH7TmZD5trYFp4PxXagKO2nBeqjj
NydQckgREntGCOGPqheBRdopmlJSPlTptQavAAMFA/9BVSpmStx3BsS0z5NPSI/V
wJFeQiXFq8zDKbEVHFMjYWGqbhGWDPaLJWxxNLF1hdpbZSQCAeaESNLYG0iqXwb6
6O79BHpGeN0AWyy2J6FJpt0zwlCDfx7fgpFKMGzIxXWiTDNmKon241ojgM1iYC2o
arjropoA0dtG6noS2KJBYIhJBBgRAgAJBQJTBJhNAhsMAAoJEMXtg+JiJMBQzxUA
oJ+aJ2l6vt1S1tIKCLVtDMH8liOBAJ45EQ867jkf6f2Anihx9XJ0LLKZvw==
=QMd9
-----END PGP PUBLIC KEY BLOCK-----
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v2.0.14 (GNU/Linux)

mQGiBFIHkboRBADGcZ0FQvQl8frNzEZIep6D+KSZY/ps70+k3ZJ+wj2mvtGZSV9t
zeEUbte7ft5HzrIniB87j1Swp+mSJIomLTkOcQunoqCCHQkuPOEMi1urUmdjpyc7
nJjsQ63GLvH0DfmknGga4rCj3Kepn9mhJ9mqfS+/aXrz1ZP4Dk+alpi/RwCgplxo
94IruAMKoQCdJ3SmfqvszYcEAMUJ3qmCpYax4s/0XyX36emLiMioHZehq/QXdFmj
VmqqxL5QFmq9Yof8SwGBwpS8FS0VX8BTs7xAs5W7ZC7iGGo9uxuXZzeZ8vcwh6VX
OVmbtqLgXyPKqzHIDwJ8Q5Df0JQpRnCmQQaHbEcoOstSTP/3NHLFBIllPq7gqIpZ
9HQoBACvlwzvtabC9q1OAikXY5YKKbAtkmZYBa5I2qvfHV1bIRYPPHWW2shilX0N
Kz2pTR1ZlwEcz+CUhPtJgoWhkMu/Vl7NMeB0YzGmjQorHRj2mAvSbv/wvjeIMgbw
qRXIksGYiUSpTLtQYTfpJlNe0ZKzn6kHbqGUYZ92Jx2ki3gQqbQsTWVsbGFub3gg
VGVjaG5vbG9naWVzIDxzdXBwb3J0QG1lbGxhbm94LmNvbT6IYgQTEQIAIgUCUgeR
ugIbAwYLCQgHAwIGFQgCCQoLBBYCAwECHgECF4AACgkQAQSPp6nktkPsXwCfW2Rn
pgmC4zLTMBRo/hKsIvag2ToAnAtlzxpMAZGUQHBODfpGqx7MyHmUuQENBFIHkboQ
BADd2OqEdSDCB6KkgZ2BjURxpiDbZxEAEsTJOUBFMPSqdJN0GcqUon5Hc3yADDOF
ztdWf5XCKSp/loYvjTYM21Qq20g5EB2SU9FU6Eoq5vyU/HS3/c1wjiYv2rjMll62
kc4oqRkM/fp9crrjArssfqMQcQRVYBS3dYdmoVdpHEH68wADBQP/XPW9r3wwGvUr
7hlFskYrSC/8s3r7vB4/mcF6UMkM4xEaP3jq8HH0SLkLbcPTa1+C/5evhmLbT12f
dub/V0/JVT9YsxS3anmvefT6EXjUntYXDLPhhRJqUCnxYjf95FX5zxudB5gMEwLh
9pmRMgqMCDsIANVv7V77DagfaWNkhqSISQQYEQIACQUCUgeRugIbDAAKCRABBI+n
qeS2Q71kAJ45i6YdS9bZGR8tDI0NfneMiU32CwCfdje+fgX5gUtag5SshjxyMrgt
DgY=
=z9pR
-----END PGP PUBLIC KEY BLOCK-----
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1

mQENBFpbc0cBCADDST+ekKD1YJje77oDX94gRolmUlh0df4n6/xvE700M1vPAiTT
kU3WJcvwnuTZpyMGSsAQCXXQRJuQObnkPEvjVAPgh8fvghCXgVElcr6dqXu3EVze
iCkdYm08t/+FF3kg/P6VYPjgEM/GIFnKTz37LrQlUM4ArG0ENIYM9xjurnKWuV9r
JuckJcUsmZUS/D9QMM2fuurYOEWHrE8t+n2EcO4aoY2x0ogYce0vON539rJiskjz
OPhIB9G7ZFQabQnyxzEKiUUDyJsbe38XDT4eyjUR2mlHGgTY/WzGdDEtIKRBWsd3
TV3wXt42nF9YA3oieeaTbIluyywNnOj1vyT1ABEBAAG0VU1lbGxhbm94IFRlY2hu
b2xvZ2llcyAoTWVsbGFub3ggVGVjaG5vbG9naWVzIC0gU2lnbmluZyBLZXkgdjMp
IDxzdXBwb3J0QG1lbGxhbm94LmNvbT6JATcEEwEIACEFAlpbc0cCGwMFCwkIBwMF
FQoJCAsFFgIDAQACHgECF4AACgkQoCT28ObWooFXYwgAunwBFELGlwKonnmnbi4/
avUa8e0wRpww//DJjI0HQWjMk7oPLDbS50CVps1Mu0SxBAPYGtsFeSH6UMC6A0K4
yoxXICVl409vYkycNu/vq6eLTbM2Y0PFvBDzRAf3rJXL0ApLuUb57ARZvc7Np7LA
v8K53PdOJUEFns8Ipp+2puEVx5dfezm7LwRca6ohoLUEdI/PobmGUeNvO5dvfiix
LvSVw2A2awihB7dcs5cpo57VxBWPs7+sYBZ0+EUJbtQEiHAyPvKs29nMeaCIwPTd
88A5RrhsEJx+QWXuG6NA4rfehy5e9j1PW3XnC2fMl6w7gNLY5I8Vq6c2MJ73NZ6y
wLkBDQRaW3NHAQgAynkQ+mf4f5cdM4/bJuRWlPxxuN3CUxN9Q6B5B1/13p6tkydP
C7S4ro8H8sSlO5FbbxihfZLPTbFNrBkd///OQYMJW/slbtT6D9dYmCIeuHObMEMb
V+Bn1bWQId2vZgr0+m0Xe3K+KqhsylsrmC1ebShMnny/V+MlOQQt+L089BNiyCB4
70mhgM1NiJFv9EOQlXWWaMqWTxZGYkdOuFW0q8NnSGOqI5xjrAUxaHZ/1U3yPy0k
eAjX1AKJngaj86SvIzEefxq4oA2gZ8UFVO/qFH5OhfoovrEwudJEuIgGb76XOb9m
AoZlAqQLJniC97ld515ivBdSi4SZkaFbypnX4QARAQABiQEfBBgBCAAJBQJaW3NH
AhsMAAoJEKAk9vDm1qKBHhMIAJuGbb6S3nb2xAD3GjB8F2xNcZxWQ+Qz70DY5vV/
WhrJl7cknXMxsbWvQupuYk6LujZraG9YoD4csZ5o+k3s3BGKVUXdZdhjaHpcAa5F
X12ADLHca5mlmdCaaORYXQ+xHYRlOKas4I6LPpZ79BauVomEnPcv/bL0kGFzDvLr
K3RdQ1n/pbcWcxxSY3InphAnslLUg0PTAME6Yay5F7WrJsnZnXApUjOlZvlPIl2c
iplivN8o85eBKQXvYRg/c5iyc0koTmkM6OXNvUy0hV9z8WhhK9O+ApXwMUMf43DS
KOIg9RxhZFQoPXptaQZDLz89sWmZaiXsyBPJyjlmaTjwHGM=
=Iy5R
-----END PGP PUBLIC KEY BLOCK-----
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[Unit]
Requires=format-mount-nvme-root.service
After=format-mount-nvme-root.service
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
[Unit]
Description=Format NVMe local disk and mount Kubelet there
Requires=mnt.mount
After=mnt.mount

[Service]
Restart=on-failure
RemainAfterExit=yes
Type=oneshot
ExecStart=/bin/bash /opt/azure/containers/format-mount-nvme-root.sh

[Install]
WantedBy=multi-user.target
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
set -x

# Bind mount kubelet to local NVMe storage specifically on startup.
MOUNT_POINT="/mnt/aks"


KUBELET_MOUNT_POINT="${MOUNT_POINT}/kubelet"
KUBELET_DIR="/var/lib/kubelet"

mkdir -p "${MOUNT_POINT}"

SENTINEL_FILE="/opt/azure/containers/bind-sentinel"
if [ ! -e "${SENTINEL_FILE}" ]; then
    # Bond (via software RAID) and format the NVMe disks if that's not already done.
    if [ -e /dev/disk/azure/local/by-index/1 ] && [ ! -e /dev/md0 ]; then
        mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/disk/azure/local/by-index/1 /dev/disk/azure/local/by-index/2 /dev/disk/azure/local/by-index/3 /dev/disk/azure/local/by-index/4
        mkfs.ext4 -F /dev/md0
        # Save the RAID config so mdadm --assemble --scan works on subsequent boots.
        mdadm --detail --scan >> /etc/mdadm/mdadm.conf
    fi
    mount /dev/md0 "${MOUNT_POINT}"
    mv "${KUBELET_DIR}" "${KUBELET_MOUNT_POINT}"
    touch "${SENTINEL_FILE}"
else
    # On subsequent boots, reassemble the RAID array from superblocks.
    # Cannot use /dev/disk/azure/local/by-index/ paths here as the waagent
    # udev rules that create those symlinks may not have run yet.
    if [ ! -e /dev/md0 ]; then
        mdadm --assemble --scan
    fi
    mount /dev/md0 "${MOUNT_POINT}"
fi

# on every boot, bind mount the kubelet directory back to the expected
# location before kubelet itself may start.
mkdir -p "${KUBELET_DIR}"
mount --bind "${KUBELET_MOUNT_POINT}" "${KUBELET_DIR}"
chmod a+w "${KUBELET_DIR}"
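The script hard-codes four local NVMe devices in the `mdadm --create` call. If the device count ever differs between SKUs, the command line could be derived from whatever devices are passed in. The sketch below only prints the command (a dry run), so it needs neither root nor `mdadm`; `build_mdadm_create_cmd` is a hypothetical name, and requiring at least two devices is an assumption of the sketch.

```bash
# Hypothetical dry-run helper: print an mdadm RAID 0 create command for
# the given array and member devices, using the flags from the script above.
build_mdadm_create_cmd() {
  local array="$1"
  shift
  local count="$#"
  if [ "$count" -lt 2 ]; then
    echo "need at least two devices, got ${count}" >&2
    return 1
  fi
  printf 'mdadm --create --verbose %s --level=0 --raid-devices=%d' "$array" "$count"
  # Append each member device, separated by single spaces.
  printf ' %s' "$@"
  printf '\n'
}

# Example with the four by-index paths used in the script.
build_mdadm_create_cmd /dev/md0 \
  /dev/disk/azure/local/by-index/1 /dev/disk/azure/local/by-index/2 \
  /dev/disk/azure/local/by-index/3 /dev/disk/azure/local/by-index/4
```

Printing the command first also gives a log line that shows exactly which devices were bonded on first boot.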
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
options nvidia NVreg_RestrictProfilingToAdminUsers=0
options nvidia NVreg_CreateImexChannel0=1
options nvidia NVreg_CoherentGPUMemoryMode=driver
options nvidia NVreg_RegistryDwords="RMBug5172204War=4"
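To review which parameters a modprobe configuration file sets for a module, the `options` lines can be listed one parameter per line. This is a small sketch, not part of the PR: `list_module_options` is a hypothetical name, and it assumes parameter values contain no whitespace, which holds for the four lines above.

```bash
# Hypothetical helper: print each KEY=VALUE parameter set for a module
# by "options <module> ..." lines in a modprobe.d-style file.
list_module_options() {
  local module="$1" conf_path="$2"
  awk -v mod="$module" '
    $1 == "options" && $2 == mod {
      for (i = 3; i <= NF; i++) print $i
    }
  ' "$conf_path"
}
```

Running it against the file above with module `nvidia` would list the four `NVreg_*` settings, including the quoted `NVreg_RegistryDwords` value as written.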
2 changes: 2 additions & 0 deletions parts/linux/cloud-init/artifacts/ubuntu/nvidia-2404.list
@@ -0,0 +1,2 @@
deb [arch=amd64 signed-by=/etc/apt/keyrings/nvidia.pub] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 /
deb [arch=arm64 signed-by=/etc/apt/keyrings/nvidia.pub] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa /
29 changes: 29 additions & 0 deletions parts/linux/cloud-init/artifacts/ubuntu/nvidia.pub
@@ -0,0 +1,29 @@
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v2.0.22 (GNU/Linux)

mQINBGJYmlEBEAC6nJmeqByeReM+MSy4palACCnfOg4pOxffrrkldxz4jrDOZNK4
q8KG+ZbXrkdP0e9qTFRvZzN+A6Jw3ySfoiKXRBw5l2Zp81AYkghV641OpWNjZOyL
syKEtST9LR1ttHv1ZI71pj8NVG/EnpimZPOblEJ1OpibJJCXLrbn+qcJ8JNuGTSK
6v2aLBmhR8VR/aSJpmkg7fFjcGklweTI8+Ibj72HuY9JRD/+dtUoSh7z037mWo56
ee02lPFRD0pHOEAlLSXxFO/SDqRVMhcgHk0a8roCF+9h5Ni7ZUyxlGK/uHkqN7ED
/U/ATpGKgvk4t23eTpdRC8FXAlBZQyf/xnhQXsyF/z7+RV5CL0o1zk1LKgo+5K32
5ka5uZb6JSIrEPUaCPEMXu6EEY8zSFnCrRS/Vjkfvc9ViYZWzJ387WTjAhMdS7wd
PmdDWw2ASGUP4FrfCireSZiFX+ZAOspKpZdh0P5iR5XSx14XDt3jNK2EQQboaJAD
uqksItatOEYNu4JsCbc24roJvJtGhpjTnq1/dyoy6K433afU0DS2ZPLthLpGqeyK
MKNY7a2WjxhRmCSu5Zok/fGKcO62XF8a3eSj4NzCRv8LM6mG1Oekz6Zz+tdxHg19
ufHO0et7AKE5q+5VjE438Xpl4UWbM/Voj6VPJ9uzywDcnZXpeOqeTQh2pQARAQAB
tCBjdWRhdG9vbHMgPGN1ZGF0b29sc0BudmlkaWEuY29tPokCOQQTAQIAIwUCYlia
UQIbAwcLCQgHAwIBBhUIAgkKCwQWAgMBAh4BAheAAAoJEKS0aZY7+GPM1y4QALKh
BqSozrYbe341Qu7SyxHQgjRCGi4YhI3bHCMj5F6vEOHnwiFH6YmFkxCYtqcGjca6
iw7cCYMow/hgKLAPwkwSJ84EYpGLWx62+20rMM4OuZwauSUcY/kE2WgnQ74zbh3+
MHs56zntJFfJ9G+NYidvwDWeZn5HIzR4CtxaxRgpiykg0s3ps6X0U+vuVcLnutBF
7r81astvlVQERFbce/6KqHK+yj843Qrhb3JEolUoOETK06nD25bVtnAxe0QEyA90
9MpRNLfR6BdjPpxqhphDcMOhJfyubAroQUxG/7S+Yw+mtEqHrL/dz9iEYqodYiSo
zfi0b+HFI59sRkTfOBDBwb3kcARExwnvLJmqijiVqWkoJ3H67oA0XJN2nelucw+A
Hb+Jt9BWjyzKWlLFDnVHdGicyRJ0I8yqi32w8hGeXmu3tU58VWJrkXEXadBftmci
pemb6oZ/r5SCkW6kxr2PsNWcJoebUdynyOQGbVwpMtJAnjOYp0ObKOANbcIg+tsi
kyCIO5TiY3ADbBDPCeZK8xdcugXoW5WFwACGC0z+Cn0mtw8z3VGIPAMSCYmLusgW
t2+EpikwrP2inNp5Pc+YdczRAsa4s30Jpyv/UHEG5P9GKnvofaxJgnU56lJIRPzF
iCUGy6cVI0Fq777X/ME1K6A/bzZ4vRYNx8rUmVE5
=DO7z
-----END PGP PUBLIC KEY BLOCK-----