Description
My organization requires using RHEL8 (a supported OS) from the privately shared RedHat licensed base. We then use pcluster build-image to make it ready for ParallelCluster.
The pcluster build-image task has started failing for us recently. The initial AMI starts with RHEL 8.8 (I also tried 8.7), but is updated to RHEL 8.9 by the redhat-release RPM during the build:
EVENTS 1700589295187 Step UpdateOS 1700589294393
EVENTS 1700589295187 ExecuteBash: STARTED EXECUTION 1700589294395
[...]
EVENTS 1700589326128 Stdout: redhat-release x86_64 8.9-0.1.el8 rhel-8-baseos-rhui-rpms 45 k 1700589326002
That comes from the UpdateOS section of the playbook:
- name: UpdateOS
  action: ExecuteBash
  inputs:
    commands:
      - |
        set -v
        OS='{{ build.OperatingSystemName.outputs.stdout }}'
        PLATFORM='{{ build.PlatformName.outputs.stdout }}'

        if [[ ${!PLATFORM} == RHEL ]]; then
          yum -y update
[...]
The yum -y update brings every package to its most recent version, including redhat-release and kernel-*.
The failure occurs later, in the kernel_module 'lnet' resource: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.7.2/cookbooks/aws-parallelcluster-environment/resources/lustre/partial/_install_lustre_centos_redhat.rb#L36
EVENTS 1700590745016 Stdout: [2023-11-21T18:19:01+00:00] INFO: dnf_package[kmod-lustre-client, lustre-client, dracut] installed ["kmod-lustre-client", "lustre-client", nil] at ["0:2.12.8-1.fsx7.el8.x86_64", "0:2.12.8-1.fsx7.el8.x86_64", nil] 1700590741488
EVENTS 1700590745016 Stdout: - install version 0:2.12.8-1.fsx7.el8.x86_64 of package kmod-lustre-client 1700590741488
EVENTS 1700590745016 Stdout: - install version 0:2.12.8-1.fsx7.el8.x86_64 of package lustre-client 1700590741488
EVENTS 1700590745016 Stdout: * kernel_module[lnet] action install[2023-11-21T18:19:04+00:00] INFO: Processing kernel_module[lnet] action install ((eval) line 36) 1700590744740
EVENTS 1700590745016 Stdout: ================================================================================ 1700590744770
EVENTS 1700590745016 Stdout: Error executing action `install` on resource 'kernel_module[lnet]' 1700590744770
EVENTS 1700590745016 Stdout: ================================================================================ 1700590744770
EVENTS 1700590745016 Stdout: Mixlib::ShellOut::ShellCommandFailed 1700590744770
EVENTS 1700590745016 Stdout: ------------------------------------ 1700590744770
EVENTS 1700590745016 Stdout: Expected process to exit with [0], but received '1' 1700590744770
EVENTS 1700590745016 Stdout: ---- Begin output of modprobe lnet ---- 1700590744770
EVENTS 1700590745016 Stdout: STDOUT: 1700590744770
EVENTS 1700590745016 Stdout: STDERR: modprobe: FATAL: Module lnet not found in directory /lib/modules/4.18.0-513.5.1.el8_9.x86_64 1700590744770
That is, there's no lnet module in /lib/modules/4.18.0-513.5.1.el8_9.x86_64.
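To confirm this on a build instance, a check along the following lines could be used. This is only a sketch: module_available is a hypothetical helper, not part of ParallelCluster or the cookbook, and the optional third argument (an alternate modules root) exists only to make it easy to try out. It looks for a module file, compressed or not, or a weak-updates symlink under the directory for a given kernel release.

```shell
# Sketch: report whether a module file exists for a given kernel release.
# module_available is a hypothetical helper name.
#   $1 = module name (e.g. lnet)
#   $2 = kernel release (defaults to the running kernel)
#   $3 = modules root (defaults to /lib/modules)
module_available() {
  local module="$1"
  local kver="${2:-$(uname -r)}"
  local root="${3:-/lib/modules}"

  # No directory for this kernel release means no modules at all.
  [ -d "${root}/${kver}" ] || return 1

  # Accept plain and compressed module files (.ko, .ko.xz, ...) as well as
  # symlinks, since weak-updates entries are symlinks to another kernel's module.
  find "${root}/${kver}" \( -type f -o -type l \) -name "${module}.ko*" -print -quit \
    | grep -q .
}

# Example:
#   module_available lnet 4.18.0-513.5.1.el8_9.x86_64 || echo "lnet missing"
```

A check like this only says whether a file is present; it does not prove that the module would load on that kernel.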
The kernel compatibility matrix in this document https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html indeed doesn't mention 4.18.0-513, and the upstream at https://downloads.whamcloud.com/public/lustre/latest-2.12-release/el8/client/ doesn't include it either. So I realize this is actually a Lustre packaging issue, but I'm not sure how to get in touch with the FSx for Lustre team. Even so, it'd be great to have a workaround. Right now we can't use new AMIs for compute nodes.
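A guard like the one below could flag an unsupported kernel before the Lustre step runs. It is a minimal sketch: kernel_supported and SUPPORTED_KERNEL_PREFIXES are hypothetical names, and the two prefixes listed are placeholders corresponding to the RHEL 8.7 and 8.8 kernel series. The real list would have to be copied from the compatibility matrix linked above.

```shell
# Sketch: check a kernel release against a list of supported version prefixes.
# SUPPORTED_KERNEL_PREFIXES is an illustrative placeholder (space-separated);
# replace it with the kernels listed in the compatibility matrix.
SUPPORTED_KERNEL_PREFIXES="4.18.0-425 4.18.0-477"

kernel_supported() {
  local kver="${1:-$(uname -r)}"
  local prefix
  for prefix in $SUPPORTED_KERNEL_PREFIXES; do
    # Match the prefix exactly or followed by ".", so that
    # 4.18.0-477 does not accidentally match 4.18.0-4770.
    case "$kver" in
      "${prefix}"|"${prefix}".*) return 0 ;;
    esac
  done
  return 1
}

# Example:
#   kernel_supported 4.18.0-513.5.1.el8_9.x86_64 || echo "kernel not in matrix"
```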
I'm unsure of the best way forward here - blacklist redhat-release* and/or kernel-* from the build-image process? Ignore errors from modprobe lnet?
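For the first option, the exclusion could in principle be expressed with yum's --exclude flag. The sketch below only builds and prints the command; build_update_cmd is a hypothetical helper, and whether build-image allows the UpdateOS step to be changed this way is an open question, not something established here.

```shell
# Sketch: build a "yum -y update" command that skips the given package globs.
# build_update_cmd is a hypothetical helper; it prints the command and does
# not run it.
build_update_cmd() {
  local cmd="yum -y update"
  local pattern
  for pattern in "$@"; do
    # Single-quote each glob so the shell passes it to yum unexpanded.
    cmd="${cmd} --exclude='${pattern}'"
  done
  printf '%s\n' "$cmd"
}

# Example:
#   build_update_cmd 'redhat-release*' 'kernel*'
```

The example uses kernel* rather than kernel-* so that the bare kernel package is covered as well. Holding back kernel updates also means missing kernel security fixes until a matching Lustre client is published.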