3.7.2: Lustre kmod modprobe breaks custom AMI based on RHEL8 #5913

Open
@nyetsche

My organization requires using RHEL8 (a supported OS) from the privately shared, Red Hat-licensed base AMI. We then use pcluster build-image to make it ready for ParallelCluster.

The pcluster build-image task has started failing for us recently. The initial AMI starts with RHEL 8.8 (I also tried 8.7), but it is updated to RHEL 8.9 via the redhat-release RPM during the build:

EVENTS  1700589295187   Step UpdateOS   1700589294393
EVENTS  1700589295187   ExecuteBash: STARTED EXECUTION  1700589294395

[...]

EVENTS  1700589326128   Stdout:  redhat-release                           x86_64  8.9-0.1.el8                    rhel-8-baseos-rhui-rpms       45 k 1700589326002

That comes from the UpdateOS section of the playbook:

      - name: UpdateOS
        action: ExecuteBash
        inputs:
          commands:
            - |
              set -v
              OS='{{ build.OperatingSystemName.outputs.stdout }}'
              PLATFORM='{{ build.PlatformName.outputs.stdout }}'

              if [[ ${!PLATFORM} == RHEL ]]; then
                yum -y update
[...]

The yum -y update brings every package to its most recent version, including redhat-release and kernel-*.

The failure occurs later, when the cookbook runs the kernel_module 'lnet' resource: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.7.2/cookbooks/aws-parallelcluster-environment/resources/lustre/partial/_install_lustre_centos_redhat.rb#L36

EVENTS  1700590745016   Stdout: [2023-11-21T18:19:01+00:00] INFO: dnf_package[kmod-lustre-client, lustre-client, dracut] installed ["kmod-lustre-client", "lustre-client", nil] at ["0:2.12.8-1.fsx7.el8.x86_64", "0:2.12.8-1.fsx7.el8.x86_64", nil]    1700590741488
EVENTS  1700590745016   Stdout:       - install version 0:2.12.8-1.fsx7.el8.x86_64 of package kmod-lustre-client    1700590741488
EVENTS  1700590745016   Stdout:       - install version 0:2.12.8-1.fsx7.el8.x86_64 of package lustre-client 1700590741488
EVENTS  1700590745016   Stdout:     * kernel_module[lnet] action install[2023-11-21T18:19:04+00:00] INFO: Processing kernel_module[lnet] action install ((eval) line 36)    1700590744740
EVENTS  1700590745016   Stdout:       ================================================================================  1700590744770
EVENTS  1700590745016   Stdout:       Error executing action `install` on resource 'kernel_module[lnet]'    1700590744770
EVENTS  1700590745016   Stdout:       ================================================================================  1700590744770
EVENTS  1700590745016   Stdout:       Mixlib::ShellOut::ShellCommandFailed  1700590744770
EVENTS  1700590745016   Stdout:       ------------------------------------  1700590744770
EVENTS  1700590745016   Stdout:       Expected process to exit with [0], but received '1'   1700590744770
EVENTS  1700590745016   Stdout:       ---- Begin output of modprobe lnet ----   1700590744770
EVENTS  1700590745016   Stdout:       STDOUT:   1700590744770
EVENTS  1700590745016   Stdout:       STDERR: modprobe: FATAL: Module lnet not found in directory /lib/modules/4.18.0-513.5.1.el8_9.x86_64  1700590744770

That is, there's no lnet module in /lib/modules/4.18.0-513.5.1.el8_9.x86_64.
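
A quick way to confirm this on a build instance is to look for the module file under a given kernel's directory. This is only a minimal sketch: `has_kmod` is a hypothetical helper (not part of ParallelCluster or the cookbook), and its optional third argument, the modules root, exists only so the check can be tried against a scratch directory.

```shell
# Hypothetical helper: succeed if a module file named <mod>.ko (optionally
# compressed, e.g. .ko.xz) exists anywhere under <root>/<kernel-version>.
# Usage: has_kmod <mod> <kernel-version> [modules-root]
has_kmod() {
  mod="$1"
  kver="$2"
  root="${3:-/lib/modules}"
  # No directory for this kernel means no modules for it at all.
  [ -d "$root/$kver" ] || return 1
  # -L follows symlinks, so modules linked in from another kernel's
  # directory (weak-updates style) are also counted.
  find -L "$root/$kver" -type f -name "${mod}.ko*" 2>/dev/null | grep -q .
}

# Example (not executed here):
#   has_kmod lnet "$(uname -r)" || echo "lnet missing for $(uname -r)"
```

Running it against both the pre-update and post-update kernel versions would show whether the kmod-lustre-client package shipped a module only for the older kernel.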

The kernel compatibility matrix in this document https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html indeed doesn't mention 4.18.0-513, and the upstream at https://downloads.whamcloud.com/public/lustre/latest-2.12-release/el8/client/ doesn't include it either. So I realize this is actually a Lustre packaging issue, but I'm not sure how to get in touch with the FSx for Lustre team. Even so, it'd be great to have a workaround. Right now we can't use new AMIs for compute nodes.

I'm unsure of the best way forward here - exclude redhat-release* and/or kernel-* from the build-image process? Ignore errors from modprobe lnet?
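
To make the first option concrete, here is a rough sketch of what an update step that holds back those packages could look like. It relies on yum's `--exclude` option; `yum_update_cmd` is a hypothetical helper that only prints the command (a dry run), and whether ParallelCluster's UpdateOS step can or should be changed this way is an open question, not something confirmed by the project.

```shell
# Hypothetical helper: print a "yum -y update" command that skips every
# package glob passed as an argument. Nothing is executed.
yum_update_cmd() {
  cmd="yum -y update"
  for pat in "$@"; do
    # Quote each glob so the shell would pass it to yum unexpanded.
    cmd="$cmd --exclude='$pat'"
  done
  printf '%s\n' "$cmd"
}

# Dry run for the two package families mentioned above.
yum_update_cmd 'kernel*' 'redhat-release*'
```

Holding the kernel back would keep it on a version that the installed kmod-lustre-client was built for, at the cost of not picking up newer kernel fixes during the image build.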
