Ensure that libcuda.so is in the ldcache #947

Merged
elezar merged 2 commits into NVIDIA:main from ensure-libcuda.so-in-ldcache
Jun 24, 2025

Conversation

Member

@elezar elezar commented Feb 27, 2025

These changes add a hook to create soname symlinks (e.g. libcuda.so.1 -> libcuda.so.RM_VERSION) to ensure that the libcuda.so -> libcuda.so.1 -> libcuda.so.RM_VERSION symlink chain exists when the ldcache is updated. This allows libcuda.so to be present in the ldcache when ldconfig is run.

Note that since we're adding a new hook, a generating client such as the k8s-device-plugin or the k8s-dra-driver-gpu must be used with an nvidia-cdi-hook binary that is recent enough to include it.

See https://nvbugspro.nvidia.com/bug/5076022
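As an illustration of the symlink chain described above, here is a minimal Go sketch that builds libcuda.so -> libcuda.so.1 -> libcuda.so.RM_VERSION in a temporary directory and resolves it. The helper names and the version string are made up for the example; this is not the code added by the PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createChain writes a placeholder "real" library file and the two symlinks
// that form the chain libcuda.so -> libcuda.so.1 -> libcuda.so.<version>.
func createChain(dir, version string) error {
	realName := "libcuda.so." + version
	if err := os.WriteFile(filepath.Join(dir, realName), []byte("placeholder"), 0o644); err != nil {
		return err
	}
	// SONAME symlink: libcuda.so.1 -> libcuda.so.<version>
	if err := os.Symlink(realName, filepath.Join(dir, "libcuda.so.1")); err != nil {
		return err
	}
	// .so symlink: libcuda.so -> libcuda.so.1
	return os.Symlink("libcuda.so.1", filepath.Join(dir, "libcuda.so"))
}

// resolveDemo builds the chain in a temporary directory and returns the base
// name of the file that libcuda.so ultimately resolves to.
func resolveDemo(version string) (string, error) {
	dir, err := os.MkdirTemp("", "soname-chain")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	if err := createChain(dir, version); err != nil {
		return "", err
	}
	target, err := filepath.EvalSymlinks(filepath.Join(dir, "libcuda.so"))
	if err != nil {
		return "", err
	}
	return filepath.Base(target), nil
}

func main() {
	// "570.00" is an arbitrary example version, not a value from this PR.
	name, err := resolveDemo("570.00")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("libcuda.so resolves to", name)
}
```

If the middle link (libcuda.so.1) is missing, resolving libcuda.so fails, which is the situation the new hook is meant to prevent.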

@ArangoGutierrez
Collaborator

/ok to test

@Copilot Copilot AI left a comment

PR Overview

This PR adds an end-to-end test to verify that the libcuda symlink chain is properly present in the ldcache when running via Docker with NVIDIA runtime. Key changes include:

  • Addition of the "strings" import to support string splitting.
  • New test block that pulls an Ubuntu image and runs a container to inspect the ldcache output for libcuda.
  • Parsing and validation of the ldcache output to ensure both "libcuda.so" and "libcuda.so.1" are present.
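The parsing step in the last bullet could look roughly like the sketch below. It is not the test from the PR: the function names and the sample output are assumptions, and the sample only mimics the general shape of `ldconfig -p` output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cachedLibraries collects the library names listed in `ldconfig -p` style
// output. Entry lines have the form "<name> (<flags>) => <path>".
func cachedLibraries(output string) map[string]bool {
	libs := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.Contains(line, "=>") {
			continue // skip the summary header and blank lines
		}
		fields := strings.Fields(line)
		if len(fields) > 0 {
			libs[fields[0]] = true
		}
	}
	return libs
}

// hasLibcudaEntries reports whether both libcuda.so and libcuda.so.1 are
// present in the given ldcache listing.
func hasLibcudaEntries(output string) bool {
	libs := cachedLibraries(output)
	return libs["libcuda.so"] && libs["libcuda.so.1"]
}

// sample is illustrative output, not captured from a real container.
const sample = "2 libs found in cache `/etc/ld.so.cache'\n" +
	"\tlibcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1\n" +
	"\tlibcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so\n"

func main() {
	fmt.Println(hasLibcudaEntries(sample))
}
```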

Reviewed Changes

File | Description
tests/e2e/nvidia-container-toolkit_test.go | Added an end-to-end test to validate the presence of libcuda symlink entries in the ldcache

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch from 6e7e1a8 to 44e6630 Compare February 27, 2025 13:25

// Create the 'create-soname-symlinks' command
c := cli.Command{
Name: "create-soname-symlinks",
Member Author

@klueska I elected to add a new hook entirely instead of modifying the existing update-ldcache. This is in keeping with "purpose-built hooks" and also means that the hook name can be used to indicate the intent.
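For readers unfamiliar with how a generating client would reference a purpose-built hook, here is a hedged Go sketch of a hook entry that invokes the new command. The command name and the --folder flag come from this PR; the struct, the binary path, the example folder, and the createContainer stage are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hook mirrors the hookName/path/args fields of a CDI hook entry.
type hook struct {
	HookName string   `json:"hookName"`
	Path     string   `json:"path"`
	Args     []string `json:"args"`
}

// newCreateSonameSymlinksHook builds an entry that runs the
// create-soname-symlinks command with one --folder flag per folder.
func newCreateSonameSymlinksHook(hookPath string, folders []string) hook {
	args := []string{"nvidia-cdi-hook", "create-soname-symlinks"}
	for _, f := range folders {
		args = append(args, "--folder", f)
	}
	return hook{
		HookName: "createContainer", // assumed lifecycle stage
		Path:     hookPath,
		Args:     args,
	}
}

func main() {
	// Both paths below are placeholders.
	h := newCreateSonameSymlinksHook("/usr/bin/nvidia-cdi-hook", []string{"/usr/lib/x86_64-linux-gnu"})
	out, err := json.MarshalIndent(h, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the intent is carried by the command name, a spec that contains this entry fails visibly with an older nvidia-cdi-hook binary that does not know the command.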

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

lgtm, just one nit

@elezar elezar linked an issue Feb 27, 2025 that may be closed by this pull request
@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch from 44e6630 to e43da10 Compare February 27, 2025 13:48
@elezar elezar added the must-backport The changes in PR need to be backported to at least one stable release branch. label Feb 27, 2025
@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch 3 times, most recently from daadba9 to 9daa179 Compare February 28, 2025 12:45
@elezar elezar requested review from klueska and jgehrcke February 28, 2025 12:52
ArangoGutierrez previously approved these changes Mar 5, 2025
Contributor

@cdesiniotis cdesiniotis left a comment

Left one clarifying question, but not a blocker.

"-N",
)
// Explicitly specific the directories to add.
args = append(args, dirs...)
Contributor

Question -- does this create .so symlinks for all libraries present in the specified directories? Does this differ from the behavior of the legacy libnvidia-container implementation, which IIRC would only create the .so symlinks for a small list of libraries (like libcuda.so)?

Member Author

This is not the .so symlinks, but the SONAME symlinks i.e. libcuda.so.1 -> libcuda.so.RM_VERSION in the case of libcuda. The .so symlinks are created using the "standard" create-symlinks hook.
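To make the distinction concrete: the name of a SONAME symlink is taken from the DT_SONAME entry recorded in the library itself, which is what ldconfig reads. A small sketch using Go's standard debug/elf package shows how that entry can be inspected; the helper name is hypothetical and this is not part of the PR.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// sonameOf returns the DT_SONAME entry of the ELF shared library at path,
// or an empty string if the library does not declare one.
func sonameOf(path string) (string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	names, err := f.DynString(elf.DT_SONAME)
	if err != nil {
		return "", err
	}
	if len(names) == 0 {
		return "", nil
	}
	return names[0], nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: soname <path-to-shared-library>")
		return
	}
	soname, err := sonameOf(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("SONAME:", soname)
}
```

Run against a driver library such as libcuda.so.RM_VERSION, this would be expected to report libcuda.so.1, the link name the new hook ensures exists.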

@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch from 9daa179 to 035abe0 Compare March 6, 2025 08:30
copy-pr-bot bot commented Mar 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar removed the must-backport The changes in PR need to be backported to at least one stable release branch. label Mar 6, 2025
@elezar
Member Author

elezar commented Mar 6, 2025

@cdesiniotis I am removing the must-backport label for this. Although this was triggered by a customer request, there is a workaround available and I would rather not backport another new hook to the release-1.17 branch.

@jgehrcke

jgehrcke commented Mar 6, 2025

and I would rather not backport another new hook to the release-1.17 branch.

Sounds good.

if hostDriverVersion == "" {
m.logger.Debugf("Host driver version not specified")
return "", nil
}
-if !containerRoot.hasPath(cudaCompatPath) {
+if !containerRoot.HasPath(cudaCompatPath) {

I have picked a more or less arbitrary place in the code to put the following comment. Having more expressive variable names would make review for me easier. You can postpone looking at this comment until after merge, that's fine with me! In the spirit of learning what we do here however I'd appreciate an answer at some point :)

In my software engineering life I have always deeply cared about file system terminology and operations, and about making related code readable and self-expressive. For example,

  • I like to distinguish a "relative path" from an "absolute path" via variable name (if possible)
  • I like to distinguish a file object from a file path via variable name
  • I like to distinguish a path to a file from a path to a directory (expressing intent, of course these are technically the same)
  • I like to call things a "base name" for expressing intent as well (when there are no path separators, and this is supposed to be a "file name" or "directory name").

What is containerRoot in canonical unix file path terminology?

  • Is it guaranteed to point to a directory?
  • Is it an absolute path?
  • Is it always just /?

What are some properties about cudaCompatPath that we know/guarantee, that we could also express in the variable name or in the type?

  • Is it always a relative path (not starting with `/`)?
  • Is it always a path to a directory?

Member Author

Thanks for calling this out. I think we could do a pass through this and improve readability greatly. There is no rush on this particular PR (I've elected to hold the changes back from the v1.17.5 release) and as such, I think we can address these concerns here instead of as a follow-up.

The containerRootDir is the absolute path on the host filesystem to the root (/) of the container filesystem. It is specified in the OCI Runtime Specification as Root which informs the slightly ambiguous name. The containerRoot variable is the typed representation of this directory that allows us to attach helper functions such as HasPath / Resolve / Glob to it.

The cudaCompatPath is the absolute path to the directory containing the CUDA compat libraries in the container (if it exists). It is defined as a constant with value /usr/local/cuda/compat. That is to say, it is always an absolute path to a directory in the container, and we're confirming that this exists, but have to calculate the path to this folder on the host filesystem to do this check.

I think we can address these concerns here instead of as a follow-up.

Thank you for that!

The containerRootDir is the absolute path on the host filesystem to the root (/) of the container filesystem

Is the container filesystem always accessible from within the host filesystem?

It is specified in the OCI Runtime Specification as Root which informs the slightly ambiguous name.

gotcha. Thanks for that background.

The cudaCompatPath is the absolute path to the directory containing the CUDA compat libraries in the container (if it exists).

Thank you for that precision. Now that is quickly and easily understandable.

but have to calculate the path to this folder on the host filesystem to do this check

meaning that it's always mounted into the container, and never baked into it?

Member Author

meaning that it's always mounted into the container, and never baked into it?

No, I think there's a misunderstanding. The folder is baked into the container, but the hook is running on the host (in the Container's mount namespace). This means that we need to calculate the host path to the /usr/local/cuda/compat folder to check if it exists.
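The host-path calculation described here can be sketched as follows. This is a simplified stand-in, not the toolkit's ContainerRoot type: the type and method names are hypothetical, and a plain path join like this does not resolve symlinks inside the container filesystem, which a real implementation would need to handle.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// containerRootDirPath is a hypothetical typed wrapper around the absolute
// host path at which the container's root filesystem is located.
type containerRootDirPath string

// cudaCompatPath is the in-container path mentioned in the discussion.
const cudaCompatPath = "/usr/local/cuda/compat"

// hostPath maps a path inside the container to the matching host path.
// Cleaning against "/" first keeps ".." components from climbing above the
// container root.
func (r containerRootDirPath) hostPath(containerPath string) string {
	return filepath.Join(string(r), filepath.Join("/", containerPath))
}

// hasPath reports whether containerPath exists below the container root.
func (r containerRootDirPath) hasPath(containerPath string) bool {
	_, err := os.Lstat(r.hostPath(containerPath))
	return err == nil
}

// demo checks for the compat directory before and after creating it in a
// temporary directory that stands in for a container root.
func demo() (before, after bool, err error) {
	dir, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	root := containerRootDirPath(dir)
	before = root.hasPath(cudaCompatPath)
	if err := os.MkdirAll(root.hostPath(cudaCompatPath), 0o755); err != nil {
		return before, false, err
	}
	return before, root.hasPath(cudaCompatPath), nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("compat dir present before:", before, "after:", after)
}
```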

)

// A ContainerRoot represents the root filesystem of a container.
type ContainerRoot string

Oh, it's just a string.

Relates to my previous comment: I'd love to see the intent expressed in the variable name: path to a directory?

And this is probably really just my lack of knowledge: I wonder: when is the container's root file system not located at just /?

Member Author

I have added some context above. Let me know if it's not clear and I can add more information.

@jgehrcke jgehrcke left a comment

Thanks for all the work! I understand very little but you have my full emotional support! And I know you will repair everything you destroy. In that sense: approved with honor. 🚀

@elezar elezar added this to the v1.18.0 milestone Mar 19, 2025
@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch 3 times, most recently from 7229201 to bdfaea4 Compare March 20, 2025 12:56
@@ -113,15 +113,15 @@ func (m command) run(c *cli.Context, cfg *config) error {
return fmt.Errorf("failed to load container state: %v", err)
}

-containerRoot, err := s.GetContainerRoot()
+containerRoot, err := s.GetContainerRootDirPath()

❤️ Thanks for adding clarity here.

And now that I have read your explanation.. this is a path that is valid in the host filesystem, and points to the container filesystem's root..? :)

Member Author

Yes. It is the absolute path on the host where the root of the container's filesystem is located.

jgehrcke previously approved these changes Mar 21, 2025

@jgehrcke jgehrcke left a comment

Thank you for all the iteration. Let's land this. I have left a few more questions while browsing this, but please don't slow down landing this because of my remarks.

@coveralls

Pull Request Test Coverage Report for Build 15707946920

Details

  • 9 of 233 (3.86%) changed or added relevant lines in 6 files are covered.
  • 27 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 33.356%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
cmd/nvidia-cdi-hook/commands/commands.go | 0 | 1 | 0.0%
internal/ldconfig/ldconfig_linux.go | 0 | 2 | 0.0%
cmd/nvidia-cdi-hook/update-ldcache/update-ldcache.go | 0 | 12 | 0.0%
cmd/nvidia-cdi-hook/create-soname-symlinks/soname-symlinks.go | 0 | 88 | 0.0%
internal/ldconfig/ldconfig.go | 0 | 121 | 0.0%

Files with Coverage Reduction | New Missed Lines | %
internal/discover/hooks.go | 27 | 58.14%

Totals
Change from base Build 15683627970: -0.3%
Covered Lines: 4336
Relevant Lines: 12999

💛 - Coveralls

@coveralls

coveralls commented Jun 17, 2025

Pull Request Test Coverage Report for Build 15849722836

Details

  • 9 of 231 (3.9%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 33.387%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
cmd/nvidia-cdi-hook/commands/commands.go | 0 | 1 | 0.0%
internal/ldconfig/ldconfig_linux.go | 0 | 2 | 0.0%
cmd/nvidia-cdi-hook/update-ldcache/update-ldcache.go | 0 | 12 | 0.0%
cmd/nvidia-cdi-hook/create-soname-symlinks/soname-symlinks.go | 0 | 88 | 0.0%
internal/ldconfig/ldconfig.go | 0 | 119 | 0.0%

Totals
Change from base Build 15846172189: -0.3%
Covered Lines: 4377
Relevant Lines: 13110

💛 - Coveralls

@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch from f653469 to bedd086 Compare June 17, 2025 17:54
@elezar elezar requested a review from jgehrcke June 18, 2025 10:10
@elezar elezar dismissed stale reviews from jgehrcke and ArangoGutierrez June 18, 2025 10:10

refactor

-if ldconfig != "" {
-args = append(args, "--ldconfig-path", ldconfig)
+if d.ldconfigPath != "" {
+args = append(args, "--ldconfig-path", d.ldconfigPath)

💪

args = append(args, "-N")
}

// If we did not create the

If we did not create the rest of the sentence, then

Member Author

Updated.

}

type options struct {
folders cli.StringSlice

directoryPaths?

Collaborator

Agree with Jan-P, Directory is a much nicer call than Folder.

Member Author

These options are filled by the CLI. A convention that I use is that the member matches (with the exception of plurality if required) the primary flag. Since the CLI argument is --folder (and can be repeated), I want to keep this folders.
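The convention can be illustrated with a small sketch. The PR itself uses cli.StringSliceFlag; the version below uses only Go's standard flag package to stay self-contained, so the stringSlice type and parseOptions helper are stand-ins rather than the actual implementation.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringSlice is a minimal repeatable flag value.
type stringSlice []string

func (s *stringSlice) String() string {
	if s == nil {
		return ""
	}
	return strings.Join(*s, ",")
}

// Set appends each occurrence of the flag instead of overwriting it.
func (s *stringSlice) Set(value string) error {
	*s = append(*s, value)
	return nil
}

// options follows the convention from the discussion: the member name is the
// plural of the primary flag name (--folder -> folders).
type options struct {
	folders stringSlice
}

// parseOptions parses the command-line arguments into options.
func parseOptions(args []string) (*options, error) {
	opts := &options{}
	fs := flag.NewFlagSet("create-soname-symlinks", flag.ContinueOnError)
	fs.Var(&opts.folders, "folder", "Specify a folder to generate soname symlinks in (may be repeated)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return opts, nil
}

func main() {
	// The folder values are placeholders.
	opts, err := parseOptions([]string{"--folder", "/usr/lib64", "--folder", "/opt/example/lib"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts.folders)
}
```

Keeping the member name tied to the flag name means a search for "folder" finds both the CLI definition and every place the value is consumed.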

@jgehrcke jgehrcke left a comment

Should we stop iterating on minor details, and just land this? :) Thank you, Evan!

@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch from bedd086 to fddc1d3 Compare June 24, 2025 09:22
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Some quick nits; I lean more towards directory than folder.
PR looks good overall, just a naming convention discussion and one misspelling.

}

type options struct {
folders cli.StringSlice
Collaborator

Agree with Jan-P, Directory is a much nicer call than Folder.

c.Flags = []cli.Flag{
&cli.StringSliceFlag{
Name: "folder",
Usage: "Specify a folder to generate soname symlinks in",
Collaborator

If we end up deciding on directory over folder in the discussion above, let's make sure we update this string.

&cli.StringSliceFlag{
Name: "folder",
Usage: "Specify a folder to generate soname symlinks in",
Destination: &cfg.folders,
Collaborator

and here

}
}

// createSonameSymlinks runs ldconfig enusures that soname symlinks are created
Collaborator

Suggested change
-// createSonameSymlinks runs ldconfig enusures that soname symlinks are created
+// createSonameSymlinks runs ldconfig ensures that soname symlinks are created

//
// args[0] is the reexec initializer function name
// args[1] is the path of the ldconfig binary on the host
// args[2] is the container root directory
Collaborator

we use directory here, I guess this is a good clue on how to define the discussion above.

Member Author

This directory is different from the other directories that we're talking about.
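The difference is visible in the argument layout documented in the code comment: args[2] is the single container root directory, while everything after it is the list of folders to process. A hedged sketch of decoding that layout follows; the struct, function name, and example values are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// reexecArgs names the positional arguments described in the code comment.
type reexecArgs struct {
	initializerName  string   // args[0]: reexec initializer function name
	ldconfigPath     string   // args[1]: path of the ldconfig binary on the host
	containerRootDir string   // args[2]: container root directory
	folders          []string // args[3:]: folders in which to create soname symlinks
}

// parseReexecArgs validates the argument count and splits args by position.
func parseReexecArgs(args []string) (reexecArgs, error) {
	if len(args) < 3 {
		return reexecArgs{}, fmt.Errorf("expected at least 3 arguments, got %d", len(args))
	}
	return reexecArgs{
		initializerName:  args[0],
		ldconfigPath:     args[1],
		containerRootDir: args[2],
		folders:          args[3:],
	}, nil
}

func main() {
	// All values below are placeholders.
	parsed, err := parseReexecArgs([]string{
		"example-initializer", "/sbin/ldconfig", "/run/example/rootfs", "/usr/lib64",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", parsed)
}
```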

//
// args[0] is the reexec initializer function name
// args[1] is the path of the ldconfig binary on the host
// args[2] is the container root directory
-// The remaining args are folders that need to be added to the ldcache.
+// The remaining args are folders where soname symlinks need to be created.
Collaborator

Haha, the folders vs. dir conversation again. I guess it was a nice nit catch by @jgehrcke

}

// CreateSonameSymlinks uses ldconfig to create the soname symlinks in the
// specified folders.
Collaborator

folder vs dir

Comment on lines +114 to +116
// If the ld.so.conf.d directory exists, we create a config file there
// containing the required directories, otherwise we add the specified
// directories to the ldconfig command directly.
Collaborator

here we use dir but some lines above we use folder
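The branching described in the code comment under review (write a config file if ld.so.conf.d exists, otherwise pass the directories on the command line) could be sketched like this. The function name and the config file name are assumptions, and the temporary directory only stands in for a container root.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// exampleConfName is a made-up file name used only for this sketch.
const exampleConfName = "00-example.conf"

// ldconfigDirArgs returns the directories to append to the ldconfig command.
// If <rootDir>/etc/ld.so.conf.d exists, the directories are written to a
// config file there and no extra arguments are needed.
func ldconfigDirArgs(rootDir string, dirs []string) ([]string, error) {
	confDir := filepath.Join(rootDir, "etc", "ld.so.conf.d")
	if info, err := os.Stat(confDir); err == nil && info.IsDir() {
		content := strings.Join(dirs, "\n") + "\n"
		confFile := filepath.Join(confDir, exampleConfName)
		if err := os.WriteFile(confFile, []byte(content), 0o644); err != nil {
			return nil, err
		}
		return nil, nil
	}
	// No config directory: pass the directories to ldconfig directly.
	return append([]string(nil), dirs...), nil
}

type demoResult struct {
	withoutConfDir []string // args when ld.so.conf.d is absent
	withConfDir    []string // args when ld.so.conf.d exists
	written        string   // contents of the generated config file
}

// demo exercises both branches inside a temporary directory.
func demo() (demoResult, error) {
	var res demoResult
	rootDir, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		return res, err
	}
	defer os.RemoveAll(rootDir)

	dirs := []string{"/usr/lib64", "/opt/example/lib"}
	if res.withoutConfDir, err = ldconfigDirArgs(rootDir, dirs); err != nil {
		return res, err
	}
	confDir := filepath.Join(rootDir, "etc", "ld.so.conf.d")
	if err = os.MkdirAll(confDir, 0o755); err != nil {
		return res, err
	}
	if res.withConfDir, err = ldconfigDirArgs(rootDir, dirs); err != nil {
		return res, err
	}
	data, err := os.ReadFile(filepath.Join(confDir, exampleConfName))
	if err != nil {
		return res, err
	}
	res.written = string(data)
	return res, nil
}

func main() {
	res, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", res)
}
```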

@elezar
Member Author

elezar commented Jun 24, 2025

Some quick nits; I lean more towards directory than folder.

Please note that the API for the hook uses --folder as a command line argument and this is not something that I want to change at the moment. I will update the local content to be more consistent though.

elezar added 2 commits June 24, 2025 13:49
This change adds a create-soname-symlinks hook that can be used to ensure
that the soname symlinks for injected libraries exist in a container.

This is done by calling ldconfig -n -N for the directories containing the injected
libraries.

This also ensures that libcuda.so is present in the ldcache when the update-ldcache
hook is run.

Signed-off-by: Evan Lezar <[email protected]>
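The ldconfig invocation named in the commit message can be sketched as below. Per ldconfig(8), -n restricts processing to the directories given on the command line and -N skips rebuilding the cache, so only the links are updated. The function name and paths are hypothetical, and the sketch leaves out how the real hook arranges for ldconfig to operate on the container root.

```go
package main

import (
	"fmt"
	"os/exec"
)

// sonameSymlinkCommand builds (but does not run) an ldconfig command that
// creates soname symlinks in dirs without touching the ldcache.
func sonameSymlinkCommand(ldconfigPath string, dirs []string) *exec.Cmd {
	args := []string{
		"-n", // only process the directories given on the command line
		"-N", // do not rebuild the cache
	}
	// Explicitly specify the directories to process.
	args = append(args, dirs...)
	return exec.Command(ldconfigPath, args...)
}

func main() {
	// Both paths are placeholders.
	cmd := sonameSymlinkCommand("/sbin/ldconfig", []string{"/usr/lib/x86_64-linux-gnu"})
	fmt.Println(cmd.Args)
}
```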
@elezar elezar force-pushed the ensure-libcuda.so-in-ldcache branch from fddc1d3 to 39975fc Compare June 24, 2025 11:54
@elezar elezar requested a review from ArangoGutierrez June 24, 2025 12:10
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

My comments on folder vs directory have been addressed.
LGTM now

@elezar elezar merged commit c95d36d into NVIDIA:main Jun 24, 2025
16 checks passed
@elezar elezar deleted the ensure-libcuda.so-in-ldcache branch June 24, 2025 17:55
Development

Successfully merging this pull request may close these issues.

LDCache does not include libcuda.so
5 participants