@rahulait rahulait commented Dec 4, 2025

Description

This PR adds dynamic detection of loaded kernel driver modules.

When CDI is enabled:

Since k8s-device-plugin tolerates missing drivers and does not crash when one is absent, we can always attempt to discover the supported devices and drivers.

When CDI is disabled:

We detect at runtime whether the driver is loaded and set the relevant environment variable accordingly.

Testing

  • Unit tests
  • Manual cluster testing (describe below)
  • N/A or Other (docs, CI config, etc.)

Test details:

copy-pr-bot bot commented Dec 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines 36 to 37
// isDriverLoaded checks if the specified driver module is loaded by reading the modules file.
// It first checks "/host/proc/modules" (for containerized environments) and falls back to "/proc/modules" if not found.
Collaborator

Nit: let's keep a maximum column length for the docu-comments around 80.

})
}

// isDriverLoadedWithPath checks if the specified driver module is loaded by reading the specified modules file.
Collaborator

Nit: let's keep a maximum column length for the docu-comments around 80.

  cdi.WithDeviceIDStrategy(*config.Flags.Plugin.DeviceIDStrategy),
  cdi.WithVendor("k8s.device-plugin.nvidia.com"),
- cdi.WithGdrcopyEnabled(*config.Flags.GDRCopyEnabled),
+ cdi.WithGdrcopyEnabled(*config.Flags.GDRCopyEnabled || utils.IsGdrdrvLoaded()),
Member

Question: Since these options are effectively no-ops when the drivers are not available, would it be simpler to change the defaults for these settings instead of adding logic to check whether the modules are loaded?

Changes include:
1. For cdi, always attempt discovery of devices
2. For non-cdi, dynamically detect and enable if driver is loaded

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
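
The two rules in the commit message above can be summarized as a small decision function. This is only a sketch under stated assumptions: `shouldEnable` and its parameters are hypothetical names chosen for illustration, and the actual PR wires this logic through its own configuration and CDI options.

```go
package main

// shouldEnable sketches the enablement rule from the commit message.
// (Hypothetical function, not part of the PR.)
//
//   - With CDI, discovery is always attempted, since it is described as
//     a no-op when the driver is missing.
//   - Without CDI, the feature is enabled if the user set the flag or
//     the driver module was detected as loaded.
func shouldEnable(cdiEnabled, flagSet, driverLoaded bool) bool {
	if cdiEnabled {
		return true
	}
	return flagSet || driverLoaded
}
```

For example, `shouldEnable(false, false, true)` is true because the driver was detected, while `shouldEnable(false, false, false)` is false.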
@rahulait rahulait force-pushed the dynamic-detect-gdrcopy branch from eb38ccb to 7a2c95a Compare December 25, 2025 06:22
3 participants