EKS karpenter integration issues #10

@pingxiaoaws

Description

Background
The mock-device-plugin is essential for Karpenter integration with HAMi, as it registers GPU resource labels in node capacity (e.g., nvidia.com/gpucores), allowing Karpenter to properly recognize node initialization status and perform disruption operations.
However, we discovered three critical bugs that prevent the plugin from working correctly:

Problem 1: Incorrect parsing method in GetNodeDevices

Location: nvidia/device.go:101-105

Issue Description: The GetNodeDevices method attempts to parse device information from the node annotation hami.io/node-nvidia-register, but it uses the wrong parsing function:

- The current implementation uses `device.UnMarshalNodeDevices()`, which expects JSON format
- The actual annotation value is a comma-separated string
- As a result, `UnMarshalNodeDevices` fails to parse it

Solution: use `device.DecodeNodeDevices()` instead of `device.UnMarshalNodeDevices()`

Reference Implementation:
The correct usage pattern:

```go
// Use DecodeNodeDevices for the comma-separated string format
devices, err := device.DecodeNodeDevices(annotationValue)
if err != nil {
    // handle error
}
```

Problem 2: Race condition between mock-device-plugin and HAMi device plugin
Issue Description:

  1. One-time initialization: The Initialize() function is only called once at startup
  2. External annotation dependency: AddResource() depends on node annotations like hami.io/node-nvidia-register
  3. No retry mechanism: If annotations don't exist during initialization, there's no subsequent retry attempt

Solution: Implement a synchronization mechanism that handles the startup sequence and resource-registration order between the mock-device-plugin and the HAMi device plugin.

Suggested approaches:
- Add a startup delay or readiness check
- Implement a proper locking mechanism
- Use leader election or coordination to avoid conflicts

Problem 3: Static resource count

`p.Count` is fixed at initialization, so even if the node annotations are updated later, `ListAndWatch` keeps reporting the same number of devices.

```go
// In mock/server.go
func (p *MockPlugin) ListAndWatch(...) error {
    devs := make([]*kubeletdevicepluginv1beta1.Device, p.Count) // p.Count is fixed at startup
    // ...
    s.Send(&kubeletdevicepluginv1beta1.ListAndWatchResponse{Devices: devs})
    for {
        time.Sleep(time.Second * 10)
        s.Send(&kubeletdevicepluginv1beta1.ListAndWatchResponse{Devices: devs}) // always sends the same device list
    }
}
```

Root Cause: The mock device plugin lacks a dynamic update mechanism. While kubelet does use updated allocatable and capacity values, it requires the device plugin to actively send updated device lists through the ListAndWatch stream. The current implementation is missing the capability to:

  1. Dynamically monitor node annotation changes
  2. Recalculate resource counts
  3. Send updated device lists via ListAndWatch

Environment
Karpenter version: 1.8.2
Kubernetes version: 1.34
HAMi version: 2.7.1
mock-device-plugin version: latest from main branch
Impact
These bugs prevent the mock-device-plugin from functioning correctly in production environments with Karpenter, blocking critical node scale-down operations.

Thank you for your attention to these issues!
