Introduce new plugin `capacity-card` to provide fine-grained GPU, NPU and other accelerator card resource management and scheduling capabilities in heterogeneous computing clusters #4706
Conversation
Summary of Changes: Hello @lhlxc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
Code Review
This pull request introduces a new and powerful capacity-card scheduler plugin for fine-grained management of GPU and other accelerator resources. The changes are extensive, including the core plugin logic, a comprehensive design document, and a full suite of e2e tests. The overall approach is well-designed and aligns with Volcano's architecture. I've identified a couple of potentially high-severity issues related to resource accounting in the multi-card and deallocation logic that could impact scheduling accuracy. Additionally, I've provided some suggestions for minor improvements in documentation, error messages, and test scripts. This is a significant and valuable feature addition to Volcano.
```go
metrics.UpdateQueueAllocated(
	qAttr.name, qAttr.allocated.MilliCPU, qAttr.allocated.Memory, qAttr.allocated.ScalarResources,
)
klog.V(4).Infof(
```
```go
multiCardToBeUsedResource := toBeUsedResource.Clone()
// TODO: Support different Pods in the same job using different kind of cards, but a pod using one kind of card.
// now all pods in the same job using the same kind of cards.
for _, cardName := range multiCardNames {
	multiCardToBeUsedResource.ScalarResources[v1.ResourceName(cardName)] += scalarQuant
	if result = CheckSingleScalarResource(
		v1.ResourceName(cardName), scalarQuant, multiCardToBeUsedResource, queueCapability,
	); result.Ok {
		return result
	}
}
```
In the multi-card check, multiCardToBeUsedResource is cloned outside the loop and modified within it. This causes the resource modification from checking one card type to be carried over to the check for the next card type, which is incorrect. The check for each card type should be independent. To fix this, you should clone toBeUsedResource inside the loop.
Suggested change:

```go
// TODO: Support different Pods in the same job using different kind of cards, but a pod using one kind of card.
// now all pods in the same job using the same kind of cards.
for _, cardName := range multiCardNames {
	multiCardToBeUsedResource := toBeUsedResource.Clone()
	multiCardToBeUsedResource.ScalarResources[v1.ResourceName(cardName)] += scalarQuant
	if result = CheckSingleScalarResource(
		v1.ResourceName(cardName), scalarQuant, multiCardToBeUsedResource, queueCapability,
	); result.Ok {
		return result
	}
}
```
CheckSingleScalarResource only checks one card type; other card types have no impact on the checking result.
docs/design/capacity-card.md (Outdated)
1. **Lack of Fine-grained Card Type Management**: Standard Kubernetes resource requests cannot distinguish between different GPU card types (e.g., A100 vs. V100) or different GPU sharing profiles (MPS, MIG, etc.).
2. **Insufficient Queue-level Card Quota Control**: Organizations need to allocate specific numbers of different card types to different teams/projects, which cannot be easily achieved with native Kubernetes resource quotas.
3. **Inflexible Multi-Card Selection**: Jobs can often run on multiple types of cards with similar capabilities, but Kubernetes lacks a mechanism to express "this job can use card type A OR card type B".

The Capacity Card plugin addresses these challenges by providing:

- Annotation-based card resource specification for queues and jobs
- Support for multiple card types and sharing modes
- Multi-card selection capability (e.g., "use A100 or H100")
- Integration with Volcano's capacity scheduling framework
## In Scope

- Fine-grained card resource quota management at the queue level
- Job-level card resource request validation before enqueueing
- Task-level card name specification and allocation validation
- Support for MPS (Multi-Process Service) shared GPU resources
- Support for MIG (Multi-Instance GPU) shared GPU resources
- Support for whole-card and mixed card/shared resource scenarios
- Multi-card selection support (allowing tasks to specify multiple acceptable card types)
- Automatic card resource discovery from node labels
- CPU/Memory unlimited mode for card resources (optional)
## Out of Scope

- Hierarchical queue card quota management
- Preemption and reclaim (not supported for now)
## User Stories

### Story 1: Heterogeneous GPU Cluster Management

As a cluster administrator, I want to manage a cluster with multiple GPU types (A100, H100, V100) and allocate specific card quotas to different teams through queues.

For example:
- Team A (queue-a): 10 A100 cards, 5 H100 cards
- Team B (queue-b): 20 V100 cards, 3 A100 cards

### Story 2: Multi-Card Selection for Job Flexibility

As a data scientist, I want to submit a training job that can run on either A100 or H100 GPUs, whichever is available first, without creating separate job submissions.

### Story 3: Mixed Whole and Shared GPU Scheduling

As a platform engineer, I want to provide both whole GPU cards for large training jobs and MPS/MIG partitioned cards for inference services in the same cluster, with separate quota management.

### Story 4: Queue-level Card Quota Enforcement

As a resource manager, I want to ensure that no team can exceed their allocated card quota, even if cluster capacity is available, to enforce SLA agreements.
## Design Detail

### Architecture Overview

The Capacity Card plugin works by:

1. Discovering card resources from node labels and status
2. Parsing card quotas from queue annotations
3. Validating job card requests against queue card quotas
4. Tracking card resource allocation across jobs and tasks
5. Enforcing allocation limits during scheduling
### Key Concepts

#### Card Resource vs. K8s Resource

- **K8s Resource Name**: The actual resource name in node status and pod requests (e.g., `nvidia.com/gpu`, `nvidia.com/gpu.shared`, `nvidia.com/mig-1g.5gb`)
- **Card Name**: A user-friendly, normalized name for the card type (e.g., `NVIDIA-A100-80GB`, `NVIDIA-A100-80GB/mps-80g*1/8`)

The plugin maintains a mapping between card names and K8s resource names for scheduling decisions.

#### Card Types

1. **Whole Card**: Full GPU card resources (e.g., `nvidia.com/gpu`)
2. **MPS Shared Card**: NVIDIA MPS partitioned GPUs (e.g., `nvidia.com/gpu.shared`)
3. **MIG Shared Card**: NVIDIA MIG partitioned GPUs (e.g., `nvidia.com/mig-1g.5gb`)
#### Multi-Card Request

Tasks can specify multiple acceptable card types separated by `|`:

```
NVIDIA-A100-80GB|NVIDIA-H100-80GB
```

During scheduling, the plugin checks whether any of the specified card types has sufficient quota in the queue.
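As a minimal illustration of this syntax, the sketch below splits an annotation value into candidate card names; it is an assumption-level sketch, not the plugin's actual parsing code.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCardNames splits a multi-card annotation value such as
// "NVIDIA-A100-80GB|NVIDIA-H100-80GB" into candidate card names.
func parseCardNames(annotation string) []string {
	var names []string
	for _, n := range strings.Split(annotation, "|") {
		if n = strings.TrimSpace(n); n != "" {
			names = append(names, n)
		}
	}
	return names
}

func main() {
	fmt.Println(parseCardNames("NVIDIA-A100-80GB|NVIDIA-H100-80GB"))
	// Output: [NVIDIA-A100-80GB NVIDIA-H100-80GB]
}
```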
### API Design

#### Queue Annotation for Card Quota

Queues use the annotation `volcano.sh/card.quota` to specify card resource quotas:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-a
  annotations:
    volcano.sh/card.quota: |
      {
        "NVIDIA-A100-80GB": 10,
        "NVIDIA-H100-80GB": 5,
        "NVIDIA-A100-80GB/mps-80g*1/8": 16
      }
spec:
  capability:
    cpu: "100"
    memory: "200Gi"
  guarantee:
    resource:
      cpu: "50"
      memory: "100Gi"
```

**Format**: JSON object mapping card names to counts (integers)
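The sketch below shows one way this annotation could be decoded and converted to Volcano's milli-unit scalar quantities (see "Card Resource Quantification" later in this document); the function name and map representation are illustrative assumptions, not the plugin's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseCardQuota turns the volcano.sh/card.quota JSON value into
// milli-unit quantities (1 card = 1000), matching Volcano's internal
// scalar resource representation.
func parseCardQuota(annotation string) (map[string]int64, error) {
	counts := map[string]int64{}
	if err := json.Unmarshal([]byte(annotation), &counts); err != nil {
		return nil, fmt.Errorf("invalid card quota %q: %w", annotation, err)
	}
	milli := make(map[string]int64, len(counts))
	for name, n := range counts {
		milli[name] = n * 1000
	}
	return milli, nil
}

func main() {
	q, _ := parseCardQuota(`{"NVIDIA-A100-80GB": 10, "NVIDIA-H100-80GB": 5}`)
	fmt.Println(q) // map[NVIDIA-A100-80GB:10000 NVIDIA-H100-80GB:5000]
}
```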
#### Job Annotation for Card Request

Jobs use the annotation `volcano.sh/card.request` to specify card resource requests for validation:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-job
  annotations:
    volcano.sh/card.request: |
      {
        "NVIDIA-A100-80GB": 8
      }
spec:
  schedulerName: volcano
  queue: queue-a
  minAvailable: 1
  tasks:
    - replicas: 8
      name: worker
      template:
        metadata:
          annotations:
            volcano.sh/card.name: "NVIDIA-A100-80GB"
        spec:
          containers:
            - name: trainer
              image: training:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```

**Purpose**: Pre-validation before job enqueueing to provide fast feedback
#### Task Annotation for Card Name

Tasks/Pods use the annotation `volcano.sh/card.name` to specify the desired card name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  annotations:
    volcano.sh/card.name: "NVIDIA-A100-80GB|NVIDIA-H100-80GB"
spec:
  schedulerName: volcano
  containers:
    - name: trainer
      image: training:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

**Multi-Card Format**: Use `|` to separate multiple acceptable card types. The scheduler checks quota availability for each type and allocates based on availability.
#### Plugin Configuration

The plugin supports configuration through the scheduler config:

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: capacity-card
    arguments:
      cardUnlimitedCpuMemory: true # Optional: if true, card resources don't require CPU/Memory quota
```

**Configuration Options**:
- `cardUnlimitedCpuMemory` (bool, default: false): If set to true, tasks requesting card resources are not checked against the queue's CPU/Memory quota limits. Useful when card resources are the primary constraint.
### Node Card Discovery

The plugin automatically discovers card resources from node labels:

#### Label Format

```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    nvidia.com/gpu.product: "NVIDIA-A100-80GB" # Card product name
    nvidia.com/gpu.count: "8"                  # Number of cards
    nvidia.com/gpu.memory: "81920"             # Memory per card in MB
    nvidia.com/gpu.replicas: "8"               # For MPS: number of replicas
    nvidia.com/mig-1g.5gb.count: "7"           # For MIG: count of this profile
status:
  allocatable:
    nvidia.com/gpu: "8"          # Whole card resource
    nvidia.com/gpu.shared: "64"  # MPS shared resource
    nvidia.com/mig-1g.5gb: "7"   # MIG partition resource
```
#### Card Name Generation

- **Whole Card**: Uses the value from the `<prefix>/gpu.product` label
  - Example: `NVIDIA-A100-80GB`
- **MPS Card**: Generated as `<card-product>/mps-<memory>g*1/<replicas>`
  - Example: `NVIDIA-A100-80GB/mps-80g*1/8`
- **MIG Card**: Generated as `<card-product>/mig-<profile>-mixed`
  - Example: `NVIDIA-A100-80GB/mig-1g.5gb-mixed`

A minimal sketch of this naming scheme is shown below.
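The following sketch derives card names from the label keys listed under "Label Format"; it is an illustrative assumption of the scheme above, not the plugin's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildCardNames derives the user-facing card names for one node from
// its labels, following the naming conventions in this design.
func buildCardNames(labels map[string]string, migProfiles []string) []string {
	product := labels["nvidia.com/gpu.product"] // e.g. "NVIDIA-A100-80GB"
	names := []string{product}                  // whole-card name

	// MPS card: <card-product>/mps-<memory>g*1/<replicas>
	replicas, _ := strconv.Atoi(labels["nvidia.com/gpu.replicas"])
	if replicas > 1 {
		memMB, _ := strconv.Atoi(labels["nvidia.com/gpu.memory"]) // per-card memory in MB
		names = append(names, fmt.Sprintf("%s/mps-%dg*1/%d", product, memMB/1024, replicas))
	}

	// MIG card: <card-product>/mig-<profile>-mixed
	for _, profile := range migProfiles { // e.g. "1g.5gb"
		names = append(names, fmt.Sprintf("%s/mig-%s-mixed", product, profile))
	}
	return names
}

func main() {
	labels := map[string]string{
		"nvidia.com/gpu.product":  "NVIDIA-A100-80GB",
		"nvidia.com/gpu.memory":   "81920",
		"nvidia.com/gpu.replicas": "8",
	}
	fmt.Println(buildCardNames(labels, []string{"1g.5gb"}))
	// Output: [NVIDIA-A100-80GB NVIDIA-A100-80GB/mps-80g*1/8 NVIDIA-A100-80GB/mig-1g.5gb-mixed]
}
```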
### Main Process

#### Plugin Initialization (OnSessionOpen)

1. **Build Total Resource**:
   - List all nodes from the informer
   - Extract card information from node labels
   - Parse card resources from node status
   - Build the mapping: card name → K8s resource name
   - Calculate cluster total resources (CPU, Memory, Cards)

2. **Build Queue Attributes**:
   - Parse card quotas from queue annotations (`volcano.sh/card.quota`)
   - Calculate queue capability, guarantee, and deserved resources
   - Track allocated, inqueue, and elastic resources per queue

3. **Register Scheduling Functions**:
   - `JobEnqueueableFn`: Pre-check job card requests against queue quota
   - `AllocatableFn`: Validate task card allocation against queue quota
   - `AllocateFunc` / `DeallocateFunc`: Update queue resource tracking
#### Job Enqueueable Check

When a job is submitted:

1. Parse the job's card request from the annotation (`volcano.sh/card.request`)
2. Calculate the total resources to be used: `allocated + inqueue + job.minResources - elastic`
3. Check the CPU/Memory quota (unless `cardUnlimitedCpuMemory` is enabled)
4. Check the card resource quota:
   - For each card type requested
   - If it is a multi-card request (contains `|`), check each alternative
   - Verify: `totalToBeUsed[cardType] <= queueCapability[cardType]`
5. If all checks pass, mark the job as InQueue and reserve resources

A sketch of the quota check in steps 2-4 follows.
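This is a hedged sketch of the enqueue-time formula, assuming card quantities are tracked as milli-units in plain maps; all names are illustrative rather than the plugin's real types.

```go
package main

import "fmt"

// enqueueable reports whether a job's card request fits the queue quota,
// using toBeUsed = allocated + inqueue + minResources - elastic.
func enqueueable(allocated, inqueue, minReq, elastic, capability map[string]int64) error {
	for card, req := range minReq {
		toBeUsed := allocated[card] + inqueue[card] + req - elastic[card]
		if quota, ok := capability[card]; !ok || toBeUsed > quota {
			return fmt.Errorf("insufficient quota for card %q: need %d, capability %d",
				card, toBeUsed, quota)
		}
	}
	return nil
}

func main() {
	capability := map[string]int64{"NVIDIA-A100-80GB": 5000} // 5 cards
	allocated := map[string]int64{"NVIDIA-A100-80GB": 2000}  // 2 cards already in use
	minReq := map[string]int64{"NVIDIA-A100-80GB": 4000}     // job needs 4 cards
	fmt.Println(enqueueable(allocated, map[string]int64{}, minReq, map[string]int64{}, capability))
	// insufficient quota for card "NVIDIA-A100-80GB": need 6000, capability 5000
}
```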
#### Task Allocatable Check

When scheduling a task:

1. Parse the task's card request from the annotation (`volcano.sh/card.name`)
2. Extract the card resource from the pod's resource requests
3. Calculate the total resources to be allocated: `allocated + task.request`
4. Check the CPU/Memory quota (unless `cardUnlimitedCpuMemory` is enabled)
5. Check the card resource quota:
   - Support multi-card selection (e.g., `A100|H100`)
   - For multi-card, check each option and succeed if any passes
   - Verify: `totalToBeAllocated[cardType] <= queueCapability[cardType]`
6. If the checks fail, emit Kubernetes events on the pod with the reason
#### Resource Tracking

The plugin maintains real-time resource tracking:

- **On Allocate**: Add task resources to `queue.allocated`
- **On Deallocate**: Subtract task resources from `queue.allocated`
- **Queue Share Calculation**: `share = max(allocated[resource] / deserved[resource])`

A worked sketch of the share calculation is shown below.
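A minimal sketch of the share formula, under the assumption that per-queue resources are tracked as float values in plain maps (names are illustrative only):

```go
package main

import "fmt"

// queueShare returns the maximum allocated/deserved ratio across resources.
func queueShare(allocated, deserved map[string]float64) float64 {
	share := 0.0
	for res, d := range deserved {
		if d <= 0 {
			continue // a resource with no deserved amount does not contribute
		}
		if s := allocated[res] / d; s > share {
			share = s
		}
	}
	return share
}

func main() {
	allocated := map[string]float64{"cpu": 30, "NVIDIA-A100-80GB": 4000}
	deserved := map[string]float64{"cpu": 100, "NVIDIA-A100-80GB": 5000}
	fmt.Println(queueShare(allocated, deserved)) // 0.8, driven by the A100 quota
}
```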
### Implementation Details

#### Card Resource Quantification

Card resources are stored as scalar resources in milli-units (multiplied by 1000):

- 2 cards → 2000 in scalar resources
- This aligns with Volcano's internal resource representation
#### Multi-Card Request Processing

For a multi-card request like `A100|H100|V100`:

1. Split by the `|` separator
2. For each card type in the list:
   - Clone `toBeUsedResource`
   - Add the requested quantity to that individual card name
   - Check whether `toBeUsedResource[cardType] <= queueCapability[cardType]`
   - If any card type passes, return success
3. If all fail, return the error with the multi-card name

A self-contained sketch of this loop follows.
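The sketch below takes the clone inside the loop so each alternative is checked independently, matching the review suggestion earlier in this thread; the maps and names are illustrative assumptions, not the plugin's real code.

```go
package main

import (
	"fmt"
	"strings"
)

// checkMultiCard succeeds if any alternative in a multi-card request fits
// the queue capability; quantities are in milli-units (1 card = 1000).
func checkMultiCard(multiCard string, quant int64, toBeUsed, capability map[string]int64) error {
	for _, card := range strings.Split(multiCard, "|") {
		// Clone inside the loop: each card type is evaluated independently.
		clone := make(map[string]int64, len(toBeUsed))
		for k, v := range toBeUsed {
			clone[k] = v
		}
		clone[card] += quant
		if clone[card] <= capability[card] {
			return nil // any passing alternative is enough
		}
	}
	return fmt.Errorf("insufficient quota for multi-card request %q", multiCard)
}

func main() {
	toBeUsed := map[string]int64{"NVIDIA-A100-80GB": 5000}
	capability := map[string]int64{"NVIDIA-A100-80GB": 5000, "NVIDIA-H100-80GB": 3000}
	// A100 quota is exhausted, but H100 still has room, so the request succeeds.
	fmt.Println(checkMultiCard("NVIDIA-A100-80GB|NVIDIA-H100-80GB", 1000, toBeUsed, capability))
	// <nil>
}
```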
#### Event Recording

The plugin emits Kubernetes events for:

- `GetTaskRequestResourceFailed`: Failed to parse the task resource request
- `EmptyQueueCapability`: Queue has no capability configured
- `InsufficientCPUQuota`: Insufficient CPU quota in the queue
- `InsufficientMemoryQuota`: Insufficient memory quota in the queue
- `InsufficientScalarQuota`: Insufficient card/scalar quota in the queue

A sketch of emitting one of these events is shown below.
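As an illustration, the sketch below emits one of these events with client-go's `record.EventRecorder`; how the plugin actually obtains its recorder, and the exact message text, are assumptions here.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// warnInsufficientScalarQuota attaches a warning event to the pod,
// using one of the event reasons listed above.
func warnInsufficientScalarQuota(recorder record.EventRecorder, pod *v1.Pod, card string) {
	recorder.Eventf(pod, v1.EventTypeWarning, "InsufficientScalarQuota",
		"insufficient quota for card <%s> in queue", card)
}

func main() {
	recorder := record.NewFakeRecorder(1) // test double standing in for the scheduler's recorder
	warnInsufficientScalarQuota(recorder, &v1.Pod{}, "NVIDIA-A100-80GB")
	fmt.Println(<-recorder.Events)
	// Warning InsufficientScalarQuota insufficient quota for card <NVIDIA-A100-80GB> in queue
}
```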
### Integration with Capacity Scheduling

The Capacity Card plugin builds upon Volcano's capacity plugin concepts:

- **Capability**: Maximum card resources a queue can use
- **Guarantee**: Reserved card resources not shared with other queues
- **Deserved**: Target allocation for fair sharing and reclaim

However, unlike the standard capacity plugin, card resources are specified via annotations rather than the Queue's ResourceList fields, allowing more flexible card type specification.
### Metrics and Observability

The plugin exports Prometheus metrics for queue resource tracking:

- `volcano_queue_card_deserved`: Deserved card resources per queue
- `volcano_queue_card_allocated`: Currently allocated card resources per queue
- `volcano_queue_card_request`: Requested card resources per queue
- `volcano_queue_card_capacity`: Card capacity per queue

A sketch of registering such a gauge is shown below.
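The sketch below shows how one of these gauges could be registered with Prometheus client_golang; the label dimensions (`queue_name`, `card_name`) are assumptions, not necessarily the plugin's actual labels.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

var queueCardAllocated = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "volcano_queue_card_allocated",
		Help: "Currently allocated card resources per queue.",
	},
	[]string{"queue_name", "card_name"}, // assumed label dimensions
)

func main() {
	prometheus.MustRegister(queueCardAllocated)
	// Updated on every allocate/deallocate event, e.g. 4 A100 cards in queue-a.
	queueCardAllocated.WithLabelValues("queue-a", "NVIDIA-A100-80GB").Set(4)
}
```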
## Example Scenarios

### Example 1: Basic Card Quota

**Cluster Setup**:
- 2 nodes with 4 A100 cards each (total: 8 A100)

**Queue Configuration**:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
  annotations:
    volcano.sh/card.quota: '{"NVIDIA-A100-80GB": 5}'
spec:
  capability:
    cpu: "100"
    memory: "500Gi"
```

**Job Submission**:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training
  annotations:
    volcano.sh/card.request: '{"NVIDIA-A100-80GB": 4}'
spec:
  queue: team-a
  minAvailable: 4
  tasks:
    - replicas: 4
      template:
        metadata:
          annotations:
            volcano.sh/card.name: "NVIDIA-A100-80GB"
        spec:
          containers:
            - name: worker
              resources:
                limits:
                  nvidia.com/gpu: 1
```

**Result**: The job is successfully enqueued (4 ≤ 5) and its tasks are scheduled.

### Example 2: Multi-Card Selection

**Job Submission**:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: flexible-training
  annotations:
    volcano.sh/card.request: '{"NVIDIA-A100-80GB|NVIDIA-H100-80GB": 4}'
spec:
  queue: team-a
  minAvailable: 1
  tasks:
    - replicas: 4
      template:
        metadata:
          annotations:
            volcano.sh/card.name: "NVIDIA-A100-80GB|NVIDIA-H100-80GB"
        spec:
          containers:
            - name: worker
              resources:
                limits:
                  nvidia.com/gpu: 1
```

**Result**: The scheduler tries to allocate A100 first; if that quota is exhausted, it tries H100.

### Example 3: MPS Shared GPU

**Node Labels**:
```yaml
nvidia.com/gpu.product: "NVIDIA-A100-80GB"
nvidia.com/gpu.count: "4"
nvidia.com/gpu.memory: "81920"
nvidia.com/gpu.replicas: "8"
```

**Node Status**:
```yaml
status:
  allocatable:
    nvidia.com/gpu.shared: "32" # 4 cards × 8 replicas
```

**Queue Configuration**:
```yaml
metadata:
  annotations:
    volcano.sh/card.quota: '{"NVIDIA-A100-80GB/mps-80g*1/8": 32}'
```

**Job Submission**:
```yaml
metadata:
  annotations:
    volcano.sh/card.request: '{"NVIDIA-A100-80GB/mps-80g*1/8": 16}'
spec:
  tasks:
    - replicas: 16
      template:
        metadata:
          annotations:
            volcano.sh/card.name: "NVIDIA-A100-80GB/mps-80g*1/8"
        spec:
          containers:
            - resources:
                limits:
                  nvidia.com/gpu.shared: 1
```

**Result**: 16 inference pods share the 4 A100 GPUs via MPS.
## Notes

1. **Plugin Compatibility**: The Capacity Card plugin is designed to work alongside other Volcano plugins (gang, priority, etc.). It should not be enabled simultaneously with the standard `capacity` or `proportion` plugin, to avoid conflicts.
2. **Card Discovery Requirements**: Node labels must be properly configured (typically by GPU operators such as the NVIDIA GPU Operator) for card discovery to work correctly.
3. **Annotation-based Design**: The choice of annotations over the native Kubernetes ResourceList allows for:
   - More flexible naming conventions
   - Support for multi-card selection syntax
   - Easier evolution without API changes
4. **Multi-Card Scheduling**: The current implementation checks quota for multi-card requests but enforces that all tasks in a job use the same card type. Future enhancements may add more flexible controls.
5. **Performance Considerations**: The plugin caches node card information to minimize overhead during scheduling cycles.

## Future Work

- Node-level card selection ordering function
- Support for different Pods in the same job using different kinds of cards
- Support for preemption and reclaim
## References

- [Volcano Capacity Scheduling Design](./capacity-scheduling.md)
- [NVIDIA MPS Documentation](https://docs.nvidia.com/deploy/mps/index.html)
- [NVIDIA MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)
- [Volcano Scheduler Framework](https://volcano.sh/en/docs/schduler_introduction/)
hack/lib/install.sh (Outdated)
```sh
# check if kind installed
function check-fake-gpu-operator {
  echo "Checking fake gpu operator"
  # retry get gpu resources on nodes
  for true; do
    kubectl get nodes -l nvidia.com/gpu.present=true | tail -n +2 | wc -l | grep 3
    if [[ $? -eq 0 ]]; then
      break
    fi
    sleep 1
  done
}
```
The comment for check-fake-gpu-operator is misleading. It says # check if kind installed, but the function checks for the fake GPU operator. Also, the for true; do loop is unconventional. It's more common and readable to use while true; do for an infinite loop. The check itself is also a bit brittle. Consider updating the comment, the loop structure, and the check for clarity and robustness.
Suggested change:

```sh
# check if fake gpu operator is ready
function check-fake-gpu-operator {
  echo "Checking fake gpu operator"
  # retry get gpu resources on nodes
  while true; do
    if [[ $(kubectl get nodes -l nvidia.com/gpu.present=true --no-headers | wc -l) -eq 3 ]]; then
      break
    fi
    sleep 1
  done
}
```
| "no resource <%s> defined in reqests/limits for card <%s>", | ||
| cardResourceName, cardName, |
Hi @lhlxc, thanks for your contribution. I'd like to ask: what's the difference between your plugin and Volcano's current deviceshare plugin?
Thank you for the question! These two plugins serve complementary but different purposes in the scheduling pipeline. The scheduling process of a job is as follows.
Could you share your proposal at this week's Volcano community meeting? (Speaking Mandarin is fine.) @lhlxc
Yes, I'd be happy to share the proposal at this week's community meeting. Could you please let me know:
Thank you!
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
I have added your agenda to this doc: https://docs.google.com/document/d/1YLbF8zjZBiR9PbXQPB22iuc_L0Oui5A1lddVfRnZrqs/edit?tab=t.0. The meeting link is https://zoom.us/j/91804791393; please download Zoom first.
Thank you for the meeting invitation and all the details! I need to complete my company's internal approval process first, and unfortunately it cannot be finished by this Friday (November 7th). Could we reschedule the meeting to a later date? I will inform you as soon as my approval is completed, and then we can confirm a new meeting time. I sincerely apologize for any inconvenience this may cause and truly appreciate your understanding. Best regards
OK, no problem. We can arrange it for 21st Nov if you're ready.
@lhlxc Hi, do you have time to participate in tomorrow's community meeting? It will be at 15:00 (UTC+8); the meeting link is https://zoom.us/j/91804791393.
Also /cc @archlitchi
Sorry for the delayed response. I'd be happy to join the next community meeting; please let me know its schedule.
The next meeting will be this Friday, 5 Dec, at 15:00 (UTC+8). We can establish a connection first; do you have WeChat or Slack? @lhlxc
We can establish a connection on WeChat.
@lhlxc What's your email address? I saw that you didn't sign off your commit. Alternatively, could you send me your WeChat ID via my email, jesseincomparable@hotmail.com? We're going to hold the Asia-friendly meeting today, 15:00-16:00 (UTC+8); do you have time to join and share your feature?
I've just replied to your email with my WeChat ID included in the message body (sent to jesseincomparable@hotmail.com; my email address is linhailisc@163.com).
OK, I will register your topic. Welcome to share your feature in today's meeting.
This is the per-node config for the NVIDIA device plugin, just FYI: https://github.com/NVIDIA/k8s-device-plugin/tree/main?tab=readme-ov-file#updating-per-node-configuration-with-a-node-label
Introduce new plugin `capacity-card` to provide fine-grained GPU, NPU and other accelerator card resource management and scheduling capabilities in heterogeneous computing clusters

What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new scheduler plugin called capacity-card that enables fine-grained management and scheduling of heterogeneous GPU/NPU/accelerator card resources in Volcano.
Key capabilities:
Main components:
- `volcano.sh/card.quota` for card quota specification
- `volcano.sh/card.request` for card request validation
- `volcano.sh/card.name` for card type selection (supports multi-card syntax with `|`)

Changes summary:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Plugin compatibility: This plugin should NOT be enabled simultaneously with the standard `capacity` or `proportion` plugin, to avoid scheduling conflicts.

Node label requirements: Proper node labels (e.g., `nvidia.com/gpu.product`, `nvidia.com/gpu.count`) must be configured for card discovery to work correctly. These are typically set by GPU operators like the NVIDIA GPU Operator.

Configuration option: The plugin supports the `cardUnlimitedCpuMemory` configuration to bypass CPU/Memory quota checks when card resources are the primary constraint.

Extensive test coverage: The PR includes comprehensive unit tests (~9,000 lines) and E2E tests to ensure reliability.

Design document: Please review `docs/design/capacity-card.md` for detailed architecture, API design, and usage scenarios.

Does this PR introduce a user-facing change?