Conversation

@lhlxc lhlxc commented Oct 31, 2025

Introduce a new plugin, capacity-card, to provide fine-grained GPU, NPU, and other accelerator card resource management and scheduling capabilities in heterogeneous computing clusters

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new scheduler plugin called capacity-card that enables fine-grained management and scheduling of heterogeneous GPU/NPU/accelerator card resources in Volcano.

Key capabilities:

  • Fine-grained card quota management: Queue-level resource quotas for different card types (e.g., A100, H100, V100) via annotations
  • Multi-card selection: Jobs can specify multiple acceptable card types (e.g., "A100|H100") for flexible scheduling
  • Mixed GPU sharing modes: Support for whole cards, MPS (Multi-Process Service), and MIG (Multi-Instance GPU) in the same cluster
  • Card resource discovery: Automatic detection of card types and resources from node labels
  • Fast feedback: Pre-validation of job card requests before enqueueing

Main components:

  • Core plugin implementation with event handling and resource tracking
  • Queue annotation volcano.sh/card.quota for card quota specification
  • Job annotation volcano.sh/card.request for card request validation
  • Task annotation volcano.sh/card.name for card type selection (supports multi-card syntax with |)
  • Comprehensive unit tests (9,000+ lines) and E2E tests
  • Design documentation and usage examples

Changes summary:

  • 31 files changed, 11,713+ insertions
  • Plugin implementation: 2,400+ lines
  • Tests: 9,000+ lines
  • Documentation and E2E tests

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  1. Plugin compatibility: This plugin should NOT be enabled simultaneously with the standard capacity or proportion plugin to avoid scheduling conflicts.

  2. Node label requirements: Proper node labels (e.g., nvidia.com/gpu.product, nvidia.com/gpu.count) must be configured for card discovery to work correctly. These are typically set by GPU operators like NVIDIA GPU Operator.

  3. Configuration option: The plugin supports cardUnlimitedCpuMemory configuration to bypass CPU/Memory quota checks when card resources are the primary constraint.

  4. Extensive test coverage: The PR includes comprehensive unit tests (~9,000 lines) and E2E tests to ensure reliability.

  5. Design document: Please review docs/design/capacity-card.md for detailed architecture, API design, and usage scenarios.

Does this PR introduce a user-facing change?

New scheduler plugin `capacity-card` for fine-grained GPU/NPU/accelerator card resource management in heterogeneous clusters. Supports multi-card selection, MPS/MIG sharing modes, and queue-level card quotas via annotations.

@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 31, 2025
@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 31, 2025
@gemini-code-assist

Summary of Changes

Hello @lhlxc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the capacity-card scheduler plugin to Volcano, addressing the need for advanced resource management and scheduling of heterogeneous accelerator cards like GPUs and NPUs in Kubernetes clusters. It enhances Volcano's capabilities by allowing cluster administrators to define fine-grained card quotas at the queue level and enabling users to request specific or flexible card types for their jobs, including support for various GPU sharing modes. The plugin aims to optimize resource utilization and provide better control over specialized hardware in AI and HPC workloads.

Highlights

  • Fine-grained Card Management: Introduces a new capacity-card scheduler plugin for fine-grained GPU, NPU, and other accelerator card resource management in heterogeneous computing clusters.
  • Queue-level Quotas: Enables queue-level resource quotas for different card types (e.g., A100, H100, V100) via volcano.sh/card.quota annotations on queues.
  • Multi-card Selection: Allows jobs to specify multiple acceptable card types (e.g., "A100|H100") for flexible scheduling using volcano.sh/card.name annotations on tasks.
  • Mixed GPU Sharing Modes: Supports various GPU sharing modes, including whole cards, MPS (Multi-Process Service), and MIG (Multi-Instance GPU), within the same cluster.
  • Resource Discovery & Validation: Automatically detects card types and resources from node labels and performs pre-validation of job card requests against queue quotas for fast feedback.
  • CPU/Memory Unlimited Mode: Provides an optional cardUnlimitedCpuMemory configuration to bypass CPU/Memory quota checks when card resources are the primary constraint.
  • Comprehensive Testing: Includes extensive unit tests (~9,000 lines) and E2E tests to ensure the reliability and correctness of the new plugin.
@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a new and powerful capacity-card scheduler plugin for fine-grained management of GPU and other accelerator resources. The changes are extensive, including the core plugin logic, a comprehensive design document, and a full suite of e2e tests. The overall approach is well-designed and aligns with Volcano's architecture. I've identified a couple of potentially high-severity issues related to resource accounting in the multi-card and deallocation logic that could impact scheduling accuracy. Additionally, I've provided some suggestions for minor improvements in documentation, error messages, and test scripts. This is a significant and valuable feature addition to Volcano.

```go
	metrics.UpdateQueueAllocated(
		qAttr.name, qAttr.allocated.MilliCPU, qAttr.allocated.Memory, qAttr.allocated.ScalarResources,
	)
	klog.V(4).Infof(
```

Severity: high

The queue's share is not updated after deallocating resources. This can lead to incorrect fair-share calculations in subsequent scheduling cycles. You should call p.updateShare(qAttr) after updating the allocated resources, similar to how it's done in OnAllocate.

```go
	)
	p.updateShare(qAttr)
```

Comment on lines 110 to 166
```go
multiCardToBeUsedResource := toBeUsedResource.Clone()
// TODO: Support different Pods in the same job using different kind of cards, but a pod using one kind of card.
// now all pods in the same job using the same kind of cards.
for _, cardName := range multiCardNames {
	multiCardToBeUsedResource.ScalarResources[v1.ResourceName(cardName)] += scalarQuant
	if result = CheckSingleScalarResource(
		v1.ResourceName(cardName), scalarQuant, multiCardToBeUsedResource, queueCapability,
	); result.Ok {
		return result
	}
}
```

Severity: high

In the multi-card check, `multiCardToBeUsedResource` is cloned outside the loop and modified within it. This causes the resource modification from checking one card type to be carried over to the check for the next card type, which is incorrect. The check for each card type should be independent. To fix this, you should clone `toBeUsedResource` inside the loop.

Suggested change

```go
// Before:
multiCardToBeUsedResource := toBeUsedResource.Clone()
// TODO: Support different Pods in the same job using different kind of cards, but a pod using one kind of card.
// now all pods in the same job using the same kind of cards.
for _, cardName := range multiCardNames {
	multiCardToBeUsedResource.ScalarResources[v1.ResourceName(cardName)] += scalarQuant
	if result = CheckSingleScalarResource(
		v1.ResourceName(cardName), scalarQuant, multiCardToBeUsedResource, queueCapability,
	); result.Ok {
		return result
	}
}
```

```go
// After:
// TODO: Support different Pods in the same job using different kind of cards, but a pod using one kind of card.
// now all pods in the same job using the same kind of cards.
for _, cardName := range multiCardNames {
	multiCardToBeUsedResource := toBeUsedResource.Clone()
	multiCardToBeUsedResource.ScalarResources[v1.ResourceName(cardName)] += scalarQuant
	if result = CheckSingleScalarResource(
		v1.ResourceName(cardName), scalarQuant, multiCardToBeUsedResource, queueCapability,
	); result.Ok {
		return result
	}
}
```

@lhlxc (Author) replied:

`CheckSingleScalarResource` only checks one card type; other card types have no impact on the checking result.

Comment on lines 15 to 491
1. **Lack of Fine-grained Card Type Management**: Standard Kubernetes resource requests cannot distinguish between different GPU card types (e.g., A100 vs V100) or different GPU sharing profiles (MPS, MIG, etc.).

2. **Insufficient Queue-level Card Quota Control**: Organizations need to allocate specific numbers of different card types to different teams/projects, which cannot be easily achieved with native Kubernetes resource quotas.

3. **Inflexible Multi-Card Selection**: Jobs often can run on multiple types of cards with similar capabilities, but Kubernetes lacks a mechanism to express "this job can use card type A OR card type B".

The Capacity Card plugin addresses these challenges by providing:
- Annotation-based card resource specification for queues and jobs
- Support for multiple card types and sharing modes
- Multi-card selection capability (e.g., "use A100 or H100")
- Integration with Volcano's capacity scheduling framework

## In Scope

- Fine-grained card resource quota management at the queue level
- Job-level card resource request validation before enqueueing
- Task-level card name specification and allocation validation
- Support for MPS (Multi-Process Service) shared GPU resources
- Support for MIG (Multi-Instance GPU) shared GPU resources
- Support for whole card and mixed card/shared resource scenarios
- Multi-card selection support (allowing tasks to specify multiple acceptable card types)
- Automatic card resource discovery from node labels
- CPU/Memory unlimited mode for card resources (optional)

## Out of Scope

- Hierarchical queue card quota management
- Preemption and reclaim are not supported for now

## User Stories

### Story 1: Heterogeneous GPU Cluster Management

As a cluster administrator, I want to manage a cluster with multiple GPU types (A100, H100, V100) and allocate specific card quotas to different teams through queues.

For example:
- Team A (queue-a): 10 A100 cards, 5 H100 cards
- Team B (queue-b): 20 V100 cards, 3 A100 cards

### Story 2: Multi-Card Selection for Job Flexibility

As a data scientist, I want to submit a training job that can run on either A100 or H100 GPUs, whichever is available first, without creating separate job submissions.

### Story 3: Mixed Whole and Shared GPU Scheduling

As a platform engineer, I want to provide both whole GPU cards for large training jobs and MPS/MIG partitioned cards for inference services in the same cluster, with separate quota management.

### Story 4: Queue-level Card Quota Enforcement

As a resource manager, I want to ensure that no team can exceed their allocated card quota, even if cluster capacity is available, to enforce SLA agreements.

## Design Detail

### Architecture Overview

The Capacity Card plugin works by:
1. Discovering card resources from node labels and status
2. Parsing card quotas from queue annotations
3. Validating job card requests against queue card quotas
4. Tracking card resource allocation across jobs and tasks
5. Enforcing allocation limits during scheduling

### Key Concepts

#### Card Resource vs. K8s Resource

- **K8s Resource Name**: The actual resource name in node status and pod requests (e.g., `nvidia.com/gpu`, `nvidia.com/gpu.shared`, `nvidia.com/mig-1g.5gb`)
- **Card Name**: A user-friendly, normalized name for the card type (e.g., `NVIDIA-A100-80GB`, `NVIDIA-A100-80GB/mps-80g*1/8`)

The plugin maintains a mapping between card names and K8s resource names for scheduling decisions.

#### Card Types

1. **Whole Card**: Full GPU card resources (e.g., `nvidia.com/gpu`)
2. **MPS Shared Card**: NVIDIA MPS partitioned GPUs (e.g., `nvidia.com/gpu.shared`)
3. **MIG Shared Card**: NVIDIA MIG partitioned GPUs (e.g., `nvidia.com/mig-1g.5gb`)

#### Multi-Card Request

Tasks can specify multiple acceptable card types separated by `|`:
```
NVIDIA-A100-80GB|NVIDIA-H100-80GB
```

During scheduling, the plugin checks if any of the specified card types has sufficient quota in the queue.

### API Design

#### Queue Annotation for Card Quota

Queues use the annotation `volcano.sh/card.quota` to specify card resource quotas:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-a
  annotations:
    volcano.sh/card.quota: |
      {
        "NVIDIA-A100-80GB": 10,
        "NVIDIA-H100-80GB": 5,
        "NVIDIA-A100-80GB/mps-80g*1/8": 16
      }
spec:
  capability:
    cpu: "100"
    memory: "200Gi"
  guarantee:
    resource:
      cpu: "50"
      memory: "100Gi"
```
**Format**: JSON object mapping card names to counts (integers)
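
To make the annotation contract concrete, here is a minimal sketch of parsing this JSON value in Go; `parseCardQuota` is a hypothetical helper for illustration, not the plugin's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseCardQuota decodes a volcano.sh/card.quota annotation value into a
// card-name -> count map.
func parseCardQuota(annotation string) (map[string]int64, error) {
	quota := map[string]int64{}
	if err := json.Unmarshal([]byte(annotation), &quota); err != nil {
		return nil, fmt.Errorf("invalid card quota annotation: %w", err)
	}
	return quota, nil
}

func main() {
	quota, err := parseCardQuota(`{"NVIDIA-A100-80GB": 10, "NVIDIA-H100-80GB": 5}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(quota["NVIDIA-A100-80GB"]) // 10
}
```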

#### Job Annotation for Card Request

Jobs use the annotation `volcano.sh/card.request` to specify card resource requests for validation:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-job
  annotations:
    volcano.sh/card.request: |
      {
        "NVIDIA-A100-80GB": 8
      }
spec:
  schedulerName: volcano
  queue: queue-a
  minAvailable: 1
  tasks:
  - replicas: 8
    name: worker
    template:
      metadata:
        annotations:
          volcano.sh/card.name: "NVIDIA-A100-80GB"
      spec:
        containers:
        - name: trainer
          image: training:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```

**Purpose**: Pre-validation before job enqueueing to provide fast feedback

#### Task Annotation for Card Name

Tasks/Pods use the annotation `volcano.sh/card.name` to specify the desired card name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  annotations:
    volcano.sh/card.name: "NVIDIA-A100-80GB|NVIDIA-H100-80GB"
spec:
  schedulerName: volcano
  containers:
  - name: trainer
    image: training:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```

**Multi-Card Format**: Use `|` to separate multiple acceptable card types. The scheduler will check quota availability for each type and allocate based on availability.

#### Plugin Configuration

The plugin supports configuration through scheduler config:

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: capacity-card
    arguments:
      cardUnlimitedCpuMemory: true # Optional: if true, card resources don't require CPU/Memory quota
```

**Configuration Options**:
- `cardUnlimitedCpuMemory` (bool, default: false): If set to true, tasks requesting card resources are not checked against queue's CPU/Memory quota limits. Useful when card resources are the primary constraint.

### Node Card Discovery

The plugin automatically discovers card resources from node labels:

#### Label Format

```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    nvidia.com/gpu.product: "NVIDIA-A100-80GB"  # Card product name
    nvidia.com/gpu.count: "8"                   # Number of cards
    nvidia.com/gpu.memory: "81920"              # Memory per card in MB
    nvidia.com/gpu.replicas: "8"                # For MPS: number of replicas
    nvidia.com/mig-1g.5gb.count: "7"            # For MIG: count of this profile
status:
  allocatable:
    nvidia.com/gpu: "8"          # Whole card resource
    nvidia.com/gpu.shared: "64"  # MPS shared resource
    nvidia.com/mig-1g.5gb: "7"   # MIG partition resource
```

#### Card Name Generation

- **Whole Card**: Uses the value from the `<prefix>/gpu.product` label
  - Example: `NVIDIA-A100-80GB`

- **MPS Card**: Generated as `<card-product>/mps-<memory>g*1/<replicas>`
  - Example: `NVIDIA-A100-80GB/mps-80g*1/8`

- **MIG Card**: Generated as `<card-product>/mig-<profile>-mixed`
  - Example: `NVIDIA-A100-80GB/mig-1g.5gb-mixed`
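
A minimal sketch of these naming rules in Go, assuming the label values have already been read from the node; `mpsCardName` and `migCardName` are hypothetical helpers, not the plugin's real identifiers:

```go
package main

import "fmt"

// mpsCardName builds "<card-product>/mps-<memory>g*1/<replicas>", converting
// the per-card memory label value from MB to GB.
func mpsCardName(product string, memoryMB, replicas int) string {
	return fmt.Sprintf("%s/mps-%dg*1/%d", product, memoryMB/1024, replicas)
}

// migCardName builds "<card-product>/mig-<profile>-mixed".
func migCardName(product, profile string) string {
	return fmt.Sprintf("%s/mig-%s-mixed", product, profile)
}

func main() {
	fmt.Println(mpsCardName("NVIDIA-A100-80GB", 81920, 8)) // NVIDIA-A100-80GB/mps-80g*1/8
	fmt.Println(migCardName("NVIDIA-A100-80GB", "1g.5gb")) // NVIDIA-A100-80GB/mig-1g.5gb-mixed
}
```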

### Main Process

#### Plugin Initialization (OnSessionOpen)

1. **Build Total Resource**:
   - List all nodes from the informer
   - Extract card information from node labels
   - Parse card resources from node status
   - Build mapping: card name → K8s resource name
   - Calculate cluster total resources (CPU, Memory, Cards)

2. **Build Queue Attributes**:
   - Parse card quotas from queue annotations (`volcano.sh/card.quota`)
   - Calculate queue capability, guarantee, and deserved resources
   - Track allocated, inqueue, and elastic resources per queue

3. **Register Scheduling Functions**:
   - `JobEnqueueableFn`: Pre-check job card requests against queue quota
   - `AllocatableFn`: Validate task card allocation against queue quota
   - `AllocateFunc` / `DeallocateFunc`: Update queue resource tracking

#### Job Enqueueable Check

When a job is submitted:

1. Parse the job's card request from the annotation (`volcano.sh/card.request`)
2. Calculate total resources to be used: `allocated + inqueue + job.minResources - elastic` (see the sketch below)
3. Check CPU/Memory quota (unless `cardUnlimitedCpuMemory` is enabled)
4. Check card resource quota:
   - For each card type requested
   - If multi-card request (contains `|`), check each alternative
   - Verify: `totalToBeUsed[cardType] <= queueCapability[cardType]`
5. If all checks pass, mark job as InQueue and reserve resources
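
As referenced in step 2, here is a minimal sketch of this arithmetic, using plain maps of milli-unit quantities (1 card = 1000) in place of Volcano's Resource type; `jobEnqueueable` is a hypothetical name, not the plugin's actual function:

```go
package main

import "fmt"

// jobEnqueueable applies the check from steps 2 and 4: for each requested
// card type, allocated + inqueue + minResources - elastic must stay within
// the queue capability.
func jobEnqueueable(allocated, inqueue, minReq, elastic, capability map[string]int64) (bool, string) {
	for card, req := range minReq {
		toBeUsed := allocated[card] + inqueue[card] + req - elastic[card]
		if toBeUsed > capability[card] {
			return false, card
		}
	}
	return true, ""
}

func main() {
	ok, card := jobEnqueueable(
		map[string]int64{"NVIDIA-A100-80GB": 2000}, // 2 cards already allocated
		map[string]int64{},                         // nothing inqueue
		map[string]int64{"NVIDIA-A100-80GB": 4000}, // job asks for 4 cards
		map[string]int64{},                         // no elastic resources
		map[string]int64{"NVIDIA-A100-80GB": 5000}, // queue quota: 5 cards
	)
	fmt.Println(ok, card) // false NVIDIA-A100-80GB (2 + 4 > 5)
}
```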

#### Task Allocatable Check

When scheduling a task:

1. Parse the task's card request from the annotation (`volcano.sh/card.name`)
2. Extract the card resource from the pod's resource requests
3. Calculate total resources to be allocated: `allocated + task.request`
4. Check CPU/Memory quota (unless `cardUnlimitedCpuMemory` is enabled)
5. Check card resource quota:
   - Support multi-card selection (e.g., `A100|H100`)
   - For multi-card, check each option and succeed if any passes
   - Verify: `totalToBeAllocated[cardType] <= queueCapability[cardType]`
6. If checks fail, emit Kubernetes events on the pod with the failure reason

#### Resource Tracking

The plugin maintains real-time resource tracking:

- **On Allocate**: Add task resources to `queue.allocated`
- **On Deallocate**: Subtract task resources from `queue.allocated`
- **Queue Share Calculation**: `share = max(allocated[resource] / deserved[resource])`
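
A small sketch of the share formula, with plain maps standing in for Volcano's Resource type; `queueShare` is an illustrative name only:

```go
package main

import "fmt"

// queueShare computes share = max(allocated[res] / deserved[res]) across
// resources, per the formula above.
func queueShare(allocated, deserved map[string]float64) float64 {
	share := 0.0
	for res, d := range deserved {
		if d <= 0 {
			continue // resources with no deserved amount do not contribute
		}
		if r := allocated[res] / d; r > share {
			share = r
		}
	}
	return share
}

func main() {
	share := queueShare(
		map[string]float64{"cpu": 30, "NVIDIA-A100-80GB": 4},
		map[string]float64{"cpu": 100, "NVIDIA-A100-80GB": 5},
	)
	fmt.Println(share) // 0.8: the A100 ratio dominates the CPU ratio of 0.3
}
```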

### Implementation Details

#### Card Resource Quantification

Card resources are stored as scalar resources in milli-units (multiplied by 1000):
- 2 cards → 2000 in scalar resources
- This aligns with Volcano's internal resource representation

#### Multi-Card Request Processing

For a multi-card request like `A100|H100|V100`:

1. Split by the `|` separator
2. For each card type in the list (see the sketch below):
   - Clone `toBeUsedResource`
   - Add the requested quantity to that card name's entry
   - Check if `toBeUsedResource[cardType] <= queueCapability[cardType]`
   - If any card type passes, return success
3. If all fail, return an error naming the multi-card request
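
A simplified sketch of this loop, reflecting the clone-per-alternative point discussed in the review thread above; since the check only touches the candidate card's entry, reading that single entry is equivalent to cloning the whole resource object. `checkMultiCard` and its types are hypothetical simplifications:

```go
package main

import (
	"fmt"
	"strings"
)

// checkMultiCard tries each alternative in an "A|B|C" request independently:
// the trial usage for one card type must not leak into the next.
func checkMultiCard(request string, quant int64, toBeUsed, capability map[string]int64) (string, bool) {
	for _, card := range strings.Split(request, "|") {
		// Each alternative is checked against its own trial value, so a
		// failed earlier alternative leaves no residue.
		if toBeUsed[card]+quant <= capability[card] {
			return card, true
		}
	}
	return "", false
}

func main() {
	toBeUsed := map[string]int64{"NVIDIA-A100-80GB": 5000}
	capability := map[string]int64{"NVIDIA-A100-80GB": 5000, "NVIDIA-H100-80GB": 3000}
	card, ok := checkMultiCard("NVIDIA-A100-80GB|NVIDIA-H100-80GB", 1000, toBeUsed, capability)
	fmt.Println(card, ok) // NVIDIA-H100-80GB true: the A100 quota is exhausted
}
```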

#### Event Recording

The plugin emits Kubernetes events for:
- `GetTaskRequestResourceFailed`: Failed to parse task resource request
- `EmptyQueueCapability`: Queue has no capability configured
- `InsufficientCPUQuota`: Insufficient CPU quota in queue
- `InsufficientMemoryQuota`: Insufficient memory quota in queue
- `InsufficientScalarQuota`: Insufficient card/scalar quota in queue
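
For illustration, an event such as `InsufficientScalarQuota` could be emitted with client-go's `record.EventRecorder` roughly as below; the function name and message text are assumptions, not the plugin's actual code.

```go
package capacitycard

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// emitInsufficientScalarQuota records a warning event on the pod whose card
// quota check failed, so users see the rejection reason via kubectl describe.
func emitInsufficientScalarQuota(recorder record.EventRecorder, pod *v1.Pod, card string) {
	recorder.Eventf(pod, v1.EventTypeWarning, "InsufficientScalarQuota",
		"insufficient quota for card <%s> in queue", card)
}
```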

### Integration with Capacity Scheduling

The Capacity Card plugin builds upon Volcano's capacity plugin concepts:

- **Capability**: Maximum card resources a queue can use
- **Guarantee**: Reserved card resources not shared with other queues
- **Deserved**: Target allocation for fair sharing and reclaim

However, unlike the standard capacity plugin, card resources are specified via annotations rather than the Queue's ResourceList fields, allowing more flexible card type specification.

### Metrics and Observability

The plugin exports Prometheus metrics for queue resource tracking:
- `volcano_queue_card_deserved`: Deserved card resources per queue
- `volcano_queue_card_allocated`: Currently allocated card resources per queue
- `volcano_queue_card_request`: Requested card resources per queue
- `volcano_queue_card_capacity`: Card capacity per queue

## Example Scenarios

### Example 1: Basic Card Quota

**Cluster Setup**:
- 2 nodes with 4 A100 cards each (total: 8 A100)

**Queue Configuration**:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
  annotations:
    volcano.sh/card.quota: '{"NVIDIA-A100-80GB": 5}'
spec:
  capability:
    cpu: "100"
    memory: "500Gi"
```

**Job Submission**:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training
  annotations:
    volcano.sh/card.request: '{"NVIDIA-A100-80GB": 4}'
spec:
  queue: team-a
  minAvailable: 4
  tasks:
  - replicas: 4
    template:
      metadata:
        annotations:
          volcano.sh/card.name: "NVIDIA-A100-80GB"
      spec:
        containers:
        - name: worker
          resources:
            limits:
              nvidia.com/gpu: 1
```

**Result**: Job successfully enqueued (4 ≤ 5) and tasks scheduled.

### Example 2: Multi-Card Selection

**Job Submission**:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: flexible-training
  annotations:
    volcano.sh/card.request: '{"NVIDIA-A100-80GB|NVIDIA-H100-80GB": 4}'
spec:
  queue: team-a
  minAvailable: 1
  tasks:
  - replicas: 4
    template:
      metadata:
        annotations:
          volcano.sh/card.name: "NVIDIA-A100-80GB|NVIDIA-H100-80GB"
      spec:
        containers:
        - name: worker
          resources:
            limits:
              nvidia.com/gpu: 1
```

**Result**: The scheduler will try to allocate A100 first; if quota exhausted, tries H100.

### Example 3: MPS Shared GPU

**Node Labels**:
```yaml
nvidia.com/gpu.product: "NVIDIA-A100-80GB"
nvidia.com/gpu.count: "4"
nvidia.com/gpu.memory: "81920"
nvidia.com/gpu.replicas: "8"
```

**Node Status**:
```yaml
status:
  allocatable:
    nvidia.com/gpu.shared: "32" # 4 cards × 8 replicas
```

**Queue Configuration**:
```yaml
metadata:
  annotations:
    volcano.sh/card.quota: '{"NVIDIA-A100-80GB/mps-80g*1/8": 32}'
```

**Job Submission**:
```yaml
metadata:
  annotations:
    volcano.sh/card.request: '{"NVIDIA-A100-80GB/mps-80g*1/8": 16}'
spec:
  tasks:
  - replicas: 16
    template:
      metadata:
        annotations:
          volcano.sh/card.name: "NVIDIA-A100-80GB/mps-80g*1/8"
      spec:
        containers:
        - resources:
            limits:
              nvidia.com/gpu.shared: 1
```

**Result**: 16 inference pods share the 4 A100 GPUs via MPS.

## Notes

1. **Plugin Compatibility**: The Capacity Card plugin is designed to work alongside other Volcano plugins (gang, priority, etc.). It should not be enabled simultaneously with the standard `capacity` or `proportion` plugin to avoid conflicts.

2. **Card Discovery Requirements**: Node labels must be properly configured (typically by GPU operators like NVIDIA GPU Operator) for card discovery to work correctly.

3. **Annotation-based Design**: The choice of annotations over native Kubernetes ResourceList allows for:
   - More flexible naming conventions
   - Support for multi-card selection syntax
   - Easier evolution without API changes

4. **Multi-Card Scheduling**: The current implementation checks quota for multi-card requests but enforces that all tasks in a job use the same card type. Future enhancements may add more flexible controls.

5. **Performance Considerations**: The plugin caches node card information to minimize overhead during scheduling cycles.

## Future Work

- Node-level card selection ordering function
- Support different pods in the same job using different kinds of cards
- Support preemption and reclaim

## References

- [Volcano Capacity Scheduling Design](./capacity-scheduling.md)
- [NVIDIA MPS Documentation](https://docs.nvidia.com/deploy/mps/index.html)
- [NVIDIA MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)
- [Volcano Scheduler Framework](https://volcano.sh/en/docs/schduler_introduction/)

Severity: medium

This is a great and detailed design document. I've found a few minor typos that you might want to fix:

  • Line 15: profile should be profiles.
  • Line 483: kind should be kinds.
  • Line 484: Preemption should be preemption.
  • Line 491: schduler should be scheduler.

Comment on lines 148 to 159
```bash
# check if kind installed
function check-fake-gpu-operator {
  echo "Checking fake gpu operator"
  # retry get gpu resources on nodes
  for true; do
    kubectl get nodes -l nvidia.com/gpu.present=true | tail -n +2 | wc -l | grep 3
    if [[ $? -eq 0 ]]; then
      break
    fi
    sleep 1
  done
}
```

Severity: medium

The comment for `check-fake-gpu-operator` is misleading. It says `# check if kind installed`, but the function checks for the fake GPU operator. Also, the `for true; do` loop is unconventional. It's more common and readable to use `while true; do` for an infinite loop. The check itself is also a bit brittle. Consider updating the comment, the loop structure, and the check for clarity and robustness.

Suggested change

```bash
# Before:
# check if kind installed
function check-fake-gpu-operator {
  echo "Checking fake gpu operator"
  # retry get gpu resources on nodes
  for true; do
    kubectl get nodes -l nvidia.com/gpu.present=true | tail -n +2 | wc -l | grep 3
    if [[ $? -eq 0 ]]; then
      break
    fi
    sleep 1
  done
}
```

```bash
# After:
# check if fake gpu operator is ready
function check-fake-gpu-operator {
  echo "Checking fake gpu operator"
  # retry get gpu resources on nodes
  while true; do
    if [[ $(kubectl get nodes -l nvidia.com/gpu.present=true --no-headers | wc -l) -eq 3 ]]; then
      break
    fi
    sleep 1
  done
}
```

Comment on lines 226 to 227
"no resource <%s> defined in reqests/limits for card <%s>",
cardResourceName, cardName,

Severity: medium

There is a typo in the error message: `reqests` should be `requests`.

Suggested change

```go
// Before:
"no resource <%s> defined in reqests/limits for card <%s>",
cardResourceName, cardName,
```

```go
// After:
"no resource <%s> defined in requests/limits for card <%s>",
cardResourceName, cardName,
```

@JesseStutler (Member)

Hi @lhlxc, thanks for your contribution. I'd like to ask: what's the difference between your plugin and Volcano's current deviceshare plugin?

@lhlxc (Author) commented Oct 31, 2025

> Hi @lhlxc, thanks for your contribution. I'd like to ask: what's the difference between your plugin and Volcano's current deviceshare plugin?

Thank you for the question! These two plugins serve complementary but different purposes in the scheduling pipeline:
The deviceshare plugin focuses on node-level device allocation, primarily handling node-level device filtering and fine-grained GPU memory/partition allocation.
The capacity-card plugin focuses on queue-level card type quota management, primarily enforcing quotas for heterogeneous card types.
For example, different card types (A100, H100, V100) share the same Kubernetes resource name `nvidia.com/gpu`, but we want to enforce quotas for each card type separately at the queue level. The deviceshare plugin cannot handle this scenario.

The scheduling process of a job is as follows.

Scheduling Pipeline:

```
Job Submission
    ↓
[capacity-card: Job Enqueueable Check]
    ├─ Check: Does queue have enough A100/H100 quota?
    ├─ Rejects: Jobs exceeding queue card limits
    └─ Reserves: Card resources at queue level
    ↓
Job Enqueued
    ↓
Task Scheduling
    ↓
[capacity-card: Task Allocatable Check]
    ├─ Check: Will allocating this task exceed queue card quota?
    └─ Validates: Card type against queue capacity
    ↓
[deviceshare: Node Predicate Check]
    ├─ Check: Does this node have the requested device?
    ├─ Check: Does this node have enough GPU memory available?
    └─ Allocates: Specific GPU device and memory slice
    ↓
Pod Bound to Node
```

@lhlxc lhlxc force-pushed the feature/capacity-card-plugin branch 8 times, most recently from 7c789f5 to 518b975 Compare November 1, 2025 05:38
@JesseStutler (Member) commented Nov 3, 2025

Could you share your proposal at this week's Volcano community meeting? (Using Mandarin is fine) @lhlxc

@lhlxc (Author) commented Nov 4, 2025

> Could you share your proposal at this week's Volcano community meeting? (Using Mandarin is fine)

Yes, I'd be happy to share the proposal at this week's community meeting.

Could you please let me know:

  • The meeting time and access link
  • How long should I prepare for the presentation
  • Any specific aspects you'd like me to focus on

Thank you!

@lhlxc lhlxc force-pushed the feature/capacity-card-plugin branch from 518b975 to b913110 Compare November 4, 2025 03:26
@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wpeng102 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JesseStutler (Member)

> Could you share your proposal at this week's Volcano community meeting? (Using Mandarin is fine)
>
> Yes, I'd be happy to share the proposal at this week's community meeting.
>
> Could you please let me know:
>
>   • The meeting time and access link
>   • How long should I prepare for the presentation
>   • Any specific aspects you'd like me to focus on
>
> Thank you!

  • The meeting time and access link: The meeting will be held on Nov 7th at 15:00 (UTC+8), which is this Friday
  • How long should I prepare for the presentation: Whatever you like, I think; probably about 30 minutes would be better

https://docs.google.com/document/d/1YLbF8zjZBiR9PbXQPB22iuc_L0Oui5A1lddVfRnZrqs/edit?tab=t.0 , I have added your agenda to this doc, the meeting link is https://zoom.us/j/91804791393, please download Zoom first

@lhlxc (Author) commented Nov 5, 2025

> Could you share your proposal at this week's Volcano community meeting? (Using Mandarin is fine)
>
> Yes, I'd be happy to share the proposal at this week's community meeting.
> Could you please let me know:
>
>   • The meeting time and access link
>   • How long should I prepare for the presentation
>   • Any specific aspects you'd like me to focus on
>
> Thank you!
>
>   • The meeting time and access link: The meeting will be held on Nov 7th at 15:00 (UTC+8), which is this Friday
>   • How long should I prepare for the presentation: Whatever you like, I think; probably about 30 minutes would be better
>
> https://docs.google.com/document/d/1YLbF8zjZBiR9PbXQPB22iuc_L0Oui5A1lddVfRnZrqs/edit?tab=t.0 , I have added your agenda to this doc, the meeting link is https://zoom.us/j/91804791393, please download Zoom first

Thank you for the meeting invitation and all the details!

I need to complete my company's internal approval process first, and unfortunately it cannot be finished by this Friday (November 7th).

Could we reschedule the meeting to a later date? I will inform you as soon as my approval is completed, and then we can confirm a new meeting time.

I sincerely apologize for any inconvenience this may cause and truly appreciate your understanding.

Best regards

@JesseStutler (Member)

> Could you share your proposal at this week's Volcano community meeting? (Using Mandarin is fine)
>
> Yes, I'd be happy to share the proposal at this week's community meeting.
> Could you please let me know:
>
>   • The meeting time and access link
>   • How long should I prepare for the presentation
>   • Any specific aspects you'd like me to focus on
>
> Thank you!
>
>   • The meeting time and access link: The meeting will be held on Nov 7th at 15:00 (UTC+8), which is this Friday
>   • How long should I prepare for the presentation: Whatever you like, I think; probably about 30 minutes would be better
>
> https://docs.google.com/document/d/1YLbF8zjZBiR9PbXQPB22iuc_L0Oui5A1lddVfRnZrqs/edit?tab=t.0 , I have added your agenda to this doc, the meeting link is https://zoom.us/j/91804791393, please download Zoom first
>
> Thank you for the meeting invitation and all the details!
>
> I need to complete my company's internal approval process first, and unfortunately it cannot be finished by this Friday (November 7th).
>
> Could we reschedule the meeting to a later date? I will inform you as soon as my approval is completed, and then we can confirm a new meeting time.
>
> I sincerely apologize for any inconvenience this may cause and truly appreciate your understanding.
>
> Best regards

OK, no problem, we can rearrange it to Nov 21st if you're ready

… and other accelerator card resource management and scheduling capabilities in heterogeneous computing clusters
@lhlxc lhlxc force-pushed the feature/capacity-card-plugin branch from b913110 to 61a5f00 Compare November 8, 2025 07:27
@JesseStutler (Member)

@lhlxc Hi, do you have time to participate in tomorrow's community meeting? The time will be 15:00 (UTC+8); the meeting link is https://zoom.us/j/91804791393

@JesseStutler (Member)

Also /cc @archlitchi

@lhlxc (Author) commented Dec 2, 2025

> @lhlxc Hi, do you have time to participate in tomorrow's community meeting? The time will be 15:00 (UTC+8); the meeting link is https://zoom.us/j/91804791393

Sorry for the delayed response. I'd be happy to join the next community meeting; please let me know the schedule for the next meeting.

@JesseStutler (Member) commented Dec 3, 2025

> @lhlxc Hi, do you have time to participate in tomorrow's community meeting? The time will be 15:00 (UTC+8); the meeting link is https://zoom.us/j/91804791393
>
> Sorry for the delayed response. I'd be happy to join the next community meeting; please let me know the schedule for the next meeting.

The next meeting will be on Dec 5th at 15:00 (UTC+8), this Friday. We can establish a connection first; do you have WeChat or Slack? @lhlxc

@lhlxc (Author) commented Dec 3, 2025

> @lhlxc Hi, do you have time to participate in tomorrow's community meeting? The time will be 15:00 (UTC+8); the meeting link is https://zoom.us/j/91804791393
>
> Sorry for the delayed response. I'd be happy to join the next community meeting; please let me know the schedule for the next meeting.
>
> The next meeting will be on Dec 5th at 15:00 (UTC+8), this Friday. We can establish a connection first; do you have WeChat or Slack? @lhlxc

We can establish a connection on WeChat.

@JesseStutler (Member)

> @lhlxc Hi, do you have time to participate in tomorrow's community meeting? The time will be 15:00 (UTC+8); the meeting link is https://zoom.us/j/91804791393
>
> Sorry for the delayed response. I'd be happy to join the next community meeting; please let me know the schedule for the next meeting.
>
> The next meeting will be on Dec 5th at 15:00 (UTC+8), this Friday. We can establish a connection first; do you have WeChat or Slack? @lhlxc
>
> We can establish a connection on WeChat.

@lhlxc What's your email address? I saw that you didn't sign off your commit. Or could you send me your WeChat ID through my email: jesseincomparable@hotmail.com? We're going to hold the Asia-friendly meeting today 15:00-16:00 (UTC+8); do you have time to join and share your feature?

@lhlxc (Author) commented Dec 5, 2025

> What's your email address? I saw that you didn't sign off your commit. Or could you send me your WeChat ID through my email: jesseincomparable@hotmail.com? We're going to hold the Asia-friendly meeting today 15:00-16:00 (UTC+8); do you have time to join and share your feature?

I’ve just replied to your email with my WeChat ID included in the message body (sent to jesseincomparable@hotmail.com, my email address is linhailisc@163.com).
Also, I’d like to confirm that I’m available and happy to join today’s Asia-friendly meeting from 15:00–16:00 (UTC+8). Looking forward to presenting my feature!

@JesseStutler (Member)

> What's your email address? I saw that you didn't sign off your commit. Or could you send me your WeChat ID through my email: jesseincomparable@hotmail.com? We're going to hold the Asia-friendly meeting today 15:00-16:00 (UTC+8); do you have time to join and share your feature?
>
> I've just replied to your email with my WeChat ID included in the message body (sent to jesseincomparable@hotmail.com, my email address is linhailisc@163.com). Also, I'd like to confirm that I'm available and happy to join today's Asia-friendly meeting from 15:00–16:00 (UTC+8). Looking forward to presenting my feature!

OK, I will register your topic. Welcome to share your feature in today's meeting.
