
Conversation

@peachest
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:

The current HAMi implementation lacks sufficient support for multi-replica high availability.

Whilst the kube-scheduler implements leader election via Kubernetes' built-in leader election mechanism, the vgpu-scheduler-extender does not. Deploying multiple replicas directly can therefore run into the following issues:

  1. Multiple instances repeatedly initiating handshakes with the device plugins on each node
  2. Multiple instances duplicating metric collection (see #1432, "Fix: Add leader check for metrics collect correctly")

Special notes for your reviewer:

The holderIdentity field is used for identification within the Lease. The kube-scheduler constructs the holderIdentity using the hostname. Taking version 1.34.2 as an example:

	// add a uniquifier so that two processes on the same host don't accidentally both become active
	id := hostname + "_" + string(uuid.NewUUID())

This PR determines whether the current vgpu-scheduler-extender container belongs to the leader instance by fetching the lease and checking whether the current Pod's hostname is a prefix of Lease.Spec.holderIdentity.
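The hostname-prefix check described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and the sample identity are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// isHolderOf reports whether the lease holder identity was produced by a
// process on the given host. kube-scheduler builds the identity as
// hostname + "_" + uuid, so matching on the "hostname_" prefix suffices;
// the trailing underscore prevents "node-1" from matching "node-10_...".
func isHolderOf(holderIdentity, hostname string) bool {
	return strings.HasPrefix(holderIdentity, hostname+"_")
}

func main() {
	holder := "hami-scheduler-0_9f2c1d3e-5a6b-4c7d-8e9f-0a1b2c3d4e5f"
	fmt.Println(isHolderOf(holder, "hami-scheduler-0")) // true
	fmt.Println(isHolderOf(holder, "hami-scheduler-1")) // false
}
```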

  • Before each RegisterFromNodeAnnotation run, the scheduler first checks whether it is the leader, ensuring only the leader instance initiates handshakes with nodes (and their device plugins).
  • When an instance is elected leader it is notified via a channel and immediately executes RegisterFromNodeAnnotation, even if the scheduled interval has not yet elapsed.
  • The container's readinessProbe ensures only the leader instance joins the hami-scheduler Service endpoints, preventing duplicate metric collection by other instances. This also affects webhooks: only the leader instance's webhook executes.

The hami-scheduler-newversion configuration employs SchedulerConfig to ensure that the kube-scheduler within each Pod exclusively accesses the local vgpu-scheduler-extender. Consequently, there is no concern regarding interference with calls to the Filter/ and Bind/ endpoints.

Does this PR introduce a user-facing change?: No

houyuxi added 9 commits December 22, 2025 20:10
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
1. update notify channel with buffer of size 1 in leaderManager
2. init notify chan for dummy

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@hami-robot
Contributor

hami-robot bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peachest
Once this PR has been reviewed and has the lgtm label, please assign shouren for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot removed the kind/feature new function label Dec 22, 2025
@hami-robot hami-robot bot added the size/L label Dec 22, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @peachest, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the high availability of the hami-scheduler by integrating a Kubernetes-native leader election mechanism. This ensures that in a multi-replica setup, only a single instance of the vgpu-scheduler-extender acts as the primary, preventing conflicts like redundant device plugin handshakes and duplicate metric collection. The changes include new command-line flags, a dedicated leader election utility, and a readiness probe to manage service endpoint exposure based on leadership status, making the scheduler more resilient and efficient.

Highlights

  • Leader Election Implementation: Introduced a robust leader election mechanism for the vgpu-scheduler-extender using Kubernetes Lease objects to ensure high availability.
  • Problem Resolution: Addresses critical issues in multi-replica deployments, such as preventing duplicate handshakes with device plugins and redundant metric collection by ensuring only the leader performs these actions.
  • Dynamic Leadership Notification: Implemented a channel-based notification system to immediately inform an instance upon its election as leader, allowing it to promptly execute critical initialization tasks like RegisterFromNodeAnnotation.
  • Readiness Probe Integration: Added a readiness probe (/readyz endpoint) to the scheduler's deployment, ensuring that only the active leader instance is exposed via the hami-scheduler Service endpoints, thereby preventing non-leader instances from receiving traffic for webhooks and metrics.
  • Configuration Options: Exposed new Helm chart values (leader-elect, leader-elect-resource-name, leader-elect-resource-namespace) to enable and configure the leader election feature.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a master-slave high availability mechanism for the vgpu-scheduler-extender using Kubernetes' leader election. The changes are comprehensive, adding new CLI flags, a readiness probe, and a new leaderelection utility package to observe lease objects and determine leadership. The approach of piggybacking on kube-scheduler's leader election lease is clever.

However, I've found a few issues that need to be addressed. There's a critical compilation error in the new leaderelection package due to an incorrect interface implementation. Additionally, there are a couple of bugs related to configuration in the Helm chart and scheduler initialization. I've also included suggestions to improve the robustness of the leader election logic by checking for lease validity and using non-blocking channel sends.

Comment on lines +138 to +152
func (m *leaderManager) isLeaseValid(now time.Time) bool {
	return m.observedTime.Add(time.Second * time.Duration(*m.observedLease.Spec.LeaseDurationSeconds)).After(now)
}

func (m *leaderManager) IsLeader() bool {
	m.leaseLock.RLock()
	defer m.leaseLock.RUnlock()

	if m.observedLease == nil {
		return false
	}

	// TODO: should we check valid lease here?
	return m.isHolderOf(m.observedLease)
}

high

The function isLeaseValid is currently unused and is susceptible to a nil-pointer dereference if observedLease.Spec.LeaseDurationSeconds is nil. The IsLeader function includes a TODO to check for lease validity, which is a crucial check for correctness. Without it, the manager might incorrectly report being the leader if lease-related watch events are delayed.

I recommend making isLeaseValid nil-safe and integrating it into IsLeader to ensure the lease is not expired.

func (m *leaderManager) isLeaseValid(now time.Time) bool {
	if m.observedLease == nil || m.observedLease.Spec.LeaseDurationSeconds == nil {
		return false
	}
	return m.observedTime.Add(time.Second * time.Duration(*m.observedLease.Spec.LeaseDurationSeconds)).After(now)
}

func (m *leaderManager) IsLeader() bool {
	m.leaseLock.RLock()
	defer m.leaseLock.RUnlock()

	if m.observedLease == nil {
		return false
	}

	if !m.isLeaseValid(time.Now()) {
		return false
	}

	return m.isHolderOf(m.observedLease)
}

	m.setObservedRecord(lease)
	// Notify if we are the leader from the very beginning
	if m.isHolderOf(lease) {
		m.leaderNotify <- struct{}{}

medium

The send to m.leaderNotify is a blocking operation. Since leaderNotify is a buffered channel of size 1, if a new leadership event occurs before the consumer has processed the previous one, this send will block indefinitely, potentially deadlocking the event handler goroutine. Using a non-blocking send would make this more robust.

	select {
	case m.leaderNotify <- struct{}{}:
	default:
	}

	}

	if !m.isHolderOf(oldLease) {
		m.leaderNotify <- struct{}{}

medium

The send to m.leaderNotify is a blocking operation. Since leaderNotify is a buffered channel of size 1, if a new leadership event occurs before the consumer has processed the previous one, this send will block indefinitely, potentially deadlocking the event handler goroutine. Using a non-blocking send would make this more robust.

	select {
	case m.leaderNotify <- struct{}{}:
	default:
	}

@peachest peachest changed the title Implementing master-slave high availability using leader-election Implementing leader-follower high availability using leader-election Dec 23, 2025
houyuxi added 2 commits December 23, 2025 14:27
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
…cheduler.admissionWebhook.enabled`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
houyuxi added 5 commits December 23, 2025 14:27
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
… correct call when failed

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@codecov

codecov bot commented Dec 24, 2025

Codecov Report

❌ Patch coverage is 68.31683% with 32 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/scheduler/scheduler.go 21.42% 9 Missing and 2 partials ⚠️
pkg/util/leaderelection/leaderelection.go 85.71% 7 Missing and 4 partials ⚠️
pkg/scheduler/routes/route.go 0.00% 10 Missing ⚠️
Flag Coverage Δ
unittests 51.25% <68.31%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
pkg/scheduler/config/config.go 78.45% <ø> (ø)
pkg/scheduler/routes/route.go 0.00% <0.00%> (ø)
pkg/scheduler/scheduler.go 50.96% <21.42%> (-0.41%) ⬇️
pkg/util/leaderelection/leaderelection.go 85.71% <85.71%> (ø)

... and 7 files with indirect coverage changes


Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
	// TODO: maybe we should lock node when we are doing register.
	// Only do registration when we are leader
	if !s.leaderManager.IsLeader() {
		continue
Collaborator

@peachest I think it is too late to do registration only once we are leader. When the kube-scheduler in the same Pod becomes the leader and sends a filter request to this extender, there is no guarantee that the registration will be done before handling that request.

Contributor

@Shouren I think this check is for addressing

Multiple instances repeatedly initiating handshakes with device plugins on nodes

But #1499 refines the handshake logic, so I'm not sure whether the new sync logic still has this problem. If not, we don't need this section any more.

And for the requests, leader control is delegated to the kube-scheduler: only the leader kube-scheduler passes requests to its extender, so the extender doesn't need to care about that. As the description says:

The holderIdentity field is used for identification within the Lease. The kube-scheduler constructs the holderIdentity using the hostname. Taking version 1.34.2 as an example:

// add a uniquifier so that two processes on the same host don't accidentally both become active
id := hostname + "_" + string(uuid.NewUUID())

This PR determines whether the current vgpu-scheduler-extender container belongs to the leader instance by acquiring the lease and checking whether the current Pod's hostname prefix exists in Lease.Spec.holderIdentity

Contributor Author

@peachest peachest Dec 25, 2025

I think what @Shouren means is that we should make sure RegisterFromNodeAnnotation finishes at least once before the Filter/ endpoint handles a request.

I am trying to implement this by adding a synced flag and simply waiting for syncing before filtering. It seems reasonable that when RegisterFromNodeAnnotation finishes we set synced to true, and set it back to false when we lose leadership, since we won't do any registration from then on.

Is there any better idea?
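A minimal sketch of that synced-flag idea (every name here is hypothetical, not the repository's API): set the flag after a successful registration, clear it on lost leadership, and gate Filter/ on it:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// syncState tracks whether this instance has completed at least one
// RegisterFromNodeAnnotations pass since gaining leadership.
type syncState struct{ synced atomic.Bool }

// onRegisterDone is called after a registration pass succeeds.
func (s *syncState) onRegisterDone() { s.synced.Store(true) }

// onLostLeadership resets the flag, since a follower performs no
// further registrations and its node view may go stale.
func (s *syncState) onLostLeadership() { s.synced.Store(false) }

// readyToFilter gates the Filter/ endpoint until the first registration
// has landed, so filter requests never see an unpopulated nodeManager.
func (s *syncState) readyToFilter() bool { return s.synced.Load() }

func main() {
	var s syncState
	fmt.Println(s.readyToFilter()) // false: before first registration
	s.onRegisterDone()
	fmt.Println(s.readyToFilter()) // true: after registration
	s.onLostLeadership()
	fmt.Println(s.readyToFilter()) // false: after losing leadership
}
```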

Collaborator

@Shouren Shouren Dec 25, 2025

@FouoF @peachest Let me explain it in detail, the check is achieved by following line

health, needUpdate := devInstance.CheckHealth(devhandsk, val)

and the implementation of CheckHealth differs for each devInstance. #1499 refactors the implementation of CheckHealth for NvidiaGPUDevices so that it no longer calls device.CheckHealth, but four devInstances (kunlun, hygon, iluvatar & ascend) still call device.CheckHealth, so we need this check.

However, the RegisterFromNodeAnnotations function will execute the following line

s.addNode(val.Name, nodeInfo)

to add the node's information to the scheduler's nodeManager when the scheduler initializes or when a node needs to be updated. My concern is that the current implementation skips this step before the kube-scheduler becomes the leader, and it might be too late when a filter request arrives but the nodes in that request have not yet been added to nodeManager.

Perhaps we need a new variable to indicate whether the handshake is required, but I'm not sure if it will bring new issues.

Collaborator

Is there any better idea?

@peachest Maybe we can add a new parameter to device.CheckHealth to control the handshake behavior, so that only the leader (let's put the fencing issue aside for now) can patch the node's annotations while the rest of the code keeps working as before. But I have to check the code to see if there are any conflicts with this implementation.
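As a rough illustration of that suggestion (every name and signature below is hypothetical, not HAMi's actual CheckHealth API), the extra parameter would let followers evaluate device health without performing the annotation-patching handshake:

```go
package main

import "fmt"

// Device is a stand-in for a devInstance's device state.
type Device struct{ healthy bool }

// CheckHealth evaluates health; doHandshake controls whether the
// node-annotation handshake side effect runs. The idea is that only the
// leader passes true, so followers never patch node annotations.
func CheckHealth(d *Device, doHandshake bool) (healthy bool, handshook bool) {
	if doHandshake {
		// Leader-only branch: this is where the real code would
		// patch the node's annotations as part of the handshake.
		return d.healthy, true
	}
	return d.healthy, false
}

func main() {
	d := &Device{healthy: true}
	h, shook := CheckHealth(d, false) // follower path
	fmt.Println(h, shook)             // true false
	h, shook = CheckHealth(d, true) // leader path
	fmt.Println(h, shook)           // true true
}
```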

@archlitchi
Member

looks like we have a problem with e2e
