
Conversation

@peachest
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:

The current HAMi implementation lacks sufficient support for multi-replica high availability.

Whilst the kube-scheduler implements leader election via Kubernetes' built-in leader election mechanism, the vgpu-scheduler-extender does not. Deploying multiple replicas directly can therefore run into the following issues:

  1. Multiple instances repeatedly initiating handshakes with the device plugins on each node
  2. Multiple instances duplicating metric collection (see #1432, "Fix: Add leader check for metrics collect correctly")

Special notes for your reviewer:

The holderIdentity field is used for identification within the Lease. The kube-scheduler constructs the holderIdentity using the hostname. Taking version 1.34.2 as an example:

	// add a uniquifier so that two processes on the same host don't accidentally both become active
	id := hostname + "_" + string(uuid.NewUUID())

This PR determines whether the current vgpu-scheduler-extender container belongs to the leader instance by fetching the lease and checking whether the current Pod's hostname is a prefix of Lease.Spec.holderIdentity.
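The hostname-prefix check described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and the sample identity are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// isHolderOf reports whether the lease holder identity was produced by a
// process on the given host. kube-scheduler builds the identity as
// hostname + "_" + uuid, so matching on the "hostname_" prefix suffices;
// the trailing underscore prevents "node-1" from matching "node-10_...".
func isHolderOf(holderIdentity, hostname string) bool {
	return strings.HasPrefix(holderIdentity, hostname+"_")
}

func main() {
	holder := "hami-scheduler-0_9f2c1d3e-5a6b-4c7d-8e9f-0a1b2c3d4e5f"
	fmt.Println(isHolderOf(holder, "hami-scheduler-0")) // true
	fmt.Println(isHolderOf(holder, "hami-scheduler-1")) // false
}
```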

  • Before each RegisterFromNodeAnnotation run, the scheduler first checks whether it is the leader, ensuring only the leader instance initiates handshakes with nodes (and their device plugins).
  • When an instance is elected leader it is notified via a channel and immediately executes RegisterFromNodeAnnotation, even if the scheduled interval has not yet elapsed.
  • The container's readinessProbe ensures only the leader instance joins the hami-scheduler Service endpoints, preventing duplicate metric collection by other instances. This also affects webhooks: only the leader instance's webhook executes.

The hami-scheduler-newversion configuration employs SchedulerConfig to ensure that the kube-scheduler within each Pod exclusively accesses the local vgpu-scheduler-extender. Consequently, there is no concern regarding interference with calls to the Filter/ and Bind/ endpoints.

Does this PR introduce a user-facing change?: No

houyuxi added 9 commits December 22, 2025 20:10
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
1. update notify channel with buffer of size 1 in leaderManager
2. init notify chan for dummy

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@hami-robot
Contributor

hami-robot bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peachest
Once this PR has been reviewed and has the lgtm label, please assign shouren for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot removed the kind/feature new function label Dec 22, 2025
@hami-robot hami-robot bot added the size/L label Dec 22, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @peachest, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the high availability of the hami-scheduler by integrating a Kubernetes-native leader election mechanism. This ensures that in a multi-replica setup, only a single instance of the vgpu-scheduler-extender acts as the primary, preventing conflicts like redundant device plugin handshakes and duplicate metric collection. The changes include new command-line flags, a dedicated leader election utility, and a readiness probe to manage service endpoint exposure based on leadership status, making the scheduler more resilient and efficient.

Highlights

  • Leader Election Implementation: Introduced a robust leader election mechanism for the vgpu-scheduler-extender using Kubernetes Lease objects to ensure high availability.
  • Problem Resolution: Addresses critical issues in multi-replica deployments, such as preventing duplicate handshakes with device plugins and redundant metric collection by ensuring only the leader performs these actions.
  • Dynamic Leadership Notification: Implemented a channel-based notification system to immediately inform an instance upon its election as leader, allowing it to promptly execute critical initialization tasks like RegisterFromNodeAnnotation.
  • Readiness Probe Integration: Added a readiness probe (/readyz endpoint) to the scheduler's deployment, ensuring that only the active leader instance is exposed via the hami-scheduler Service endpoints, thereby preventing non-leader instances from receiving traffic for webhooks and metrics.
  • Configuration Options: Exposed new Helm chart values (leader-elect, leader-elect-resource-name, leader-elect-resource-namespace) to enable and configure the leader election feature.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a master-slave high availability mechanism for the vgpu-scheduler-extender using Kubernetes' leader election. The changes are comprehensive, adding new CLI flags, a readiness probe, and a new leaderelection utility package to observe lease objects and determine leadership. The approach of piggybacking on kube-scheduler's leader election lease is clever.

However, I've found a few issues that need to be addressed. There's a critical compilation error in the new leaderelection package due to an incorrect interface implementation. Additionally, there are a couple of bugs related to configuration in the Helm chart and scheduler initialization. I've also included suggestions to improve the robustness of the leader election logic by checking for lease validity and using non-blocking channel sends.

Comment on lines +138 to +152
func (m *leaderManager) isLeaseValid(now time.Time) bool {
	return m.observedTime.Add(time.Second * time.Duration(*m.observedLease.Spec.LeaseDurationSeconds)).After(now)
}

func (m *leaderManager) IsLeader() bool {
	m.leaseLock.RLock()
	defer m.leaseLock.RUnlock()

	if m.observedLease == nil {
		return false
	}

	// TODO: should we check valid lease here?
	return m.isHolderOf(m.observedLease)
}

high

The function isLeaseValid is currently unused and is susceptible to a nil-pointer dereference if observedLease.Spec.LeaseDurationSeconds is nil. The IsLeader function includes a TODO to check for lease validity, which is a crucial check for correctness. Without it, the manager might incorrectly report being the leader if lease-related watch events are delayed.

I recommend making isLeaseValid nil-safe and integrating it into IsLeader to ensure the lease is not expired.

func (m *leaderManager) isLeaseValid(now time.Time) bool {
	if m.observedLease == nil || m.observedLease.Spec.LeaseDurationSeconds == nil {
		return false
	}
	return m.observedTime.Add(time.Second * time.Duration(*m.observedLease.Spec.LeaseDurationSeconds)).After(now)
}

func (m *leaderManager) IsLeader() bool {
	m.leaseLock.RLock()
	defer m.leaseLock.RUnlock()

	if m.observedLease == nil {
		return false
	}

	if !m.isLeaseValid(time.Now()) {
		return false
	}

	return m.isHolderOf(m.observedLease)
}

	m.setObservedRecord(lease)
	// Notify if we are the leader from the very beginning
	if m.isHolderOf(lease) {
		m.leaderNotify <- struct{}{}

medium

The send to m.leaderNotify is a blocking operation. Since leaderNotify is a buffered channel of size 1, if a new leadership event occurs before the consumer has processed the previous one, this send will block indefinitely, potentially deadlocking the event handler goroutine. Using a non-blocking send would make this more robust.

	select {
	case m.leaderNotify <- struct{}{}:
	default:
	}

	}

	if !m.isHolderOf(oldLease) {
		m.leaderNotify <- struct{}{}

medium

The send to m.leaderNotify is a blocking operation. Since leaderNotify is a buffered channel of size 1, if a new leadership event occurs before the consumer has processed the previous one, this send will block indefinitely, potentially deadlocking the event handler goroutine. Using a non-blocking send would make this more robust.

	select {
	case m.leaderNotify <- struct{}{}:
	default:
	}

@peachest peachest changed the title Implementing master-slave high availability using leader-election Implementing leader-follower high availability using leader-election Dec 23, 2025
houyuxi added 2 commits December 23, 2025 14:27
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
…cheduler.admissionWebhook.enabled`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
houyuxi added 5 commits December 23, 2025 14:27
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
… correct call when failed

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@codecov

codecov bot commented Dec 24, 2025

Codecov Report

❌ Patch coverage is 68.31683% with 32 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/scheduler/scheduler.go 21.42% 9 Missing and 2 partials ⚠️
pkg/util/leaderelection/leaderelection.go 85.71% 7 Missing and 4 partials ⚠️
pkg/scheduler/routes/route.go 0.00% 10 Missing ⚠️
Flag Coverage Δ
unittests 51.25% <68.31%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
pkg/scheduler/config/config.go 78.45% <ø> (ø)
pkg/scheduler/routes/route.go 0.00% <0.00%> (ø)
pkg/scheduler/scheduler.go 50.96% <21.42%> (-0.41%) ⬇️
pkg/util/leaderelection/leaderelection.go 85.71% <85.71%> (ø)

... and 7 files with indirect coverage changes


Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
	// TODO: maybe we should lock node when we are doing register.
	// Only do registration when we are leader
	if !s.leaderManager.IsLeader() {
		continue
Collaborator

@peachest I think it is too late to do registration only once we are leader. When the kube-scheduler in the same Pod becomes the leader and sends a filter request to this extender, there is no guarantee that the registration will be done before handling that request.

Contributor

@Shouren I think this check is for addressing

Multiple instances repeatedly initiating handshakes with device plugins on nodes

But #1499 refines the handshake logic, so I'm not sure whether the new sync logic still has this problem. If not, we don't need this section any more.

And for the requests, leader control is delegated to the kube-scheduler: only the leader kube-scheduler passes requests to its extender, so the extender doesn't need to care about that. As the description says:

The holderIdentity field is used for identification within the Lease. The kube-scheduler constructs the holderIdentity using the hostname. Taking version 1.34.2 as an example:

// add a uniquifier so that two processes on the same host don't accidentally both become active
id := hostname + "_" + string(uuid.NewUUID())

This PR determines whether the current vgpu-scheduler-extender container belongs to the leader instance by acquiring the lease and checking whether the current Pod's hostname prefix exists in Lease.Spec.holderIdentity

Contributor Author

@peachest peachest Dec 25, 2025

I think what @Shouren means is that we should make sure RegisterFromNodeAnnotation finishes at least once before the Filter/ endpoint handles a request.

I am trying to implement this by adding a synced flag and simply waiting for syncing before filtering. It seems reasonable that when RegisterFromNodeAnnotation finishes we set synced to true, and set it back to false when we lose leadership, since we won't do any registration from then on.

Is there any better idea?
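A minimal sketch of that synced-flag idea (every name here is hypothetical, not the repository's API): set the flag after a successful registration, clear it on lost leadership, and gate Filter/ on it:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// syncState tracks whether this instance has completed at least one
// RegisterFromNodeAnnotations pass since gaining leadership.
type syncState struct{ synced atomic.Bool }

// onRegisterDone is called after a registration pass succeeds.
func (s *syncState) onRegisterDone() { s.synced.Store(true) }

// onLostLeadership resets the flag, since a follower performs no
// further registrations and its node view may go stale.
func (s *syncState) onLostLeadership() { s.synced.Store(false) }

// readyToFilter gates the Filter/ endpoint until the first registration
// has landed, so filter requests never see an unpopulated nodeManager.
func (s *syncState) readyToFilter() bool { return s.synced.Load() }

func main() {
	var s syncState
	fmt.Println(s.readyToFilter()) // false: before first registration
	s.onRegisterDone()
	fmt.Println(s.readyToFilter()) // true: after registration
	s.onLostLeadership()
	fmt.Println(s.readyToFilter()) // false: after losing leadership
}
```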

Collaborator

@Shouren Shouren Dec 25, 2025

@FouoF @peachest Let me explain it in detail, the check is achieved by following line

health, needUpdate := devInstance.CheckHealth(devhandsk, val)

and the implementation of CheckHealth differs for each devInstance. #1499 refactors the implementation of CheckHealth for NvidiaGPUDevices so that it no longer calls device.CheckHealth, but four devInstances (kunlun, hygon, iluvatar & ascend) still call device.CheckHealth, so we need this check.

However, the RegisterFromNodeAnnotations function will execute the following line

s.addNode(val.Name, nodeInfo)

to add the node's information to the scheduler's nodeManager when the scheduler initializes or when a node needs to be updated. My concern is that the current implementation skips this step before the kube-scheduler becomes the leader, and it might be too late when a filter request arrives but the nodes in that request have not yet been added to nodeManager.

Perhaps we need a new variable to indicate whether the handshake is required, but I'm not sure if it will bring new issues.

Collaborator

Is there any better idea?

@peachest Maybe we can add a new parameter to device.CheckHealth to control the handshake behavior, so that only the leader (let's put the fencing issue aside for now) can patch the node's annotations while the rest of the code keeps working as before. But I have to check the code to see if there are any conflicts with this implementation.
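As a rough illustration of that suggestion (every name and signature below is hypothetical, not HAMi's actual CheckHealth API), the extra parameter would let followers evaluate device health without performing the annotation-patching handshake:

```go
package main

import "fmt"

// Device is a stand-in for a devInstance's device state.
type Device struct{ healthy bool }

// CheckHealth evaluates health; doHandshake controls whether the
// node-annotation handshake side effect runs. The idea is that only the
// leader passes true, so followers never patch node annotations.
func CheckHealth(d *Device, doHandshake bool) (healthy bool, handshook bool) {
	if doHandshake {
		// Leader-only branch: this is where the real code would
		// patch the node's annotations as part of the handshake.
		return d.healthy, true
	}
	return d.healthy, false
}

func main() {
	d := &Device{healthy: true}
	h, shook := CheckHealth(d, false) // follower path
	fmt.Println(h, shook)             // true false
	h, shook = CheckHealth(d, true) // leader path
	fmt.Println(h, shook)           // true true
}
```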

@archlitchi
Member

looks like we have a problem with e2e
