Implementing leader-follower high availability using leader-election #1553
Conversation
Code Review
This pull request introduces a master-slave high availability mechanism for the vgpu-scheduler-extender using Kubernetes' leader election. The changes are comprehensive, adding new CLI flags, a readiness probe, and a new leaderelection utility package to observe lease objects and determine leadership. The approach of piggybacking on kube-scheduler's leader election lease is clever.
However, I've found a few issues that need to be addressed. There's a critical compilation error in the new leaderelection package due to an incorrect interface implementation. Additionally, there are a couple of bugs related to configuration in the Helm chart and scheduler initialization. I've also included suggestions to improve the robustness of the leader election logic by checking for lease validity and using non-blocking channel sends.
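For context, here is a minimal sketch (not the PR's actual code) of the lease-observation pattern such a leaderelection package is built around, assuming a client-go shared informer watching the kube-scheduler leader-election Lease; the lease name "kube-scheduler" and the kube-system namespace are assumptions and may differ in HAMi's deployment:

import (
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchSchedulerLease observes the kube-scheduler leader-election Lease and
// prints the current holder whenever it changes.
func watchSchedulerLease(stopCh <-chan struct{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 0, informers.WithNamespace("kube-system"))

	handle := func(obj interface{}) {
		lease, ok := obj.(*coordinationv1.Lease)
		if !ok || lease.Name != "kube-scheduler" { // assumed lease name
			return
		}
		if lease.Spec.HolderIdentity != nil {
			fmt.Printf("observed lease holder: %s\n", *lease.Spec.HolderIdentity)
		}
	}
	factory.Coordination().V1().Leases().Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc:    handle,
			UpdateFunc: func(_, newObj interface{}) { handle(newObj) },
		})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
	return nil
}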
func (m *leaderManager) isLeaseValid(now time.Time) bool {
	return m.observedTime.Add(time.Second * time.Duration(*m.observedLease.Spec.LeaseDurationSeconds)).After(now)
}

func (m *leaderManager) IsLeader() bool {
	m.leaseLock.RLock()
	defer m.leaseLock.RUnlock()

	if m.observedLease == nil {
		return false
	}

	// TODO: should we check valid lease here?
	return m.isHolderOf(m.observedLease)
}
The function isLeaseValid is currently unused and is susceptible to a nil-pointer dereference if observedLease.Spec.LeaseDurationSeconds is nil. The IsLeader function includes a TODO to check for lease validity, which is a crucial check for correctness. Without it, the manager might incorrectly report being the leader if lease-related watch events are delayed.
I recommend making isLeaseValid nil-safe and integrating it into IsLeader to ensure the lease is not expired.
func (m *leaderManager) isLeaseValid(now time.Time) bool {
if m.observedLease == nil || m.observedLease.Spec.LeaseDurationSeconds == nil {
return false
}
return m.observedTime.Add(time.Second * time.Duration(*m.observedLease.Spec.LeaseDurationSeconds)).After(now)
}
func (m *leaderManager) IsLeader() bool {
m.leaseLock.RLock()
defer m.leaseLock.RUnlock()
if m.observedLease == nil {
return false
}
if !m.isLeaseValid(time.Now()) {
return false
}
return m.isHolderOf(m.observedLease)
}

	m.setObservedRecord(lease)
	// Notify if we are the leader from the very beginning
	if m.isHolderOf(lease) {
		m.leaderNotify <- struct{}{}
The send to m.leaderNotify is a blocking operation. Since leaderNotify is a buffered channel of size 1, if a new leadership event occurs before the consumer has processed the previous one, this send will block indefinitely, potentially deadlocking the event handler goroutine. Using a non-blocking send would make this more robust.
select {
case m.leaderNotify <- struct{}{}:
default:
}

	}

	if !m.isHolderOf(oldLease) {
		m.leaderNotify <- struct{}{}
The send to m.leaderNotify is a blocking operation. Since leaderNotify is a buffered channel of size 1, if a new leadership event occurs before the consumer has processed the previous one, this send will block indefinitely, potentially deadlocking the event handler goroutine. Using a non-blocking send would make this more robust.
select {
case m.leaderNotify <- struct{}{}:
default:
}
// TODO: maybe we should lock the node when we are doing registration.
// Only do registration when we are the leader
if !s.leaderManager.IsLeader() {
	continue
@peachest I think it is too late to do registration only once we are the leader. When the kube-scheduler in the same Pod becomes the leader and sends a filter request to this extender, there is no guarantee that the registration will be done before that request is handled.
@Shouren I think this check is for addressing
Multiple instances repeatedly initiating handshakes with device plugins on nodes
But #1499 refines the handshake logic, so I'm not sure whether the new sync logic still has this problem. If not, we don't need this section any more.
As for the requests, leader control is delegated to the k8s-scheduler. Only the leader k8s-scheduler will pass requests to the extender, so the extender doesn't need to care about that. As the description says:
The holderIdentity field is used for identification within the Lease. The kube-scheduler constructs the holderIdentity using the hostname. Taking version 1.34.2 as an example:
// add a uniquifier so that two processes on the same host don't accidentally both become active
id := hostname + "_" + string(uuid.NewUUID())
This PR determines whether the current vgpu-scheduler-extender container belongs to the leader instance by acquiring the lease and checking whether the current Pod's hostname prefix exists in Lease.Spec.holderIdentity.
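For illustration only, here is a minimal sketch of what such a hostname-prefix check could look like; the PR's actual isHolderOf implementation may differ, and the os.Hostname call and the "_" separator are assumptions mirroring how kube-scheduler builds its identity:

import (
	"os"
	"strings"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// isHolderOf reports whether this Pod is the holder named in the observed Lease.
// kube-scheduler sets holderIdentity to "<hostname>_<uuid>", so matching the
// hostname prefix identifies the Pod that currently holds the lease.
func isHolderOf(lease *coordinationv1.Lease) bool {
	if lease == nil || lease.Spec.HolderIdentity == nil {
		return false
	}
	hostname, err := os.Hostname()
	if err != nil {
		return false
	}
	return strings.HasPrefix(*lease.Spec.HolderIdentity, hostname+"_")
}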
I think what @Shouren means is that we should make sure RegisterFromNodeAnnotation has finished at least once before filtering, i.e. before the Filter/ endpoint is called.
I am trying to implement this by adding a synced flag and simply waiting for syncing before filtering. The idea is: once RegisterFromNodeAnnotation has finished, we set synced to true, and we set synced back to false when we lose leadership, because we will not do any registration from then on. A rough sketch of this idea follows below.
Is there any better idea?
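A minimal sketch of that synced-flag idea, assuming an atomic.Bool on the scheduler and a leadership callback driven by the leaderManager's notify channel; the type and function names here are hypothetical and not the PR's actual code:

import (
	"sync/atomic"
	"time"
)

// Scheduler is a stand-in for the extender's scheduler type.
type Scheduler struct {
	synced atomic.Bool
}

// registerFromNodeAnnotation stands in for the real registration/handshake routine.
func (s *Scheduler) registerFromNodeAnnotation() {}

// onLeadershipChange would be called from the goroutine that drains the
// leaderManager's notify channel.
func (s *Scheduler) onLeadershipChange(isLeader bool) {
	if !isLeader {
		// We stop registering once leadership is lost, so cached node
		// state can no longer be treated as fresh.
		s.synced.Store(false)
		return
	}
	s.registerFromNodeAnnotation()
	s.synced.Store(true)
}

// waitForSync lets Filter wait until at least one registration has completed,
// or give up after a timeout.
func (s *Scheduler) waitForSync(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if s.synced.Load() {
			return true
		}
		time.Sleep(50 * time.Millisecond)
	}
	return false
}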
@FouoF @peachest Let me explain it in detail: the check is achieved by the following line
health, needUpdate := devInstance.CheckHealth(devhandsk, val)
The implementation of CheckHealth differs for each devInstance; #1499 refactored CheckHealth for NvidiaGPUDevices so that it no longer calls device.CheckHealth, but four devInstances (kunlun, hygon, iluvatar & ascend) still call device.CheckHealth, so we need this check.
However, the RegisterFromNodeAnnotations function will execute the following line
s.addNode(val.Name, nodeInfo)
to add node information to the scheduler's nodeManager when the scheduler is initialized or a node needs to be updated. My concern is that the current implementation skips this step until the kube-scheduler becomes the leader, and it might be too late if a filter request arrives before the nodes in that request have been added to the nodeManager.
Perhaps we need a new variable to indicate whether a handshake is required, but I'm not sure whether it will bring new issues.
Is there any better idea?
@peachest Maybe we can add a new parameter to device.CheckHealth to control the handshake behavior, so that only the leader (let's put the fencing issue aside for now) can patch the node annotations while the rest of the code keeps working as before. But I have to check the code to see whether there are any conflicts with this implementation.
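For illustration, a rough sketch of that proposal, assuming a boolean parameter gates the annotation-patching handshake; the function signature and the helper functions below are placeholders, not HAMi's actual device.CheckHealth interface:

import (
	corev1 "k8s.io/api/core/v1"
)

// evaluateHealth and needsAnnotationRefresh stand in for the existing
// per-vendor logic; they are placeholders, not HAMi functions.
func evaluateHealth(devType string, node *corev1.Node) bool         { return true }
func needsAnnotationRefresh(devType string, node *corev1.Node) bool { return false }

// checkHealth sketches the proposed extra parameter: health reporting is
// unchanged, but the node-annotation handshake only happens when the caller
// (the leader) allows it.
func checkHealth(devType string, node *corev1.Node, allowHandshake bool) (health bool, needUpdate bool) {
	health = evaluateHealth(devType, node)
	if !allowHandshake {
		// Followers only observe; they never trigger an annotation patch.
		return health, false
	}
	return health, needsAnnotationRefresh(devType, node)
}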
Looks like we have a problem with e2e.
What type of PR is this?
/kind feature
What this PR does / why we need it:
The current HAMi implementation lacks sufficient support for multi-replica high availability.
Whilst the kube-scheduler has implemented master-slave election via Kubernetes' built-in leader election mechanism, the vgpu-scheduler-extender has not. Direct deployment of multiple replicas may encounter issues such as:
- Multiple instances repeatedly initiating handshakes with device plugins on nodes
Special notes for your reviewer:
The holderIdentity field is used for identification within the Lease. The kube-scheduler constructs the holderIdentity using the hostname. Taking version 1.34.2 as an example:

// add a uniquifier so that two processes on the same host don't accidentally both become active
id := hostname + "_" + string(uuid.NewUUID())

This PR determines whether the current vgpu-scheduler-extender container belongs to the leader instance by acquiring the lease and checking whether the current Pod's hostname prefix exists in Lease.Spec.holderIdentity.
- Before running RegisterFromNodeAnnotation each time, the scheduler must first determine whether it is the leader, ensuring only the leader instance initiates handshakes with nodes (and their device plugins).
- RegisterFromNodeAnnotation is triggered immediately upon gaining leadership, even if the scheduled interval has not yet elapsed.
The hami-scheduler-newversion configuration employs SchedulerConfig to ensure that the kube-scheduler within each Pod exclusively accesses the local vgpu-scheduler-extender. Consequently, there is no concern regarding interference with calls to the Filter/ and Bind/ endpoints.

Does this PR introduce a user-facing change?:
No