feat(telco-kpi): add lock-free job preemption based on OCP version priority #72894
base: master
4948bce to 22ecf85
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
793a663 to d2985aa
d2985aa to 589129e
589129e to a761dfb
4c8f4c1 to 0b2d3ae
ca67d8b to 318722a
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis |
ef7bd59 to
9896c65
Compare
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/assign @eifrach |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
18ecbdd to
c8f7c83
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ccardenosa The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
…iority
Problem:
Multiple Telco KPI Prow jobs compete for the same baremetal host. Lower OCP
version jobs can block higher version jobs for extended periods, delaying
critical testing for newer releases.
Solution:
Implement a lock-free preemption mechanism where higher OCP version jobs can
signal lower version jobs to quit, freeing the baremetal host sooner.
How it works:
1. WAITING PHASE: Each job creates a unique waiting file on the bastion BEFORE
attempting to acquire the lock: <lock>.waiting.<nanosecond_timestamp>.<ocp_version>
Example: spoke-baremetal-50-7c-6f-5c-47-8c.lock.waiting.1766568440841947242.4.22
This ensures the job's presence is visible even if it immediately gets the lock.
2. LOCK ACQUISITION: When a job acquires the lock, it checks for higher-priority
waiters BEFORE removing its own waiting file (deferred deletion). If a
higher-priority waiter is found, it releases the lock and keeps its waiting
file so that it can retry. Only when no higher-priority waiter is found does
it remove its waiting file.
3. PERIODIC CHECKS: While holding the lock, the job periodically checks for
higher priority waiters at key points:
- cluster-install: every QUIT_CHECK_INTERVAL iterations (default: 3)
- oslat test: before running tests
- cpu-util test: before running tests
4. QUIT MODES:
- 'graceful' (exit 0): Used by test steps. Allows remaining steps like
PTP reporting to complete. Job exits cleanly.
- 'force' (exit 1): Used by cluster-install. If installation is interrupted,
remaining steps are meaningless. Job aborts immediately.
5. CLEANUP: Each job always removes its own waiting file during cleanup,
regardless of whether it acquired the lock. This prevents orphaned files.
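
The waiting-file naming and the priority scan described in steps 1 to 3 can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function names are borrowed from the list in this commit message, but the bodies are assumptions, `version_gt` is a hypothetical helper, a local directory stands in for the bastion, and `date +%s%N` assumes GNU date.

```shell
#!/usr/bin/env bash
# Sketch only: a local directory stands in for the bastion; not the PR's code.

# Extract "4.22" from a job name such as "...-nightly-4.22-telcov10n-...".
extract_ocp_version() {
  local job_name="$1"
  [[ "$job_name" =~ ([0-9]+\.[0-9]+) ]] && echo "${BASH_REMATCH[1]}"
}

# Create <lock>.waiting.<nanosecond_timestamp>.<ocp_version> and print its path.
# Assumes GNU date, which supports %N (nanoseconds).
create_waiting_request_file() {
  local lock_file="$1" version="$2" waiting_file
  waiting_file="${lock_file}.waiting.$(date +%s%N).${version}"
  : > "$waiting_file"
  echo "$waiting_file"
}

# Hypothetical helper: succeed if version $1 is strictly higher than $2.
# Compares major and minor numerically, so 4.20 ranks above 4.9.
version_gt() {
  local a_major="${1%%.*}" a_minor="${1#*.}"
  local b_major="${2%%.*}" b_minor="${2#*.}"
  (( 10#$a_major > 10#$b_major ||
     (10#$a_major == 10#$b_major && 10#$a_minor > 10#$b_minor) ))
}

# Print the first waiting file whose version beats ours; return 1 if none does.
check_for_higher_priority_waiter() {
  local lock_file="$1" my_version="$2" f their_version
  for f in "${lock_file}".waiting.*; do
    [[ -e "$f" ]] || continue            # glob matched nothing
    their_version="${f##*.waiting.}"     # <timestamp>.<major>.<minor>
    their_version="${their_version#*.}"  # <major>.<minor>
    if version_gt "$their_version" "$my_version"; then
      echo "$f"
      return 0
    fi
  done
  return 1
}
```

A job's own waiting file never triggers a quit in this sketch, because the comparison is strict: a file carrying the same version as the caller is skipped.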
Key benefits:
- Lock-free: No shared mutable state, each job manages its own file
- Race-safe: Nanosecond timestamps make filename collisions between jobs highly unlikely
- Deferred deletion: Waiting file persists until validation passes
- Self-cleaning: Jobs clean up only their own files
- Configurable: QUIT_CHECK_INTERVAL controls check frequency
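
To make the deferred-deletion point concrete, here is a minimal sketch of a single acquisition attempt. It is not taken from this PR: both function names are hypothetical, bash's `noclobber` option stands in for whatever lock primitive the real steps use, and `sort -V` (version sort, as in GNU coreutils) is assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch only: noclobber is an assumed stand-in for the real lock primitive.

# Print the highest OCP version among the waiting files (empty if none exist).
highest_waiting_version() {
  local lock_file="$1" f
  for f in "${lock_file}".waiting.*; do
    [[ -e "$f" ]] || continue
    f="${f##*.waiting.}"   # <timestamp>.<major>.<minor>
    echo "${f#*.}"         # <major>.<minor>
  done | sort -V | tail -n 1
}

# Try once to take the lock.
# Returns 0 when the lock is held, 1 when the caller should retry later.
try_acquire_with_deferred_deletion() {
  local lock_file="$1" waiting_file="$2" my_version="$3" highest newest
  # Create-if-absent: fails when another job already holds the lock.
  ( set -o noclobber; : > "$lock_file" ) 2>/dev/null || return 1

  highest="$(highest_waiting_version "$lock_file")"
  newest="$(printf '%s\n' "$highest" "$my_version" | sort -V | tail -n 1)"
  if [[ -n "$highest" && "$highest" != "$my_version" && "$newest" == "$highest" ]]; then
    rm -f "$lock_file"     # yield; the waiting file stays in place for the retry
    return 1
  fi

  rm -f "$waiting_file"    # validation passed: stop advertising as a waiter
  return 0
}
```

The waiting file is removed only on the success path, so a job that yields remains visible to other jobs while it waits to retry.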
New shared functions:
- extract_ocp_version: Gets version from JOB_NAME
- create_waiting_request_file: Creates unique waiting file
- remove_own_waiting_file: Removes job's waiting file
- check_for_higher_priority_waiter: Scans waiting files for higher version
- should_quit: Determines if quit is needed
- check_for_quit: Main entry point (supports graceful/force modes)
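
The two quit modes and the interval-based check could look roughly like the sketch below. It is a hedged illustration, not the PR's implementation: `should_quit` is reduced to a stub that tests a hypothetical `PREEMPT_FLAG` file, and `quit_code_for_mode` and `wait_with_quit_checks` are invented names. Only the exit codes (0 for graceful, 1 for force) and the `QUIT_CHECK_INTERVAL` default of 3 come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: illustrates the graceful/force exit codes and the interval check.

QUIT_CHECK_INTERVAL="${QUIT_CHECK_INTERVAL:-3}"

# Stub for the real scan: succeeds when the file named by the hypothetical
# PREEMPT_FLAG variable exists.
should_quit() {
  [[ -n "${PREEMPT_FLAG:-}" && -e "$PREEMPT_FLAG" ]]
}

# Map a quit mode to the exit code the step should finish with.
quit_code_for_mode() {
  case "$1" in
    graceful) echo 0 ;;   # test steps: let later steps (e.g. reporting) run
    force)    echo 1 ;;   # cluster-install: abort the job immediately
    *)        echo "unknown quit mode: $1" >&2; return 2 ;;
  esac
}

# Exit the current shell when a higher-priority job is waiting; otherwise return 0.
check_for_quit() {
  local mode="${1:-graceful}" code
  should_quit || return 0
  code="$(quit_code_for_mode "$mode")" || exit 2
  echo "Higher-priority job waiting; quitting (${mode})" >&2
  exit "$code"
}

# Polling-loop shape for a long-running step: check every
# QUIT_CHECK_INTERVAL iterations rather than on every pass.
wait_with_quit_checks() {
  local max_iterations="$1" mode="$2" i
  for (( i = 1; i <= max_iterations; i++ )); do
    if (( i % QUIT_CHECK_INTERVAL == 0 )); then
      check_for_quit "$mode"
    fi
    sleep 0   # placeholder for the real per-iteration work
  done
}
```

Because `check_for_quit` calls `exit`, it ends whichever shell invokes it, so a caller that needs to continue afterwards would run it in a subshell.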
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
c8f7c83 to 53e6bf8
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis
@ccardenosa: The following tests failed, say
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis
Problem:
Multiple Telco KPI Prow jobs compete for the same baremetal host. Lower OCP version jobs can block higher version jobs for extended periods, delaying critical testing for newer releases.
Solution:
Implement a lock-free preemption mechanism where higher OCP version jobs can signal lower version jobs to quit, freeing the baremetal host sooner.
How it works:
WAITING PHASE: Each job creates a unique waiting file on the bastion BEFORE attempting to acquire the lock:
<lock>.waiting.<nanosecond_timestamp>.<ocp_version>
Example: spoke-baremetal-50-7c-6f-5c-47-8c.lock.waiting.1766568440841947242.4.22
This ensures the job's presence is visible even if it immediately gets the lock.
LOCK ACQUISITION: When a job acquires the lock, it checks for higher-priority waiters BEFORE removing its own waiting file (deferred deletion). If a higher-priority waiter is found, it releases the lock and keeps its waiting file so that it can retry. Only when no higher-priority waiter is found does it remove its waiting file.
PERIODIC CHECKS: While holding the lock, the job periodically checks for higher-priority waiters at key points:
- cluster-install: every QUIT_CHECK_INTERVAL iterations (default: 3)
Priority logic:
Key benefits:
- Configurable: QUIT_CHECK_INTERVAL controls check frequency
New shared functions:
- extract_ocp_version: Gets version from JOB_NAME
- create_waiting_request_file: Creates unique waiting file
- remove_own_waiting_file: Removes job's waiting file
- check_for_higher_priority_waiter: Scans waiting files for higher version
- should_quit: Determines if quit is needed
- check_for_quit: Main entry point (supports graceful/force modes)