@ccardenosa ccardenosa commented Dec 22, 2025

Problem:

Multiple Telco KPI Prow jobs compete for the same baremetal host. Lower OCP version jobs can block higher version jobs for extended periods, delaying critical testing for newer releases.

Solution:

Implement a lock-free preemption mechanism where higher OCP version jobs can signal lower version jobs to quit, freeing the baremetal host sooner.

How it works:

  1. WAITING PHASE: Each job creates a unique waiting file on the bastion BEFORE attempting to acquire the lock, named <lock>.waiting.<nanosecond_timestamp>.<ocp_version>
    Example: spoke-baremetal-50-7c-6f-5c-47-8c.lock.waiting.1766568440841947242.4.22
    This ensures the job's presence is visible even if it immediately gets the lock.

  2. LOCK ACQUISITION: When a job acquires the lock, it checks for higher-priority waiters BEFORE removing its own waiting file (deferred deletion). If a higher-priority waiter is found, it releases the lock and keeps its waiting file for the retry. Only when no higher-priority waiter is found does it remove its waiting file.

  3. PERIODIC CHECKS: While holding the lock, the job checks for higher-priority waiters at key points:

  • cluster-install: every QUIT_CHECK_INTERVAL iterations (default: 3)
  • oslat test: before running tests
  • cpu-util test: before running tests

  4. QUIT MODES:

  • graceful (exit 0): Used by test steps. The job exits cleanly, so remaining steps such as PTP reporting can still complete.
  • force (exit 1): Used by cluster-install. An interrupted installation makes the remaining steps meaningless, so the job aborts immediately.

  5. CLEANUP: Each job always removes its own waiting file during cleanup, regardless of whether it acquired the lock. This prevents orphaned files.
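
The waiting-file lifecycle in steps 1, 2 and 5 can be sketched as follows. This is an illustration only, written in Python rather than the shell used by the step registry, and it assumes the bastion directory is reachable as a local path. The names `waiting_version`, `higher_priority_waiter` and `confirm_lock` are hypothetical and are not the functions added by this PR.

```python
import tempfile
import time
from pathlib import Path


def waiting_version(file_name: str, prefix: str) -> tuple[int, ...]:
    """Extract the OCP version from '<prefix><ns_timestamp>.<ocp_version>' as a tuple of ints."""
    _timestamp, version = file_name[len(prefix):].split(".", 1)
    return tuple(int(part) for part in version.split("."))


def create_waiting_file(bastion_dir: Path, lock_name: str, ocp_version: str) -> Path:
    """Step 1: announce this job BEFORE it tries to take the lock."""
    path = bastion_dir / f"{lock_name}.waiting.{time.time_ns()}.{ocp_version}"
    path.touch()
    return path


def higher_priority_waiter(bastion_dir: Path, lock_name: str, own_file: Path) -> bool:
    """Report whether any other waiting file carries a strictly higher OCP version."""
    prefix = f"{lock_name}.waiting."
    own_version = waiting_version(own_file.name, prefix)
    for candidate in bastion_dir.iterdir():
        if candidate == own_file or not candidate.name.startswith(prefix):
            continue
        if waiting_version(candidate.name, prefix) > own_version:
            return True
    return False


def confirm_lock(bastion_dir: Path, lock_name: str, own_file: Path) -> bool:
    """Step 2: run right after acquiring the lock (deferred deletion)."""
    if higher_priority_waiter(bastion_dir, lock_name, own_file):
        return False  # caller releases the lock; the waiting file stays for the retry
    own_file.unlink(missing_ok=True)  # nobody outranks this job, so it stops waiting
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bastion = Path(tmp)
        lock = "spoke-baremetal-50-7c-6f-5c-47-8c.lock"
        job_420 = create_waiting_file(bastion, lock, "4.20")
        job_422 = create_waiting_file(bastion, lock, "4.22")
        print(confirm_lock(bastion, lock, job_420))  # False: 4.22 is waiting, 4.20 yields
        print(confirm_lock(bastion, lock, job_422))  # True: nothing outranks 4.22
        job_420.unlink(missing_ok=True)  # step 5: cleanup removes the job's own file
```

Keeping the waiting file until the check passes is what makes the yielding job still visible to other jobs when it retries.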

Priority logic:

  • ONLY the OCP version determines priority (e.g., 4.22 > 4.20)
  • The nanosecond timestamp is NOT used for priority decisions
  • Same-version jobs (e.g., two 4.22 jobs) compete equally for the lock
  • Timestamp is used solely for: (1) unique filenames, (2) self-cleanup
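
A minimal sketch of these priority rules, assuming the file name layout shown in the example above. The helper names `parse_waiting_file` and `outranks` are hypothetical, and comparing versions as integer tuples (so that 4.10 ranks above 4.9) is an assumption of this sketch, since the description does not say how the comparison is implemented.

```python
def parse_waiting_file(name: str) -> tuple[int, tuple[int, ...]]:
    """Split '<lock>.waiting.<ns_timestamp>.<ocp_version>' into (timestamp, version)."""
    _, _, suffix = name.rpartition(".waiting.")
    timestamp, _, version = suffix.partition(".")
    return int(timestamp), tuple(int(part) for part in version.split("."))


def outranks(waiter: str, holder: str) -> bool:
    """True only when the waiter's OCP version is strictly higher; timestamps are ignored."""
    _, waiter_version = parse_waiting_file(waiter)
    _, holder_version = parse_waiting_file(holder)
    return waiter_version > holder_version


lock = "spoke-baremetal-50-7c-6f-5c-47-8c.lock"
older_422 = f"{lock}.waiting.1766568440841947242.4.22"
newer_422 = f"{lock}.waiting.1766568440841947999.4.22"
job_420 = f"{lock}.waiting.1766568440000000000.4.20"

print(outranks(older_422, job_420))    # True: 4.22 > 4.20
print(outranks(job_420, older_422))    # False: a lower version never preempts
print(outranks(older_422, newer_422))  # False: same version, the timestamp breaks no tie
```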

Key benefits:

  • Lock-free: No shared mutable state, each job manages its own file
  • Race-safe: Nanosecond timestamps ensure unique filenames
  • Self-cleaning: Jobs clean up only their own files
  • Configurable: QUIT_CHECK_INTERVAL controls check frequency

New shared functions:

  • extract_ocp_version: Gets version from JOB_NAME
  • create_waiting_request_file: Creates unique waiting file
  • remove_own_waiting_file: Removes job's waiting file
  • check_for_higher_priority_waiter: Scans waiting files for higher version
  • should_quit: Determines if quit is needed
  • check_for_quit: Main entry point (supports graceful/force modes)
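
How the two quit modes and QUIT_CHECK_INTERVAL might fit together can be sketched as below. This is a Python illustration, not the PR's code: the real `should_quit` and `check_for_quit` are shared step-registry functions whose signatures are not shown in this thread, so the parameters here are assumptions, and `wait_for_install` and `poll_once` are hypothetical names for a polling loop such as the one in cluster-install.

```python
import sys
from typing import Callable

QUIT_CHECK_INTERVAL = 3  # default from the description: check every 3 iterations


def should_quit(higher_priority_waiting: Callable[[], bool]) -> bool:
    """A quit is needed when a higher-version job is waiting for the host."""
    return higher_priority_waiting()


def check_for_quit(higher_priority_waiting: Callable[[], bool], mode: str = "graceful") -> None:
    """Exit with 0 in 'graceful' mode or 1 in 'force' mode when a quit is needed."""
    if mode not in ("graceful", "force"):
        raise ValueError(f"unknown quit mode: {mode}")
    if should_quit(higher_priority_waiting):
        sys.exit(0 if mode == "graceful" else 1)


def wait_for_install(poll_once: Callable[[], bool],
                     higher_priority_waiting: Callable[[], bool],
                     max_iterations: int = 100) -> bool:
    """Poll an installation, checking for preemption every QUIT_CHECK_INTERVAL iterations."""
    for iteration in range(1, max_iterations + 1):
        if poll_once():
            return True  # installation finished
        if iteration % QUIT_CHECK_INTERVAL == 0:
            check_for_quit(higher_priority_waiting, mode="force")
    return False  # gave up without finishing or being preempted
```

A test step would instead call `check_for_quit(..., mode="graceful")` once before running its tests, so that an exit code of 0 lets later steps such as PTP reporting still run.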

@openshift-ci openshift-ci bot requested review from dgoodwin and neisw December 22, 2025 14:41
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 22, 2025
@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch from 4948bce to 22ecf85 Compare December 22, 2025 14:54
@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch 2 times, most recently from 793a663 to d2985aa Compare December 23, 2025 11:29
@openshift-ci-robot

@ccardenosa, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not load step registry: test `telcov10n-shared-functions` has `commands` containing `trap` command, but test step is missing grace_period
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch from d2985aa to 589129e Compare December 23, 2025 11:46
@openshift-ci-robot

@ccardenosa, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not load step registry: test `telcov10n-shared-functions` has `commands` containing `trap` command, but test step is missing grace_period

@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch from 589129e to a761dfb Compare December 23, 2025 12:56
@ccardenosa ccardenosa changed the title feat(telco-kpi): add graceful quit priority mechanism for competing jobs feat(telco-kpi): add lock-free graceful quit priority mechanism for competing jobs Dec 23, 2025
@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch 3 times, most recently from 4c8f4c1 to 0b2d3ae Compare December 23, 2025 14:33
@openshift-ci-robot

@ccardenosa, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not load step registry: test `telcov10n-shared-functions` has `commands` containing `trap` command, but test step is missing grace_period

@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch 6 times, most recently from ca67d8b to 318722a Compare December 23, 2025 15:15
@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch from ef7bd59 to 9896c65 Compare December 24, 2025 09:50
@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/assign @eifrach

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch from 18ecbdd to c8f7c83 Compare December 24, 2025 11:57

openshift-ci bot commented Dec 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ccardenosa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 24, 2025
@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

…iority

Problem:
Multiple Telco KPI Prow jobs compete for the same baremetal host. Lower OCP
version jobs can block higher version jobs for extended periods, delaying
critical testing for newer releases.

Solution:
Implement a lock-free preemption mechanism where higher OCP version jobs can
signal lower version jobs to quit, freeing the baremetal host sooner.

How it works:
1. WAITING PHASE: Each job creates a unique waiting file on the bastion BEFORE
   attempting to acquire the lock: <lock>.waiting.<nanosecond_timestamp>.<ocp_version>
   Example: spoke-baremetal-50-7c-6f-5c-47-8c.lock.waiting.1766568440841947242.4.22
   This ensures the job's presence is visible even if it immediately gets the lock.

2. LOCK ACQUISITION: When a job acquires the lock, it checks for higher-priority
   waiters BEFORE removing its own waiting file (deferred deletion). If a
   higher-priority waiter is found, it releases the lock and keeps its waiting
   file for the retry. Only when none is found does it remove its waiting file.

3. PERIODIC CHECKS: While holding the lock, the job periodically checks for
   higher priority waiters at key points:
   - cluster-install: every QUIT_CHECK_INTERVAL iterations (default: 3)
   - oslat test: before running tests
   - cpu-util test: before running tests

4. QUIT MODES:
   - 'graceful' (exit 0): Used by test steps. Allows remaining steps like
     PTP reporting to complete. Job exits cleanly.
   - 'force' (exit 1): Used by cluster-install. If installation is interrupted,
     remaining steps are meaningless. Job aborts immediately.

5. CLEANUP: Each job always removes its own waiting file during cleanup,
   regardless of whether it acquired the lock. This prevents orphaned files.

Key benefits:
- Lock-free: No shared mutable state, each job manages its own file
- Race-safe: Nanosecond timestamps ensure unique filenames
- Deferred deletion: Waiting file persists until validation passes
- Self-cleaning: Jobs clean up only their own files
- Configurable: QUIT_CHECK_INTERVAL controls check frequency

New shared functions:
- extract_ocp_version: Gets version from JOB_NAME
- create_waiting_request_file: Creates unique waiting file
- remove_own_waiting_file: Removes job's waiting file
- check_for_higher_priority_waiter: Scans waiting files for higher version
- should_quit: Determines if quit is needed
- check_for_quit: Main entry point (supports graceful/force modes)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
@ccardenosa ccardenosa force-pushed the feat/ztp-left-shifting-improve-baremetal-server-resource-utilization branch from c8f7c83 to 53e6bf8 Compare December 24, 2025 13:31
@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot

[REHEARSALNOTIFIER]
@ccardenosa: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
|---|---|---|---|
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.18-telcov10n-metal-single-node-spoke | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.19-telcov10n-metal-single-node-spoke | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis | N/A | periodic | Registry content changed |

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci bot commented Dec 24, 2025

@ccardenosa: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/rehearse/periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis | 53e6bf8 | link | unknown | /pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis |
| ci/rehearse/periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis | 53e6bf8 | link | unknown | /pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis |
| ci/rehearse/periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis | 53e6bf8 | link | unknown | /pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.20-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.21-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-nightly-4.22-telcov10n-metal-single-node-spoke-kpis

@openshift-ci-robot

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
