@shahor02
Collaborator

@shahor02 shahor02 commented Sep 4, 2025

No description provided.

@shahor02 shahor02 requested a review from a team as a code owner September 4, 2025 23:50
@github-actions
Contributor

github-actions bot commented Sep 4, 2025

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@shahor02
Collaborator Author

shahor02 commented Sep 4, 2025

@ehellbar

@alibuild
Collaborator

alibuild commented Sep 5, 2025

Error while checking build/O2/fullCI_slc9 for 5acfb86 at 2025-09-05 03:03:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14652-slc9_x86-64/0/Framework/CCDBSupport/src/CCDBFetcherHelper.cxx:202:56: error: 'struct o2::framework::TimingInfo' has no member named 'slice'
/sw/SOURCES/O2/14652-slc9_x86-64/0/Framework/CCDBSupport/src/CCDBHelpers.cxx:286:56: error: 'struct o2::framework::TimingInfo' has no member named 'slice'
/sw/SOURCES/O2/14652-slc9_x86-64/0/Framework/CCDBSupport/src/CCDBHelpers.cxx:389:62: error: 'struct o2::framework::TimingInfo' has no member named 'slice'
ninja: build stopped: subcommand failed.

Full log here.

@davidrohr
Collaborator

If I understand correctly, you check that the timeSlice counter did not jump much and if not, you keep the cached CCDB object.
One should keep in mind that there is no guarantee about how much the TFcounter may jump between time slices.

Particularly in high-load situations, when the MI50 nodes go into backpressure and the MI100 nodes do not, there can easily be a difference in processing delay of the order of 1 minute between TFs arriving at the calib node from different reco nodes.

Thus I think the tfCounter is the much safer choice. If we allow a timeslice difference of let's say 4, and in that range we can have a tfCounter difference of 200, shouldn't we then just allow a tfCounter difference of 200 for CCDB fetching? It should have a similar effect, but at least then we have a real limit for the lifetime of the validity.
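
The tfCounter-based rule suggested above can be sketched roughly as follows. This is only an illustration, not the O2 implementation: the function names, the arrival order, and the strict `>` comparison are assumptions, and the threshold of 200 is simply the example figure from the comment.

```python
def needs_ccdb_check(tf_counter, last_checked_tf, max_tf_diff=200):
    """Return True if the cached object should be re-validated.

    Hypothetical helper: the cache is re-checked once the current tfCounter
    is more than max_tf_diff away (in either direction) from the tfCounter
    at which the object was last validated.
    """
    if last_checked_tf is None:  # nothing cached yet
        return True
    return abs(tf_counter - last_checked_tf) > max_tf_diff


def count_checks(arrivals, max_tf_diff=200):
    """Count how many arriving TFs trigger a re-validation."""
    last_checked = None
    checks = 0
    for tf in arrivals:
        if needs_ccdb_check(tf, last_checked, max_tf_diff):
            last_checked = tf
            checks += 1
    return checks


# Illustrative arrival order: TFs from different nodes arrive out of order.
arrivals = [1000, 1050, 900, 1150, 1210, 1020, 1400]
print(count_checks(arrivals))
```

With such a rule the distance between the current TF and the last validated one is bounded explicitly by `max_tf_diff`, whatever the arrival order.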

@shahor02
Collaborator Author

shahor02 commented Sep 5, 2025

@davidrohr, I know, it is to address this: https://ali-bookkeeping.cern.ch/?page=log-detail&id=134453. We need an effective way to prescale the CCDB queries on the aggregator node, and given that the TFCounters arrive quasi-unordered and with a large spread, prescaling on the TFcounter difference is not effective.
As an alternative, I am considering simply prescaling on TFcounter%N, with N the prescaling factor (this will not work on EPNs due to the "aliasing").
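
A small sketch of modulo-based prescaling, and of one plausible reading of the "aliasing" problem: if the TF counters seen by a node advance with a stride that is a multiple of the prescaling factor, every TF has the same remainder and the node queries either always or never. The function names, strides, and starting values are illustrative assumptions, not taken from the O2 code.

```python
def queries_by_modulo(tf_counters, n):
    """Return the tfCounters for which tfCounter % n == 0 (query is issued)."""
    return [tf for tf in tf_counters if tf % n == 0]


def query_fraction(start, stride, n, count=300):
    """Fraction of TFs that would trigger a query for a regular stride."""
    tfs = [start + i * stride for i in range(count)]
    return len(queries_by_modulo(tfs, n)) / count


# Aggregator-like input: consecutive counters, about 1 in 3 TFs queries.
print(query_fraction(start=0, stride=1, n=3))
# Stride that is a multiple of n: all TFs share one remainder,
# so the node queries either always or never.
print(query_fraction(start=0, stride=201, n=3))  # always
print(query_fraction(start=1, stride=201, n=3))  # never
```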

@shahor02
Collaborator Author

shahor02 commented Sep 5, 2025

@ehellbar I've tested the extra options of this PR (imposing isOnline mode) on the interpolation workflow.
Showing only grep -E 'TPC/Calib/Pressure|Processing timeslice' results.

  1. default behaviour:
o2-tpc-scdcalib-interpolation-workflow ...
[11:10:17][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF
...
[11:10:18][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:10:20][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
[Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:10:20][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757063417-yfxot3), 
[11:10:21][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914413503
[11:10:21][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414077
[11:10:21][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414652
[11:10:21][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415227
[11:10:21][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415802
[11:10:21][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416377
[11:10:21][INFO] Processing timeslice:7, tfCounter:368562, firstTForbit:258782464, runNumber:562862, creation:1746914417429, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416951
  2. Old-style prescaling on tfCounter with --condition-tf-per-query-multiplier 3. In this case it does not help, since I process data from a single EPN, so the TFcounters are staggered by 100s...
[11:13:14][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF, (query for high-rate objects downscaled by 3)
[11:13:15][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:13:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:13:17][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757063594-MUwdfG), 
[11:13:17][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:13:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914413503
[11:13:18][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414077
[11:13:18][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414652
[11:13:18][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415227
[11:13:18][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415802
[11:13:18][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416377
  3. Using timeslice instead of tfCounter + prescaling --condition-tf-per-query-multiplier 3 --condition-use-slice-for-prescaling
[11:32:00][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF(slice!), (query for high-rate objects downscaled by 3)
...
[11:32:01][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:32:03][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:32:03][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757064720-rrl9vd), 
[11:32:04][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:32:04][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:32:04][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:32:04][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414652
[11:32:04][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:32:04][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:32:04][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:32:04][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416377
[11:32:04][INFO] Processing timeslice:7, tfCounter:368562, firstTForbit:258782464, runNumber:562862, creation:1746914417429, action:0
[11:32:04][INFO] Processing timeslice:8, tfCounter:368764, firstTForbit:258788928, runNumber:562862, creation:1746914418004, action:0
...
  4. Prescaling on tfCounter%|prescaling| with --condition-tf-per-query-multiplier -3:
[11:35:12][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF, (query downscaled as TFcounter%3)
...
[11:35:14][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:35:15][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:35:15][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757064912-JCO6tG), 
[11:35:17][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:35:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914413503
[11:35:17][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:35:17][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:35:17][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:35:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415227
[11:35:17][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:35:17][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:35:17][INFO] Processing timeslice:7, tfCounter:368562, firstTForbit:258782464, runNumber:562862, creation:1746914417429, action:0
[11:35:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416951
[11:35:17][INFO] Processing timeslice:8, tfCounter:368764, firstTForbit:258788928, runNumber:562862, creation:1746914418004, action:0
...
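
The query patterns in the logs above can be reproduced with a short replay. This is a sketch under assumptions, not the actual O2 logic: the helper names are hypothetical, the `>=` comparison and the "first TF always queries" rule are chosen only because they match the `Loading TPC/Calib/Pressure` lines shown. The tfCounter values are copied from the logs.

```python
TF_COUNTERS = [367148, 367350, 367552, 367754, 367956,
               368158, 368360, 368562, 368764]  # timeslices 0..8 from the logs


def by_difference(values, n):
    """Indices queried when |value - last queried value| >= n."""
    queried, last = [], None
    for i, v in enumerate(values):
        if last is None or abs(v - last) >= n:
            queried.append(i)
            last = v
    return queried


def by_modulo(values, n):
    """Indices queried when value % n == 0; the first TF always queries."""
    return [i for i, v in enumerate(values) if i == 0 or v % n == 0]


slices = list(range(len(TF_COUNTERS)))
print(by_difference(TF_COUNTERS, 3))  # tfCounter difference: every slice
print(by_difference(slices, 3))       # slice difference
print(by_modulo(TF_COUNTERS, 3))      # tfCounter % 3
```

Since the tfCounters from a single EPN are about 200 apart, a difference threshold of 3 never suppresses a query, whereas the slice-based and modulo-based variants query roughly every third TF.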

@alibuild
Collaborator

alibuild commented Sep 8, 2025

Error while checking build/O2/fullCI_slc9 for b57e509 at 2025-09-08 03:09:

No log files found

Full log here.

@davidrohr
Collaborator

@shahor02 : But TFCounters do not arrive fully unordered, they are shuffled within the processing latency of the EPNs. So if we allow something like a TFCounter difference of +/- 2 minutes or so, wouldn't that work? And it would be more precise and more explicit than a timeslice-based prescaling.

@shahor02
Collaborator Author

shahor02 commented Sep 8, 2025

Why would updating on |T_currentTF - T_lastcheckedTF| > 2min be more precise than checking every n-th slice? The first has a guaranteed large error (we have objects with 5 min validity) but no outliers. The second will have some very rare outliers but a much smaller error on average.

@davidrohr
Collaborator

OK, if you want a higher update rate in most cases, I would use the logical OR of both conditions.
The check on the tfCounter will then prevent large outliers (even if they might be rare).

…TFcounter for CCDB cache validation is N!=0

If the --condition-tf-per-query-multiplier value is negative, the prescaling is simply
applied to tfCounter%|query_rate| (or timeslice%|query_rate| if --condition-use-slice-for-prescaling is requested).

If N>0, then enforce a check if the absolute difference between the last checked and current TFCounters (not slices!) exceeds N,
even if the slice difference is less than the requested check rate.
@shahor02
Collaborator Author

shahor02 commented Sep 8, 2025

OK, modified the --condition-tf-per-query-multiplier to int.

If N>0, then enforce a check if the absolute difference between the last checked and current TFCounters (not slices!) exceeds N,
even if the slice difference is less than the requested check rate.
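
The combined rule (slice-based prescaling OR a cap on the tfCounter difference) could look roughly like this. It is a minimal sketch under stated assumptions: `check_rate` and `max_tf_gap` are hypothetical parameter names standing in for the requested check rate and N, the exact comparisons are guesses, and the input stream is invented for illustration.

```python
def should_check(slice_idx, tf_counter, last_slice, last_tf,
                 check_rate, max_tf_gap):
    """Decide whether to re-validate the cached CCDB object.

    check_rate  -- hypothetical name: re-check every check_rate-th slice
    max_tf_gap  -- hypothetical name for N: if > 0, force a re-check once the
                   tfCounter moved by more than N since the last check
    """
    if last_slice is None:  # nothing validated yet
        return True
    slice_due = abs(slice_idx - last_slice) >= check_rate
    tf_due = max_tf_gap > 0 and abs(tf_counter - last_tf) > max_tf_gap
    return slice_due or tf_due


def replay(stream, check_rate, max_tf_gap):
    """Return the slice indices at which a check is performed."""
    checked, last_slice, last_tf = [], None, None
    for slice_idx, tf in stream:
        if should_check(slice_idx, tf, last_slice, last_tf,
                        check_rate, max_tf_gap):
            checked.append(slice_idx)
            last_slice, last_tf = slice_idx, tf
    return checked


# (slice, tfCounter) pairs; the jump at slice 2 is an illustrative outlier.
stream = [(0, 100), (1, 110), (2, 900), (3, 910), (4, 920), (5, 930)]
print(replay(stream, check_rate=4, max_tf_gap=0))    # slice rule only
print(replay(stream, check_rate=4, max_tf_gap=500))  # slice rule OR tfCounter gap
```

With the tfCounter cap enabled, the large jump at slice 2 triggers a check immediately instead of waiting for the next slice-based check.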

@alibuild
Collaborator

alibuild commented Sep 8, 2025

Error while checking build/O2/fullCI_slc9 for 3a25c58 at 2025-09-08 13:23:

No log files found

Full log here.

@shahor02 shahor02 merged commit c1cd2a6 into AliceO2Group:dev Sep 8, 2025
8 of 11 checks passed
singiamtel added a commit that referenced this pull request Sep 9, 2025
Regression from #14652
@singiamtel singiamtel mentioned this pull request Sep 9, 2025

3 participants