Optionally use TF slice instead of TFcounter for CCDB cache validation #14652

shahor02 · 2025-09-04T23:50:02Z

No description provided.

github-actions · 2025-09-04T23:50:16Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

shahor02 · 2025-09-04T23:50:21Z

@ehellbar

alibuild · 2025-09-05T01:03:08Z

Error while checking build/O2/fullCI_slc9 for 5acfb86 at 2025-09-05 03:03:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14652-slc9_x86-64/0/Framework/CCDBSupport/src/CCDBFetcherHelper.cxx:202:56: error: 'struct o2::framework::TimingInfo' has no member named 'slice'
/sw/SOURCES/O2/14652-slc9_x86-64/0/Framework/CCDBSupport/src/CCDBHelpers.cxx:286:56: error: 'struct o2::framework::TimingInfo' has no member named 'slice'
/sw/SOURCES/O2/14652-slc9_x86-64/0/Framework/CCDBSupport/src/CCDBHelpers.cxx:389:62: error: 'struct o2::framework::TimingInfo' has no member named 'slice'
ninja: build stopped: subcommand failed.

Full log here.

davidrohr · 2025-09-05T05:30:30Z

If I understand correctly, you check that the timeSlice counter did not jump much and if not, you keep the cached CCDB object.
One should keep in mind that there is no guarantee about how much the TFcounter may jump between time slices.

Particularly in high-load situations, when the MI50 nodes go into backpressure and the MI100 nodes do not, there can easily be a difference in processing delay of the order of 1 minute between TFs arriving at the calib node from different reco nodes.

Thus I think the tfCounter is the much safer choice. If we allow a timeslice difference of let's say 4, and in that range we can have a tfCounter difference of 200, shouldn't we then just allow a tfCounter difference of 200 for CCDB fetching? It should have a similar effect, but at least then we have a real limit for the lifetime of the validity.

shahor02 · 2025-09-05T07:55:44Z

@davidrohr, I know; this is to address https://ali-bookkeeping.cern.ch/?page=log-detail&id=134453. We need an effective way to prescale the CCDB queries on the aggregator node, and given that the TFcounters arrive quasi-unordered and with a large spread, prescaling on the TFcounter difference is not effective.
As an alternative, I am considering simply prescaling on TFcounter%counter (this will not work on EPNs due to the "aliasing").
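One plausible reading of the "aliasing" remark can be sketched in a few lines of Python. This is an illustration, not O2 code: the helper `checks_with_modulo` is hypothetical, and the assumption is that a single node sees tfCounters advancing by a fixed stride (202 in the log quoted later in this thread). If the prescale value divides the stride, tfCounter%prescale never changes on that node, so the node either queries on every TF or never.

```python
def checks_with_modulo(first_tf, stride, n_tfs, prescale):
    """For each TF seen by one node, tell whether tfCounter % prescale == 0.

    Hypothetical helper: assumes the node sees tfCounters first_tf,
    first_tf + stride, first_tf + 2 * stride, ...
    """
    return [(first_tf + i * stride) % prescale == 0 for i in range(n_tfs)]


# Stride 202 and first counter 367148 are taken from the log in this thread;
# a prescale of 2 divides the stride, so the remainder is frozen.
even_start = checks_with_modulo(367148, 202, 8, 2)  # queries on every TF
odd_start = checks_with_modulo(367149, 202, 8, 2)   # never queries

print(even_start)
print(odd_start)
```

On an aggregator node that receives TFs from many EPNs, the counters are not locked to one stride, which is presumably why the modulo variant is considered usable there.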

shahor02 · 2025-09-05T10:09:01Z

@ehellbar I've tested the extra options of this PR on the interpolation workflow (imposing isOnline mode).
Showing only the grep -E 'TPC/Calib/Pressure|Processing timeslice' results.

default behaviour:

o2-tpc-scdcalib-interpolation-workflow ...
[11:10:17][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF
...
[11:10:18][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:10:20][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:10:20][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757063417-yfxot3), 
[11:10:21][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914413503
[11:10:21][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414077
[11:10:21][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414652
[11:10:21][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415227
[11:10:21][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415802
[11:10:21][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416377
[11:10:21][INFO] Processing timeslice:7, tfCounter:368562, firstTForbit:258782464, runNumber:562862, creation:1746914417429, action:0
[11:10:21][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416951

Old-style prescaling on tfCounter with --condition-tf-per-query-multiplier 3. In this case it does not help, since I process data from a single EPN, so the TFcounters are staggered by hundreds (202 between consecutive slices here):

[11:13:14][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF, (query for high-rate objects downscaled by 3)
[11:13:15][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:13:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:13:17][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757063594-MUwdfG), 
[11:13:17][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:13:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914413503
[11:13:18][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414077
[11:13:18][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414652
[11:13:18][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415227
[11:13:18][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415802
[11:13:18][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:13:18][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416377

Using timeslice instead of tfCounter + prescaling --condition-tf-per-query-multiplier 3 --condition-use-slice-for-prescaling

[11:32:00][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF(slice!), (query for high-rate objects downscaled by 3)
...
[11:32:01][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:32:03][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:32:03][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757064720-rrl9vd), 
[11:32:04][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:32:04][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:32:04][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:32:04][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914414652
[11:32:04][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:32:04][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:32:04][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:32:04][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416377
[11:32:04][INFO] Processing timeslice:7, tfCounter:368562, firstTForbit:258782464, runNumber:562862, creation:1746914417429, action:0
[11:32:04][INFO] Processing timeslice:8, tfCounter:368764, firstTForbit:258788928, runNumber:562862, creation:1746914418004, action:0
...
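The difference between the two logs above can be reproduced with a small Python sketch. It is not the O2 implementation: the function name, its parameters and the exact comparison (`>=`) are assumptions chosen to match the pattern seen in the logs. The (timeslice, tfCounter) pairs are copied from the log lines.

```python
# (timeslice, tfCounter) pairs copied from the "Processing timeslice" log lines
TFS = [(0, 367148), (1, 367350), (2, 367552), (3, 367754), (4, 367956),
       (5, 368158), (6, 368360), (7, 368562), (8, 368764)]


def slices_that_query(tfs, min_step, use_slice):
    """Return the timeslices at which a CCDB query would be issued.

    Assumed rule: query on the first TF, then whenever the chosen key
    (timeslice or tfCounter) has advanced by at least min_step since
    the last query.
    """
    queried = []
    last_key = None
    for tslice, tf_counter in tfs:
        key = tslice if use_slice else tf_counter
        if last_key is None or key - last_key >= min_step:
            queried.append(tslice)
            last_key = key
    return queried


by_counter = slices_that_query(TFS, 3, use_slice=False)
by_slice = slices_that_query(TFS, 3, use_slice=True)
print("prescaling on tfCounter:", by_counter)
print("prescaling on timeslice:", by_slice)
```

With a tfCounter step of 202 between slices, a threshold of 3 on the counter difference is exceeded on every slice, while the same threshold on the slice number leaves only slices 0, 3 and 6, as in the second log.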

Prescaling on tfCounter%|prescaling| with --condition-tf-per-query-multiplier -3:

[11:35:12][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 2147483647 TF, (query downscaled as TFcounter%3)
...
[11:35:14][INFO] Processing timeslice:0, tfCounter:367148, firstTForbit:258737216, runNumber:562862, creation:1746914413403, action:0
[11:35:15][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914412928
Info in <TJAlienFile::Open>: Accessing file /alice/data/CCDB/TPC/Calib/Pressure/02/31068/c45cff01-7e87-11f0-b2b3-808de0f5250c in SE <ALICE::CERN::OCDB>
[11:35:15][INFO] ccdb reads http://alice-ccdb.cern.ch/TPC/Calib/Pressure/1746057957000/c45cff01-7e87-11f0-b2b3-808de0f5250c for 1746914412928 (load to memory, agent_id: alicers05a-1757064912-JCO6tG), 
[11:35:17][INFO] Processing timeslice:1, tfCounter:367350, firstTForbit:258743680, runNumber:562862, creation:1746914413980, action:0
[11:35:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914413503
[11:35:17][INFO] Processing timeslice:2, tfCounter:367552, firstTForbit:258750144, runNumber:562862, creation:1746914414555, action:0
[11:35:17][INFO] Processing timeslice:3, tfCounter:367754, firstTForbit:258756608, runNumber:562862, creation:1746914415130, action:0
[11:35:17][INFO] Processing timeslice:4, tfCounter:367956, firstTForbit:258763072, runNumber:562862, creation:1746914415705, action:0
[11:35:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914415227
[11:35:17][INFO] Processing timeslice:5, tfCounter:368158, firstTForbit:258769536, runNumber:562862, creation:1746914416279, action:0
[11:35:17][INFO] Processing timeslice:6, tfCounter:368360, firstTForbit:258776000, runNumber:562862, creation:1746914416853, action:0
[11:35:17][INFO] Processing timeslice:7, tfCounter:368562, firstTForbit:258782464, runNumber:562862, creation:1746914417429, action:0
[11:35:17][DETAIL] Loading TPC/Calib/Pressure for timestamp 1746914416951
[11:35:17][INFO] Processing timeslice:8, tfCounter:368764, firstTForbit:258788928, runNumber:562862, creation:1746914418004, action:0
...
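The loads in this last log fall on slices 0, 1, 4 and 7. A minimal Python sketch of a rule that reproduces this pattern follows; it assumes a query is made when no object is cached yet or when tfCounter%|prescale| is zero. The function name is hypothetical and this is not the O2 code.

```python
# tfCounter values of timeslices 0..8, copied from the log lines above
TF_COUNTERS = [367148, 367350, 367552, 367754, 367956,
               368158, 368360, 368562, 368764]


def slices_that_query_modulo(tf_counters, prescale):
    """Return the timeslice indices at which a query would be issued.

    Assumed rule: always query while nothing is cached yet, afterwards
    only when tfCounter % |prescale| == 0.
    """
    queried = []
    have_object = False
    for tslice, tf_counter in enumerate(tf_counters):
        if not have_object or tf_counter % abs(prescale) == 0:
            queried.append(tslice)
            have_object = True
    return queried


modulo_slices = slices_that_query_modulo(TF_COUNTERS, -3)
print(modulo_slices)
```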

alibuild · 2025-09-08T00:38:12Z

Error while checking build/O2/fullCI_slc9 for b57e509 at 2025-09-08 03:09:

No log files found

Full log here.

davidrohr · 2025-09-08T08:17:07Z

@shahor02 : But TFCounters do not arrive fully unordered, they are shuffled within the processing latency of the EPNs. So if we allow something like a TFCounter difference of +/- 2 minutes or so, wouldn't that work? And it would be more precise and more explicit than a timeslice-based prescaling.
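To get a feeling for what a window of 2 minutes means in tfCounter units, one can estimate the TF length from the log lines quoted earlier in this thread. The Python sketch below uses the tfCounter, firstTForbit and creation values of timeslices 0 and 8; it assumes that the creation timestamps track the data-taking time, and `tf_counter_window` is a hypothetical helper, not an O2 function.

```python
# (tfCounter, firstTForbit, creation [ms]) from the "Processing timeslice"
# log lines of timeslices 0 and 8 quoted in this thread
FIRST = (367148, 258737216, 1746914413403)
LAST = (368764, 258788928, 1746914418004)

d_tf = LAST[0] - FIRST[0]
orbits_per_tf = (LAST[1] - FIRST[1]) / d_tf  # orbits covered by one TF
ms_per_tf = (LAST[2] - FIRST[2]) / d_tf      # average TF length in ms


def tf_counter_window(minutes):
    """Approximate tfCounter difference spanned by `minutes` of data."""
    return round(minutes * 60_000 / ms_per_tf)


print("orbits per TF:", orbits_per_tf)
print("ms per TF:", round(ms_per_tf, 3))
print("tfCounter window for 2 min:", tf_counter_window(2))
print("tfCounter window for 5 min:", tf_counter_window(5))
```

For this run a window of 2 minutes corresponds to roughly 42,000 TFs, to be compared with the step of about 200 TFs between consecutive slices on a single EPN.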

shahor02 · 2025-09-08T08:35:19Z

Why would it be more precise to update on |T_currentTF - T_lastcheckedTF| > 2min than to check every n-th slice? The first will have a guaranteed large error (we have objects with 5 min validity) but no outliers. The second will have some very rare outliers but a much smaller error on average.

davidrohr · 2025-09-08T08:38:40Z

OK, if you want a higher update rate for most cases, I would use the logical OR of both conditions.
The check on the tfCounter will then prevent large outliers (even if they might be rare).

…TFcounter for CCDB cache validation is N!=0

If the --condition-tf-per-query-multiplier value is negative, the prescaling is simply applied to tfCounter%|query_rate| (or timeslice%|query_rate| if --condition-use-slice-for-prescaling is requested).

If N>0, then enforce a check if the abs difference between the last checked and current TFCounters (not slices!) exceeds N, even if the slices difference is less than the requested check rate.

shahor02 · 2025-09-08T11:17:02Z

OK, modified the --condition-tf-per-query-multiplier to int.

If it is N>0, then enforce a check if the abs difference between the last checked and current TFCounters (not slices!) exceeds N,
even if the slices difference is less than the requested check rate.
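The combined rule agreed on here (prescale on the slice, but also enforce a check when the tfCounter moved too far) can be sketched as follows. This is a Python illustration under stated assumptions, not the merged C++ code: the function and parameter names are hypothetical, the comparison operators are guesses, and the arrival sequence is invented to show a jump in tfCounter.

```python
def slices_that_query(tfs, slice_step, max_tf_excursion):
    """Return the timeslices at which a CCDB check would be made.

    Assumed rule: check on the first TF, when the slice advanced by at
    least slice_step since the last check, OR (if max_tf_excursion > 0)
    when |tfCounter - last checked tfCounter| exceeds max_tf_excursion.
    """
    queried = []
    last_slice = last_tf = None
    for tslice, tf_counter in tfs:
        first = last_slice is None
        slice_due = not first and tslice - last_slice >= slice_step
        tf_due = (not first and max_tf_excursion > 0
                  and abs(tf_counter - last_tf) > max_tf_excursion)
        if first or slice_due or tf_due:
            queried.append(tslice)
            last_slice, last_tf = tslice, tf_counter
    return queried


# Invented arrival order: the tfCounter jumps by 3990 between slices 1 and 2
ARRIVALS = [(0, 1000), (1, 1010), (2, 5000), (3, 5010), (4, 5020), (5, 5030)]

slice_only = slices_that_query(ARRIVALS, slice_step=4, max_tf_excursion=0)
with_limit = slices_that_query(ARRIVALS, slice_step=4, max_tf_excursion=2000)
print("slice prescaling only:", slice_only)
print("with tfCounter limit: ", with_limit)
```

With slice prescaling alone the jump goes unnoticed until slice 4; with the tfCounter limit the check is forced at slice 2, which is the outlier protection asked for above.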

alibuild · 2025-09-08T11:23:57Z

Error while checking build/O2/fullCI_slc9 for 3a25c58 at 2025-09-08 13:23:

No log files found

Full log here.

shahor02 requested a review from a team as a code owner September 4, 2025 23:50

shahor02 force-pushed the pr_ccdbh_TFSl branch from 5acfb86 to 432055c Compare September 5, 2025 09:44

shahor02 force-pushed the pr_ccdbh_TFSl branch from 432055c to b57e509 Compare September 5, 2025 10:16

shahor02 force-pushed the pr_ccdbh_TFSl branch from b57e509 to 3a25c58 Compare September 8, 2025 11:15

shahor02 merged commit c1cd2a6 into AliceO2Group:dev Sep 8, 2025
8 of 11 checks passed

singiamtel added a commit that referenced this pull request Sep 9, 2025

Fix macro formatting

f33e8c6

Regression from #14652

singiamtel mentioned this pull request Sep 9, 2025

Fix macro formatting #14661

Merged
