Conversation

@martenole
Contributor

Sends the CCDB object without relying on the CCDB manager.

Not sure if we want to merge the second commit; it would enable the calibration by default in the FST.

@martenole martenole requested review from a team, bazinski, jolopezl and tdietel as code owners April 27, 2022 18:14
@shahor02
Collaborator

> Not sure if we want to merge the second commit. This would enable the calibration per default in the FST.

Actually, since the workflows from calib-workflow.sh are attached to the (sync) processing workflow, running the aggregators there (e.g. vdrift and meanvertex) makes sense only in the FST. On the EPN they should be replaced by output proxies sending the data requested by the aggregators, plus a separate workflow started on the aggregator node.
Since at the moment we are not running online calibrations on the EPNs, I think the current code is fine, but we should eventually substitute these aggregators with the corresponding output proxies and move the aggregators to a separate script, as in O2DPG/DATA/production/calib. Then they can also be started in the FST as separate workflows on the same PC.
I don't know what plans @davidrohr has on this.

@davidrohr
Collaborator

Fine with me in general. I am also working on some automation to run the aggregators locally, though still via two proxies, in order to run the full calib workflow in the FST.
But before merging I want to test the performance and stability in the 8-GPU FST on the EPNs; I will do that next week.

@chiarazampolli
Collaborator

Ciao @davidrohr ,

I am also testing the aggregator with calibrations run locally. Just to avoid duplicating efforts: are you working only on the FST framework, or also on integrating the calibrations? I am working on the latter.

Chiara

@davidrohr
Collaborator

Just working on the framework, not on any individual aggregators.

@chiarazampolli
Collaborator

OK, I need to fix something first, but then I will commit the first version of the scripts (hopefully tomorrow), so that you can take a look and see whether it fits.
Thanks!!!

@davidrohr
Collaborator

@davidrohr left a comment

With this PR, the full system test on the EPN with 8 GPUs produces backpressure, and I see that the SHM runs full. (I retried the exact same software without this PR, and then it works.)

To reproduce, on the EPN:

GEN_TOPO_WORKDIR=/home/drohr/tmp4/ QC_REDIRECT_MERGER_TO_LOCALHOST=1 WORKFLOW_PARAMETERS=QC,CALIB,EVENT_DISPLAY WORKFLOW_DETECTORS_QC=FT0,FV0,FDD,ZDC,TOF,ITS,MFT,MID,MCH,EMC,PHS,CPV CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow="GPU_global.benchmarkMemoryRegistration=1;" TFDELAY=5 NTIMEFRAMES=100 $O2_ROOT/prodtests/full-system-test/start_tmux.sh dd

To check the SHM:

fairmq-shmmonitor -v -i

@martenole
Contributor Author

Thanks a lot for checking, @davidrohr. I am having a look.

@martenole
Contributor Author

The backpressure was traced back to a bug in the workflow: the output from the TrackBasedCalib was sent only for every 200th TF but was not declared sporadic. This is now fixed, and in its current state this PR no longer causes backpressure in the FST on the EPNs.
Currently I send the output for every TF. I also tried declaring it Lifetime::Sporadic, but that again led to backpressure.

@shahor02
Collaborator

shahor02 commented May 5, 2022

> So the backpressure was traced back to a bug in the workflow. The output from the TrackBasedCalib was sent only for every 200th TF, but the output was not declared sporadic. This is now fixed and with the current status this PR does not cause backpressure anymore on the FST on the EPNs. Currently I send the output for every TF. I tried also to declare it Lifetime::Sporadic but this then again lead to backpressure.

Yes, Sporadic is not working as expected; @ktf is checking.

@martenole
Contributor Author

Hi @davidrohr, is it OK for you to merge this as it is now? As mentioned above, there is no backpressure anymore now that the output is sent for every TF.

@davidrohr davidrohr merged commit b0dfbb8 into AliceO2Group:dev May 6, 2022
if [[ $BEAMTYPE != "cosmic" ]]; then
has_detector_calib TPC && has_detectors TPC ITS TRD TOF && add_W o2-tpc-scdcalib-interpolation-workflow "$DISABLE_ROOT_OUTPUT --disable-root-input --pipeline $(get_N tpc-track-interpolation TPC REST)" "$ITSMFT_FILES"
has_detector_calib ITS && has_detectors ITS && has_detectors_reco ITS && has_detector_matching PRIMVTX && [[ ! -z "$VERTEXING_SOURCES" ]] && add_W o2-calibration-mean-vertex-calibration-workflow
has_detector_calib TRD && has_detector ITS TPC TRD && add_W o2-calibration-trd-vdrift-exb
@chiarazampolli
Collaborator

@chiarazampolli May 6, 2022

Here it should be has_detectors, plural, no? Anyway, see my other comment #8668 (comment)
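For context, a minimal sketch of the plausible difference between the two helpers (hypothetical re-implementations and an example WORKFLOW_DETECTORS value, not the actual O2DPG code): the singular has_detector would check only its first argument, silently ignoring the rest, while the plural has_detectors requires every listed detector to be enabled.

```shell
# Hypothetical sketch, NOT the actual O2DPG helpers: one plausible reason why
# calling the singular form with several detectors is a bug.
WORKFLOW_DETECTORS="ITS,TPC,TOF"   # example value; TRD deliberately missing

has_detector() {
  # checks only $1; any further arguments are silently ignored
  [[ ",$WORKFLOW_DETECTORS," == *",$1,"* ]]
}

has_detectors() {
  # succeeds only if every listed detector is enabled
  local det
  for det in "$@"; do
    has_detector "$det" || return 1
  done
  return 0
}

has_detector ITS TPC TRD && echo "singular: passes, TRD was never checked"
has_detectors ITS TPC TRD || echo "plural: fails, TRD is not enabled"
```

Under these assumed semantics, the singular call would silently enable the TRD vdrift workflow even when TRD is not in the detector list, which matches why the plural form is the correct guard here.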

@martenole
Contributor Author
Yes, you are right, thanks! I see you have already corrected that in AliceO2Group/O2DPG#375.

@chiarazampolli
Collaborator

Yes, I did, but of course the other PR is a complete change with respect to yours, and your calibration won't run in the FST for now. Is that OK?

@martenole
Contributor Author

For me, in principle, yes. If I understand correctly, with your changes none of the aggregators would be tested in the FST by default. I wonder whether we should test them in the CI, or at least enable them when we install a new O2 version at P2 and let the FST run for a longer time to check stability. But I leave it to @davidrohr to comment here (probably better directly in #8736).

@chiarazampolli
Collaborator

Hello,

Please see:

AliceO2Group/O2DPG#375
#8736

Chiara

@chiarazampolli chiarazampolli mentioned this pull request May 6, 2022