From 7f9f4913273ecb9aa6a608d82dcee0ea5531efc5 Mon Sep 17 00:00:00 2001 From: David Rohr Date: Thu, 24 Apr 2025 22:45:14 +0200 Subject: [PATCH 1/2] Update / add documentation for FST --- .../documentation/dpl-workflow-options.md | 55 ++++++++ .../documentation/env-variables.md | 51 +++++++ .../full-system-test-as-stress-test.md | 33 +++++ .../documentation/full-system-test-setup.md | 124 ++++++++++++++++++ .../full-system-test.md} | 6 +- .../documentation/raw-data-simulation.md | 43 ++++++ 6 files changed, 309 insertions(+), 3 deletions(-) create mode 100644 prodtests/full-system-test/documentation/dpl-workflow-options.md create mode 100644 prodtests/full-system-test/documentation/env-variables.md create mode 100644 prodtests/full-system-test/documentation/full-system-test-as-stress-test.md create mode 100644 prodtests/full-system-test/documentation/full-system-test-setup.md rename prodtests/full-system-test/{README.md => documentation/full-system-test.md} (95%) create mode 100644 prodtests/full-system-test/documentation/raw-data-simulation.md diff --git a/prodtests/full-system-test/documentation/dpl-workflow-options.md b/prodtests/full-system-test/documentation/dpl-workflow-options.md new file mode 100644 index 0000000000000..f79e481ce0723 --- /dev/null +++ b/prodtests/full-system-test/documentation/dpl-workflow-options.md @@ -0,0 +1,55 @@ +# Configuration options +You can use the following options to change the workflow behavior: +- `DDMODE` (default `processing`) : Must be `processing` (synchronous processing) or `processing-disk` (synchronous processing + storing of raw time frames to disk, note that this is the raw time frame not the CTF!). The `DDMODE` `discard` and `disk` are not compatible with the synchronous processing workflow, you must use the `no-processing.desc` workflow instead!. +- `WORKFLOW_DETECTORS` (default `ALL`) : Comma-separated list of detectors for which the processing is enabled. If these are less detectors than participating in the run, data of the other detectors is ignored. If these are more detectors than participating in the run, the processes for the additional detectors will be started but will not do anything. +- `WORKFLOW_DETECTORS_QC` (default `ALL`) : Comma-separated list of detectors for which to run QC, can be a subset of `WORKFLOW_DETECTORS` (for standalone detectors QC) and `WORKFLOW_DETECTORS_MATCHING` (for matching/vertexing QC). If a detector (matching/vertexing step) is not listed in `WORKFLOW_DETECTORS` (`WORKFLOW_DETECTORS_MATCHING`), the QC is automatically disabled for that detector. Only active if the `WORKFLOW_PARAMETER=QC` is set. +- `WORKFLOW_DETECTORS_CALIB` (default `ALL`) : Comma-separated list of detectors for which to run calibration, can be a subset of `WORKFLOW_DETECTORS`. If a detector is not listed in `WORKFLOW_DETECTORS`, the calibration is automatically disabled for that detector. Only active if the `WORKFLOW_PARAMETER=CALIB` is set. +- `WORKFLOW_DETECTORS_FLP_PROCESSING` (default `TOF` for sync processing on EPN, `NONE` otherwise) : Signals that these detectors have processing on the FLP enabled. The corresponding steps are thus inactive in the EPN epl-workflow, and the raw-proxy is configured to receive the FLP-processed data instead of the raw data in that case. +- `WORKFLOW_DETECTORS_RECO` (default `ALL`) : Comma-separated list of detectors for which to run reconstruction. +- `WORKFLOW_DETECTORS_CTF` (default `ALL`) : Comma-separated list of detectors to include in CTF. 
+- `WORKFLOW_DETECTORS_MATCHING` (default selected corresponding to default workflow for sync or async mode respectively) : Comma-separated list of matching / vertexing algorithms to run. Use `ALL` to enable all of them. Currently supported options (see LIST_OF_GLORECO in common/setenv.h): `ITSTPC`, `TPCTRD`, `ITSTPCTRD`, `TPCTOF`, `ITSTPCTOF`, `MFTMCH`, `PRIMVTX`, `SECVTX`. +- `WORKFLOW_EXTRA_PROCESSING_STEPS` Enable additional processing steps not in the preset for the SYNC / ASYNC mode. Possible values are: `MID_RECO` `MCH_RECO` `MFT_RECO` `FDD_RECO` `FV0_RECO` `ZDC_RECO` `ENTROPY_ENCODER` `MATCH_ITSTPC` `MATCH_TPCTRD` `MATCH_ITSTPCTRD` `MATCH_TPCTOF` `MATCH_ITSTPCTOF` `MATCH_MFTMCH` `MATCH_MFTMCH` `MATCH_PRIMVTX` `MATCH_SECVTX`. (Here `_RECO` means full async reconstruction, and can be used to enable it also in sync mode.) +- `WORKFLOW_PARAMETERS` (default `NONE`) : Comma-separated list, enables additional features of the workflow. Currently the following features are available: + - `GPU` : Performs the TPC processing on the GPU, otherwise everything is processed on the CPU. + - `CTF` : Write the CTF to disk (CTF creation is always enabled, but if this parameter is missing, it is not stored). + - `EVENT_DISPLAY` : Enable JSON export for event display. + - `QC` : Enable QC. + - `CALIB` : Enable calibration (not yet working!) +- `RECO_NUM_NODES_OVERRIDE` (default `0`) : Overrides the number of EPN nodes used for the reconstruction (`0` or empty means default). +- `MULTIPLICITY_FACTOR_RAWDECODERS` (default `1`) : Scales the number of parallel processes used for raw decoding by this factor. +- `MULTIPLICITY_FACTOR_CTFENCODERS` (default `1`) : Scales the number of parallel processes used for CTF encoding by this factor. +- `MULTIPLICITY_FACTOR_REST` (default `1`) : Scales the number of other reconstruction processes by this factor. +- `QC_JSON_EXTRA` (default `NONE`) : extra QC jsons to add (if does not fit to those defined in WORKFLOW_DETECTORS_QC & (WORKFLOW_DETECTORS | WORKFLOW_DETECTORS_MATCHING) +Most of these settings are configurable in the AliECS GUI. But some of the uncommon settings (`WORKFLOW_DETECTORS_FLP_PROCESSING`, `WORKFLOW_DETECTORS_CTF`, `WORKFLOW_DETECTORS_RECO`, `WORKFLOW_DETECTORS_MATCHING`, `WORKFLOW_EXTRA_PROCESSING_STEPS`, advanced `MULTIPLICITY_FACTOR` settings) can only be set via the "Additional environment variables field" in the GUI using bash syntax, e.g. `WORKFLOW_DETECTORS_FLP_PROCESSING=TPC`. + +# Process multiplicity factors +- The production workflow has internally a default value how many instances of a process to run in parallel (which was tuned for Pb-Pb processing) +- Some critical processes for synchronous pp processing are automatically scaled by the inverse of the number of nodes, i.e. the multiplicity is increased by a factor of 2 if 125 instead of 250 nodes are used, to enable the processing using only a subset of the nodes. +- Factors can be provided externally to scale the multiplicity of processes further. All these factors are multiplied. + - One factor can be provided based on the type of the processes: raw decoder (`MULTIPLICITY_FACTOR_RAWDECODERS`), CTF encoder (`MULTIPLICITY_FACTOR_CTFENCODERS`), or other reconstruction process (`MULTIPLICITY_FACTOR_REST`) + - One factor can be provided per detector via `MULTIPLICITY_FACTOR_DETECTOR_[DET]` using the 3 character detector representation, or `MATCH` for the global matching and vertexing workflows. + - One factor can be provided per process via `MULTIPLICITY_FACTOR_PROCESS_[PROCESS_NAME]`. 
In the process name, dashes `-` must be replaced by underscores `_`. +- The multiplicity of an individual process can be overridden externally (this is an override, no scaling factor) by using `MULTIPLICITY_PROCESS_[PROCESS_NAME]`. In the process name, dashes `-` must be replaced by underscores `_`. +- For example, creating the workflow with `MULTIPLICITY_FACTOR_RAWDECODERS=2 MULTIPLICITY_FACTOR_DETECTOR_ITS=3 MULTIPLICITY_FACTOR_PROCESS_mft_stf_decoder=5` will scale the number of ITS raw decoders by 6, of other ITS processes by 3, of other raw decoders by 2, and will run exactly 5 `mft-stf-decoder` processes. + +# Additional custom control variables +For user modification of the workflow settings, the folloing *EXTRA* environment variables exist: +- `ARGS_ALL_EXTRA` : Extra command line options added to all workflows +- `ALL_EXTRA_CONFIG` : Extra config key values added to all workflows +- `GPU_EXTRA_CONFIG` : Extra options added to the configKeyValues of the GPU workflow +- `ARGS_EXTRA_PROCESS_[WORKFLOW_NAME]` : Extra command line arguments for the workflow binary `WORKFLOW_NAME`. Dashes `-` must be replaced by underscores `_` in the name! E.g. `ARGS_EXTRA_PROCESS_o2_tof_reco_workflow='--output-type clusters'` +- `CONFIG_EXTRA_PROCESS_[WORKFLOW_NAME]` : Extra `--configKeyValues` arguments for the workflow binary `WORKFLOW_NAME`. Dashes `-` must be replaced by underscores `_` in the name! E.g. `CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow='GPU_proc.debugLevel=1;GPU_proc.ompKernels=0;'` + +**IMPORTANT:** When providing additional environment variables please always use single quotes `'` instead of double quotes `"`, because otherwise there can be issues with whitespaces. E.g. `ARGS_EXTRA_PROCESS_o2_eve_display='--filter-time-min 0 --filter-time-max 120'` does work while `ARGS_EXTRA_PROCESS_o2_eve_display="--filter-time-min 0 --filter-time-max 120"` does not. + +In case the CTF dictionaries were created from the data drastically different from the one being compressed, the default memory allocation for the CTF buffer might be insufficient. One can apply scaling factor to the buffer size estimate (default=1.5) of particular detector by defining variable e.g. `TPC_ENC_MEMFACT=3.5` + +# File input for ctf-reader / raw-tf-reader +- The variable `$INPUT_FILE_LIST` can be a comma-seperated list of files, or a file with a file-list of CTFs/raw TFs. +- The variable `$INPUT_FILE_COPY_CMD` can provide a custom copy command (default is to fetch the files from EOS). + +# Remarks on QC +The JSON files for the individual detectors are merged into one JSON file, which is cached during the run on the shared EPN home folder. +The default JSON file per detector is defined in `qc-workflow.sh`. +JSONs per detector can be overridden by exporting `QC_JSON_[DETECTOR_NAME]`, e.g. `QC_JSON_TPC`, when creating the workflow. +The global section of the merged qc JSON config is taken from qc-sync/qc-global.json diff --git a/prodtests/full-system-test/documentation/env-variables.md b/prodtests/full-system-test/documentation/env-variables.md new file mode 100644 index 0000000000000..b93622c0a0f94 --- /dev/null +++ b/prodtests/full-system-test/documentation/env-variables.md @@ -0,0 +1,51 @@ +The `setenv-sh` script sets the following environment options +* `NTIMEFRAMES`: Number of time frames to process. +* `TFDELAY`: Delay in seconds between publishing time frames (1 / rate). +* `NGPUS`: Number of GPUs to use, data distributed round-robin. +* `GPUTYPE`: GPU Tracking backend to use, can be CPU / CUDA / HIP / OCL / OCL2. 
+* `SHMSIZE`: Size of the global shared memory segment.
* `DDSHMSIZE`: Size of the shared memory unmanaged region for DataDistribution input.
* `GPUMEMSIZE`: Size of the allocated GPU memory (if `GPUTYPE != CPU`).
* `HOSTMEMSIZE`: Size of the allocated host memory for GPU reconstruction (0 = default).
  * For `GPUTYPE = CPU`: TPC tracking scratch memory size. (Default 0 -> dynamic allocation.)
  * Otherwise: size of the page-locked host memory for GPU processing. (Default 0 -> 1 GB.)
* `CREATECTFDICT`: Create the CTF dictionary.
  * 0: Read `ctf_dictionary.root` as input.
  * 1: Create `ctf_dictionary.root`. Note that this was already done automatically if the raw data was simulated with `full_system_test.sh`.
* `SAVECTF`: Save the CTF to a ROOT file.
* `SYNCMODE`: Run only the reconstruction steps of the synchronous reconstruction.
  * 0: Runs all reconstruction steps, of sync and of async reconstruction, using raw data input.
  * 1: Runs only the steps of synchronous reconstruction, using raw data input.
  * Note that there is no `ASYNCMODE`; instead the `CTFINPUT` option already enforces asynchronous processing.
* `NUMAGPUIDS`: NUMAID-aware GPU id selection. Needed for the full EPN configuration with 8 GPUs, 2 NUMA domains, 4 GPUs per domain.
  In this configuration, 2 instances of `dpl-workflow.sh` must run in parallel.
  To be used in combination with `NUMAID` to select the id per workflow.
  `start_tmux.sh` will set up these variables automatically.
* `NUMAID`: SHM segment id to use for shipping data, as well as the set of GPUs to use (use `0` / `1` for 2 NUMA domains, 0 = GPUs `0` to `NGPUS - 1`, 1 = GPUs `NGPUS` to `2 * NGPUS - 1`).
* `EXTINPUT`: Receive input from a raw FMQ channel instead of running o2-raw-file-reader.
  * 0: `dpl-workflow.sh` can run as a standalone benchmark, and will read the input itself.
  * 1: To be used in combination with either `datadistribution.sh` or `raw-reader.sh`, or with another DataDistribution instance.
* `CTFINPUT`: Read input from a CTF ROOT file. This option is incompatible with `EXTINPUT=1`. The CTF ROOT file can be stored via `SAVECTF=1`.
* `NHBPERTF`: Time frame length (in HBFs).
* `GLOBALDPLOPT`: Global DPL workflow options appended to o2-dpl-run.
* `EPNPIPELINES`: Set default EPN pipeline multiplicities.
  Normally the workflow will start 1 DPL device per processor.
  For some of the CPU parts this is insufficient to keep up with the GPU processing rate, e.g. one ITS-TPC matcher on the CPU is slower than the TPC tracking on multiple GPUs.
  This option adds some multiplicities for CPU processes using DPL's pipeline feature.
  The settings were tuned for EPN processing with 4 GPUs (i.e. the default multiplicities are per NUMA domain).
  The multiplicities are scaled with the `NGPUS` setting, i.e. with 1 GPU only 1/4th are applied.
  You can pass a value different from 1, and then it will be applied as a factor on top of the default multiplicities.
  It is auto-selected by `start_tmux.sh`.
* `SEVERITY`: Log verbosity (e.g. info or error, default: info).
* `INFOLOGGER_SEVERITY`: Minimum severity for messages sent to InfoLogger (default: `$SEVERITY`).
* `SHMTHROW`: Throw an exception when running out of SHM memory.
  It is suggested to leave this enabled (the default) for tests on a laptop, to get an actual error when the processing runs out of memory.
  It is disabled in `start_tmux.sh`, to avoid breaking the processing while there is a chance that another process might free memory and we can continue.
* `NORATELOG`: Disable FairMQ rate logging.
+* `INRAWCHANNAME`: FairMQ channel name used by the raw proxy, must match the name used by DataDistribution. +* `WORKFLOWMODE`: run (run the workflow (default)), print (print the command to stdout), dds (create partial DDS topology) +* `FILEWORKDIR`: directory for all input / output files. E.g. grp / geometry / dictionaries etc. are read from here, and dictionaries / ctf / etc. are written to there. + Some files have more fine grained control via other environment variables (e.g. to store the CTF to somewhere else). Such variables are initialized to `$FILEWORKDIR` by default but can be overridden. +* `EPNSYNCMODE`: Specify that this is a workflow running on the EPN for synchronous processing, e.g. logging goes to InfoLogger, DPL metrics to to the AliECS monitoring, etc. +* `BEAMTYPE`: Beam type, must be PbPb, pp, pPb, cosmic, technical. +* `IS_SIMULATED_DATA` : 1 for MC data, 0 for RAW data. diff --git a/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md b/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md new file mode 100644 index 0000000000000..0c4637ece0920 --- /dev/null +++ b/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md @@ -0,0 +1,33 @@ +This is a quick summary how to run the full system test (FST) as stress test on the EPN. (For the full FST documentation, see https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md and https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test.md) + +# Preparing the data set +- I usually try to keep an up-to-date data set that can be used in `/home/drohr/alitest/tmp-fst*`. The folder with the highest number is the latest dataset. However, data formats are still evolving, and it requires rerunning the simulation regularly. I.e. please try my latest data set, if it doesn't work, please generate a new one as described below. +- Short overview how to generate a FST Pb-Pb 128 orbit data set: + - The O2 binaries installed on the EPN via RPMs use the `o2-dataflow` defaults and cannot run the simulation, and also they lack readout. Thus you need to build `O2PDPSuite` and `Readout` (the version matching the O2PDPSuite RPM you want to use for running the test) yourself with `alibuild` on an EPN: `aliBuild --defaults o2 build O2PDPSuite Readout --jobs 32 --debug`. The flag `--jobs` configures the number of parallel jobs and can be changed. + - Enter the O2PDPSuite environment either vie `alienv enter O2PDPSuite/latest Readout/latest`. + - Go to an empty directory. + - Run the FST simulation via: `NEvents=650 NEventsQED=10000 SHMSIZE=128000000000 TPCTRACKERSCRATCHMEMORY=40000000000 SPLITTRDDIGI=0 GENERATE_ITSMFT_DICTIONARIES=1 $O2_ROOT/prodtests/full_system_test.sh` + - Get a current matbud.root (e.g. from here https://alice.its.cern.ch/jira/browse/O2-2288) and place it in that folder. + - Create a timeframe file from the raw files: `$O2_ROOT/prodtests/full-system-test/convert-raw-to-tf-file.sh`. + - Prepare the ramdisk folder: `mv raw/timeframe raw/timeframe-org; mkdir raw/timeframe-tmpfs; ln -s timeframe-tmpfs raw/timeframe` + +# Running the full system test +- Enter the environment! On an EPN do `module load O2PDPSuite` (this will load the latest O2 software installed on that EPN). +- Go into the folder with the data set (you might need to create one, see above). 
+- Prepare the ramdisk with the data: `sudo mount -t tmpfs tmpfs raw/timeframe-tmpfs; sudo cp raw/timeframe-org/* raw/timeframe` + - (NOTE that the ramdisk might already be present from previous tests, or in a different folder. Check the mounted tmpfs filesystems (`mount | grep tmpfs`), and don't mount multiple of them since memory is critical!) + - If you do not have root permissions and cannot create a ramdisk, the test will also work without. In that case you should decrease the publishing rate below to `TFDELAY=5`. +- Make sure disk caches are cleared: as ROOT do: `echo 1 > /proc/sys/vm/drop_caches` +- In order to run the Full System Test, the workflow must be able to access the CCDB. Normally, if you run as user, you must make sure to have an alien token present. On the EPN, one can use the EPN-internal CCDB server instead, which does not require alien access. If you use the `start-tmux.sh`, the env variables are set automatically to access the EPN-internal CCDB server. +- Start the FST with 2 NUMA domains: `TFDELAY=2.5 NTIMEFRAMES=1000000 $O2_ROOT/prodtests/full-system-test/start_tmux.sh dd` + +This will start a tmux session with 3 shells, the upper 2 shells are the 2 DPL workflows, one per NUMA domain, for the processing. The lower shell is the input with DataDistribution's StfBuilder. Leave it running and check that the StfBuilder doesn't complain that its buffer is full. Then the EPN can sustain the rate. + +# **NOTE** +- Attached to this ticket is a screenshot of how the console should look like: + - The DD console (on the bottom) should not show warnings about full buffers. + - The other 2 consoles (1 per NUMA domain) should show the processing times per TF for the GPU reconstruction: + ``` + [2974450:gpu-reconstruction_t3]: [10:50:38][INFO] GPU Reoncstruction time for this TF 26.77 s (cpu), 17.8823 s (wall) + ``` + This should be 17 to 18 seconds, and you should see it for all 4 GPUs on both NUMA domains (`reconstruction_t0` to `reconstruction_t3`) diff --git a/prodtests/full-system-test/documentation/full-system-test-setup.md b/prodtests/full-system-test/documentation/full-system-test-setup.md new file mode 100644 index 0000000000000..82ef9b7d0c74f --- /dev/null +++ b/prodtests/full-system-test/documentation/full-system-test-setup.md @@ -0,0 +1,124 @@ +This is some documentation for the full system test setup. + +If you just want to test a small dataset, you can skip the following steps, and jusddt skip to the end, where you will find a download with a prepared data set! + +# Requirements: +- The FST needs a lot of memory. Please check the comments below, make sure your system has enough memory, and change the memory sizes in the command lines accordingly. +- ulimits: The FST needs large ulimits for memory and virtual memory (`ulimit -m` / `ulimit -v`). This is usually no problem since they are usually unlimited. If GPUs are used, the FST also needs `ulimit -l` (for locked memory) unlimited, which is usualy not the system default. Finally, if data is replayed from raw files (not with DataDistribution), the FST will open many files, and `ulimit -n` should be at least 4096. Note that in most distributions the hard ulimits are configured in `/etc/security/limits.conf`. +- The FST needs to access the CCDB. For this, you should run the FST with an alien token. Alternatively, if you are on the EPN you can use the EPN-internal CCDB server by exporting `ALL_EXTRA_CONFIG="NameConf.mCCDBServer=http://o2-ccdb.internal;"` and by setting the DPL CCDB backend on the command line. 
If you are using `start-tmux.sh` for the 8 GPU FST, the CCDB backends are set automatically. + +# Creating the raw data and run the FST: +1. First some remarks on the number of events and the memory size: + - Generation (simulation) of the full time frame with ~550 collisions will need ~256 GB, processing will take less. + - Due to the sampling of the bunch crossings, the exact number of collissions that will be in the TF is not clear, thus one should simulate 600 collisions to generate a full 128 orbit TF. + - The default shared memory size is 2 GB, and must be increased significantly for large time frames, 128 GB is sufficient for 128 orbit TF, 160 GB is needed if MC labels are present in addition. + - The GPU memory allocation should be set to ~13 GB for 70 orbits and 21 GB for 128 orbits. + - I'd suggest to do a first small test with 1-5 events to check the machinery, 100 events is already a good size which should not exhaust the memory, I'd go to 600 only after 100 works. +1. Compile O2 with GPU support, in addition you need O2sim, DataDistribution, and Readout (latest versions from alidist will do). + GPUs for O2 should be auto-detected, but you can set the environment variables ALIBUILD_ENABLE_CUDA / ALIBUILD_ENABLE_HIP to enforce it (and get a failure when detection fails). Look for CMake log messages "Building GPUTracking with CUDA support" (etc) to verify. + For more information, see https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build.md +1. Optionally place some binary configuration files in the simulation folder. Default objects will be used if no such files are placed. There are instructions at the end of this post how to generate these files. (Currently, these files are: matbud.root, ITSdictionary.bin, ctf_dictionary.root, tpctransform.root, dedxsplines.root, and tpcpadgaincalib.root) +1. Load the O2sim environment (`alienv enter O2sim/latest`) and run the following full system test script for a full simulation and digits to raw conversion (this will already include 1 CPU reconstruction run): + ``` + NEvents=600 NEventsQED=35000 SHMSIZE=128000000000 TPCTRACKERSCRATCHMEMORY=30000000000 $O2_ROOT/prodtests/full_system_test.sh + ``` + - This create a full 128 orbit TF with 550 collisions and uses 35000 interactions for the QED background + - It uses 128 GB of shared memory + - The scratch memory size for the TPC reconstruction is set to 24 GB (Note, this is the CPU-equivalent of the GPU memory size, since this phase will only run on the CPU). +1. Test of the workflow using the raw-file-reader: Run the so far largest workflow, The GPU and SHM memory sizes must be reasonably large (see above). + ``` + SHMSIZE=128000000000 NTIMEFRAMES=10 TFDELAY=100 GPUTYPE=CPU $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh + ``` + Note that This uses 128 GB of SHM, runs only on the CPU, and processes the time frame 10 times in a loop with 100 s delay between the publiushing. + - For a documentation of the options, see https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test.md + - For running on the GPU (4 GPUs with the HIP backend), please do + ``` + SHMSIZE=128000000000 NTIMEFRAMES=10 TFDELAY=10 GPUTYPE=HIP NGPUS=4 GPUMEMSIZE=22000000000 $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh + ``` +This will use 4 GPU with the HIP backend and allocate 22 GB of scratch memory on the GPU (should be sufficient for 128 orbit TF). You can change the GPU type as indicated in the linked README.md above, e.g. 
`GPUTYPE=CUDA NGPUS=1` for 1 CUDA GPU.
1. With this, the full chain is running inside O2 DPL. Next we are adding DataDistribution.
   1. Create the TF files as explained in the raw-data-simulation documentation (https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/raw-data-simulation.md). For convenience, there is a script that should do it automatically, from a shell that has loaded both DataDistribution and Readout: `$O2_ROOT/prodtests/full-system-test/convert-raw-to-tf-file.sh`.
   1. Enter the O2 environment, and run the following script (please adjust the variables as in the test before).
      ```
      EXTINPUT=1 SHMSIZE=128000000000 GPUTYPE=CPU $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh
      ```
      - As a first optional test without DataDistribution, we can use the RawReader to feed the data in the way DataDistribution does. Run the following script in a second shell within the O2 environment. (Please adjust the variables as noted above.)
      ```
      SHMSIZE=128000000000 NTIMEFRAMES=10 TFDELAY=100 $O2_ROOT/prodtests/full-system-test/raw-reader.sh
      ```
   1. In a second shell with DataDistribution, run the following script (adjust the 2 memory size variables as needed for your data, and set the `TF_DIR` variable to the folder where you recorded the time frame). Make sure you start this script ONLY AFTER the DPL workflow has fully started! There is no limit on the number of time frames; it will run in an endless loop.
      ```
      SHMSIZE=128000000000 DDSHMSIZE=32000 TFDELAY=100 $O2_ROOT/prodtests/full-system-test/datadistribution.sh
      ```
1. The full chain that will be running on the EPN farm is a bit more complicated. It consists of:
   - 2 instances of the dpl-workflow driving 4 GPUs each, one per NUMA domain.
   - 1 instance of DataDistribution feeding a shared input buffer.
   The following script runs the full system test in the 8 GPU EPN configuration using tmux with 3 sessions:
   ```
   TFDELAY=2.8457 NTIMEFRAMES=128 $O2_ROOT/prodtests/full-system-test/start_tmux.sh dd
   ```
   - Note that the number of GPUs / memory sizes are set automatically by `start_tmux.sh`.
   - This `TFDELAY` is the rate for processing 1/250th of 50 kHz Pb-Pb with average time frames. Since the occupancy of your simulated time frame will fluctuate, it is suggested to scale the `TFDELAY` linearly with the number of TPC clusters (shown in the console output of the dpl-workflow), with the average corresponding to 2.8457 s being 313028012 clusters.
   - For testing, you can alternatively use the raw reader instead of DataDistribution as input in the `start_tmux.sh` script by passing `rr` instead of `dd`.
1. On the EPN, an SHM management tool owns the memory in the background and keeps it locked. This is done in order to speed up the startup. This behavior can be reproduced in the full system test by setting the env variable `SHM_MANAGER_SHMID` to the shm id to be used (it must be set for both `start_tmux.sh` and `shm-tool.sh`; you can just use `SHM_MANAGER_SHMID=1` for a test) and running the following in a separate shell before starting `start_tmux.sh`:
   ```
   SHM_MANAGER_SHMID=1 SHMSIZE=$((128<<30)) DDSHMSIZE=$((128<<10)) $O2_ROOT/prodtests/full-system-test/shm-tool.sh
   SHM_MANAGER_SHMID=1 TFDELAY=2.8457 NTIMEFRAMES=8 $O2_ROOT/prodtests/full-system-test/start_tmux.sh dd
   ```

---

# Remarks for running with distortions:
1.
To run the digitization with distortions, add the following to the digitizer command (using the map `inputSCDensity3D_8000_0` from the file `../InputSCDensityHistograms_8000events.root`):
   ```
   --distortionType 2 --initialSpaceChargeDensity=../InputSCDensityHistograms_8000events.root,inputSCDensity3D_8000_0
   ```
1. To rerun the digitization with the same BC sampling for the collisions, add
   ```
   --incontext collisioncontext.root
   ```
1. To create the TPC fast transform map from the SCD object, run:
   ```
   root -l -q -b ~/alice/O2/Detectors/TPC/reconstruction/macro/createTPCSpaceChargeCorrection.C++'("../InputSCDensityHistograms_8000events.root", "inputSCDensity3D_8000_0")'
   ```
1. In order to use the fast transform map for TPC tracking, add to the tpc-reco-workflow:
   ```
   --configKeyValues "GPU_global.transformationFile=tpctransform.root"
   ```

---

# Remarks for creating other prerequisite binary files:
1. To create the CTF dictionary: run the full system test workflow once with the env variable CREATECTFDICT=1 set:
   ```
   CREATECTFDICT=1 $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh
   ```
1. Create the ITS pattern dictionary:
   ```
   o2-its-reco-workflow --trackerCA --disable-mc --configKeyValues "fastMultConfig.cutMultClusLow=30000;fastMultConfig.cutMultClusHigh=2000000;fastMultConfig.cutMultVtxHigh=500"
   root -b -q ~/alice/O2/Detectors/ITSMFT/ITS/macros/test/CheckTopologies.C++
   ```
   - Note that the ITS dictionary used for raw generation and for reconstruction must be the same. I.e., if you change this, you have to either restart from scratch with the new dictionary file or rerun the ITS raw generation part of `$O2_ROOT/prodtests/full_system_test.sh`.
1. To create the material lookup table:
   ```
   root -l -q -b $O2_ROOT/Detectors/Base/test/buildMatBudLUT.C
   ```
1. Missing here: dedxsplines.root, tpcpadgaincalib.root.

---

# Measuring startup time:
- In order to measure the time for each individual GPU memory registration step, please add `CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow="GPU_global.benchmarkMemoryRegistration=1;"`. This should show you 2 times of ~2 seconds per GPU process for the 2 large segments (DD and the global segment; it could also report some additional smaller segments, and only 1 segment in case you don't use the readout proxy).
- In order to measure the total startup time, you can use the `start_tmux.sh` script with the option `FST_BENCHMARK_STARTUP=1` (a combined invocation sketch is given at the end of this page). It will print 2 timestamps at the beginning for both DPL chains: the first is when it starts the workflow JSON generation, the second is after the JSON generation, when the actual workflow is started. For the process startup time, you have to take the difference from that time until the time when the last process has reached the READY state. (Note that this should be done with the `$O2_ROOT/prodtests/full-system-test/shm-tool.sh` as instructed above.)
   ```
   Fri Jan 28 11:25:48 CET 2022
   Fri Jan 28 11:25:56 CET 2022
   [...]
   [1456583:gpu-reconstruction_t0]: [11:26:18][INFO] fair::mq::Device running...
   ```
   - This corresponds to a JSON creation time of 8 seconds (which usually does not count towards the startup time, since it is cached) and a process startup time of 22 seconds.

---

# Other remarks:
1. To run with low B-field, add to o2-sim:
   ```
   --field -2
   ```
1. To create a sample of multiple TF files for StfBuilder, use the script `$O2_ROOT/prodtests/full-system-test/generate_timeframe_files.sh`.
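For convenience, the two startup-measurement steps above can be combined. The following is only a sketch using the example values already given on this page (memory sizes, `TFDELAY`, `NTIMEFRAMES`); it runs the SHM management tool in one shell and the benchmarked tmux session in another:
```
# Shell 1: pre-allocate and lock the SHM segments, as done on the EPN
SHM_MANAGER_SHMID=1 SHMSIZE=$((128<<30)) DDSHMSIZE=$((128<<10)) $O2_ROOT/prodtests/full-system-test/shm-tool.sh

# Shell 2: start the FST with startup benchmarking enabled; take the difference between the
# second printed timestamp and the time when the last process reaches the READY state
SHM_MANAGER_SHMID=1 FST_BENCHMARK_STARTUP=1 TFDELAY=2.8457 NTIMEFRAMES=8 $O2_ROOT/prodtests/full-system-test/start_tmux.sh dd
```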
diff --git a/prodtests/full-system-test/README.md b/prodtests/full-system-test/documentation/full-system-test.md similarity index 95% rename from prodtests/full-system-test/README.md rename to prodtests/full-system-test/documentation/full-system-test.md index a52dfbc5d1203..80cc08baa2255 100644 --- a/prodtests/full-system-test/README.md +++ b/prodtests/full-system-test/documentation/full-system-test.md @@ -10,7 +10,7 @@ The full system test consists of 2 parts (detailed below): The relevant scripts are `/prodtests/full_system_test.sh` and all scripts in `/prodtests/full-system-test`. Note that by default the `full_system_test.sh` script will do both, run the generation and then the sysc and the async workflow. -This is only a quickstart guide, for more information see https://alice.its.cern.ch/jira/browse/O2-1492. +This is only a quickstart guide, for more information see https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md. In order to run the full system test, you need to run in the O2sim environment (`alienv enter O2sim/latest`): ``` @@ -50,7 +50,7 @@ The generation part (in `prodtests/full_system_test.sh` runs the following steps The `prodtests/full_system_test.sh` uses `Utilities/Tools/jobutils.sh` for running the jobs, which creates a log file for each step, and which will automatically skip steps that have already succeeded if the test is rerun in the current folder. I.e. if you break the FST or it failed at some point, you can rerun the same command line and it will continue after the last successful step. See `Utilities/Tools/jobutils.sh` for details. Note that by default, the generation produces raw files, which can be consumed by the `raw-file-reader-workflow` and by `o2-readout-exe`. -The files can be converted into timeframes files readable by the StfBuilder as described in https://alice.its.cern.ch/jira/browse/O2-1492. +The files can be converted into timeframes files readable by the StfBuilder as described in https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md. ## Full system test DPL-workflow configuration and scripts @@ -80,7 +80,7 @@ The `dpl-workflow.sh` can run both the synchronous and the asynchronous workflow All settings are configured via environment variables. The default settings (if no env variable is exported) are defined in `setenv.sh` which is sourced by all other scripts. (Please note that `start_tmux.sh` overrides a couple of options with EPN defaults). -The environment variables are documented here: https://github.com/AliceO2Group/O2DPG/blob/master/DATA/common/README.md +The environment variables are documented here: https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-env-variables.md ## Files produced / required by the full system test diff --git a/prodtests/full-system-test/documentation/raw-data-simulation.md b/prodtests/full-system-test/documentation/raw-data-simulation.md new file mode 100644 index 0000000000000..fbf6ace7d6934 --- /dev/null +++ b/prodtests/full-system-test/documentation/raw-data-simulation.md @@ -0,0 +1,43 @@ +This procedure will create (S)TF files from raw data prepared as described in the main ticket. The data must be using RDHv6. +Create configuration for the readout.exe with all input files we want in the TF. This will create rdo_TF.cfg file. + +  +``` +ulimit -n 4096 # Make sure we can open sufficiently many files cd raw# ls raw: ITS TPC TOF ... 
+ +# copy gen_rdo_cfg.sh script attached here to the raw directory +# Run the script with number of HBF/TF and list directories you want to include in the TF + +~raw> ./gen_rdo_cfg.sh 128 TPC ITS TOF # ... others{code} +```  + +In a separate shell load a recent DataDistribution module and start StfBuilder to record the TF: +``` +export TF_PATH=$(pwd) +StfBuilder --id=stfb --detector-rdh=6 --detector-subspec=feeid --stand-alone --channel-config "name=readout,type=pull,method=connect,address=ipc:///tmp/readout-to-datadist-0,transport=shmem,rateLogging=1" --data-sink-dir=${TF_PATH} --data-sink-sidecar --data-sink-enable +``` + +Start the readout.exe (at least v1.4.3) using the generated config file. The dataflow will have a 10-20 seconds of delay, in order to have all input files loaded. +``` +ulimit -n 4096 # Make sure we can open sufficiently many files +~raw> readout.exe file:rdo_TF.cfg{code} +``` +  +Upon data transfer to StfBuilder, readout will print the stats, like: +``` +2020-06-23 18:07:59.003364 Last interval (1.00s): blocksRx=0, block rate=0.00, bytesRx=0, rate=0.000 b/s +2020-06-23 18:08:00.003382 Last interval (1.00s): blocksRx=2930, block rate=2930.00, bytesRx=1156508880, rate=9.252 Gb/s +2020-06-23 18:08:01.003384 Last interval (1.00s): blocksRx=0, block rate=0.00, bytesRx=0, rate=0.000 b/s{noformat} +``` + +StfBuilder will print one warning regarding the timeout on the last received TF. This can be ignored in this case. The log should look like : + +```  +{noformat}[2020-06-23 18:07:59.928][I] readout[0]: in: 1224 (1156.52 MB) out: 0 (0 MB) +[2020-06-23 18:08:01.733][W] READOUT INTERFACE: finishing STF on a timeout. stf_id=1 size=1156508880 +[2020-06-23 18:08:02.607][I] Sending STF out. stf_id=1 channel=standalone-chan[0] stf_size=1156508880 unique_equipments=1224{noformat} +``` + +After this, both processes can be closed with Ctrl-C. The resulting TFs are stored in a new directory under TF_PATH (the name of the dir is the time of running) + +  From 20090c107b73afe35733e865c3251214c33ba0f7 Mon Sep 17 00:00:00 2001 From: David Rohr Date: Thu, 24 Apr 2025 22:55:03 +0200 Subject: [PATCH 2/2] GPU: Add documentation --- GPU/documentation/README.md | 0 GPU/documentation/build-O2.md | 62 +++++++++++++++++++ GPU/documentation/build-standalone.md | 86 +++++++++++++++++++++++++++ 3 files changed, 148 insertions(+) create mode 100644 GPU/documentation/README.md create mode 100644 GPU/documentation/build-O2.md create mode 100644 GPU/documentation/build-standalone.md diff --git a/GPU/documentation/README.md b/GPU/documentation/README.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/GPU/documentation/build-O2.md b/GPU/documentation/build-O2.md new file mode 100644 index 0000000000000..809d1fe0d5439 --- /dev/null +++ b/GPU/documentation/build-O2.md @@ -0,0 +1,62 @@ +This ticket will serve as documentation how to enable which GPU features and collect related issues. + +So far, the following features exist: + * GPU Tracking with CUDA + * GPU Tracking with HIP + * GPU Tracking with OpenCL (>= 2.1) + * OpenGL visualization of the tracking + * ITS GPU tracking + +GPU support should be detected and enabled automatically. +If you just want to reproduce the GPU build locally without running it, it might be easiest to use the GPU CI container (see below). +The provisioning script of the container also demonstrates which patches need to be applied such that everything works correctly. 
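As a quick reference, the following sketch shows one way to reproduce the GPU build inside the CI container described below; the mount point, work directory, and job count are arbitrary examples, only the image name and the `ALIBUILD_O2_FORCE_GPU` variable are taken from the container section:
```
# Enter the GPU CI container; it exports ALIBUILD_O2_FORCE_GPU, which force-enables all GPU backends
docker run -it --rm -v $HOME/alice:/alice alisw/slc8-gpu-builder /bin/bash

# Inside the container: build O2 with aliBuild and check the CMake log for messages like
# "Building GPUTracking with CUDA support" / "Building GPUTracking with HIP support"
cd /alice
aliBuild build --defaults o2 O2 --jobs 16
```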
+ +*GPU Tracking with CUDA* + * The CMake option -DENABLE_CUDA=ON/OFF/AUTO steers whether CUDA is forced enabled / unconditionally disabled / auto-detected. + * The CMake option -DCUDA_COMPUTETARGET= fixes a GPU target, e.g. 61 for PASCAL or 75 for Turing (if unset, it compiles for the lowest supported architecture) + * CUDA is detected via the CMake language feature, so essentially nvcc must be in the Path. + * We require CUDA version >= 11.2 + * CMake will report "Building GPUTracking with CUDA support" when enabled. + +*GPU Tracking with HIP* + * HIP and HCC must be installed, and CMake must be able to detect HIP via find_package(hip). + * If HIP and HCC are not installed to /opt/rocm, the environment variables $HIP_PATH and $HCC_HOME must point to the installation directories. + * HIP from ROCm >= 4.0 is required. + * The CMake option -DHIP_AMDGPUTARGET= forces a GPU target, e.g. gfx906 for Radeon VII (if unset, it auto-detects the GPU). + * CMake will report "Building GPUTracking with HIP support" when enabled. + * It may be that some patches must be applied to ROCm after the installation. You find the details in the provisioning script of the GPU CI container below. + +*GPU Tracking with OpenCL (Needs Clang >= 18 for compilation)* + * Needs OpenCL library with version >= 2.1, detectable via CMake find_package(OpenCL). + * Needs the SPIR-V LLVM translator together with LLVM to create the SPIR-V binaries, also detectable via CMake. + +*OpenGL visualization of TPC tracking* + * Needs the following libraries (all detectable via CMake find_package): libOpenGL, libGLEW, libGLFW, libGLU. + * OpenGL must be at least version 4.5, but this is not detectable at CMake time. If the supported OpenGL version is below, the display is not/partially built, and not available at runtime. (Whether it is not or partially built depends on whether the maximum OpenGL version supported by GLEW or that of the system runtime in insufficient.) + * Note: If ROOT does not detect the system GLEW library, ROOT will install its own very outdated GLEW library, which will be insufficient for the display. Since the ROOT include path will come first in the order, this will prevent the display from being built. + * CMake will report "Building GPU Event Display" when enabled. + +*Vulkan visualization* + * similar to OpenCL visualization, but with Vulkan. + +*ITS GPU Tracking* + * So far supports only CUDA and HIP, support for OpenCL might come. + * The build is enabled when the "GPU Tracking with CUDA" (as explained above) detects CUDA, same for HIP. + * CMake will report "Building ITS CUDA tracker" when enabled, same for HIP. + +*Using the GPU CI container* + * Setting up everything locally might be somewhat time-consuming, instead you can use the GPU CI cdocker container. + * The docker images is `alisw/slc8-gpu-builder`. + * The container exports the `ALIBUILD_O2_FORCE_GPU` env variable, which force-enables all GPU builds. + * Note that it might not be possible out-of-the-box to run the GPU version from within the container. In case of HIP it should work when you forwards the necessary GPU devices in the container. For CUDA however, you would either need to (in addition to device forwarding) match the system CUDA driver and toolkit installation to the files present in the container, or you need to use the CUDA docker runtime, which is currently not installed in the container. + * There are currently some patches needed to install all the GPU backends in a proper way and together. 
Please refer to the container provisioning script https://github.com/alisw/docks/blob/master/slc9-gpu-builder/provision.sh. If you want to reproduce the installation locally, it is recommended to follow the steps from the script.

*Summary*

If you want to enforce the GPU builds on a system without GPU, please set the following CMake settings:
 * ENABLE_CUDA=ON
 * ENABLE_HIP=ON
 * ENABLE_OPENCL=ON
 * HIP_AMDGPUTARGET=gfx906;gfx908
 * CUDA_COMPUTETARGET=86 89
Alternatively you can set the environment variables ALIBUILD_ENABLE_CUDA and ALIBUILD_ENABLE_HIP to enforce building CUDA or HIP without modifying the alidist scripts.
diff --git a/GPU/documentation/build-standalone.md b/GPU/documentation/build-standalone.md
new file mode 100644
index 0000000000000..d4e9da5cd5bf3
--- /dev/null
+++ b/GPU/documentation/build-standalone.md
@@ -0,0 +1,86 @@
+This document describes how to build the O2 GPU TPC standalone benchmark (in its 2 build types), and how to run it.

The purpose of the standalone benchmark is to make the O2 GPU TPC reconstruction code available standalone. It provides
- external tests when people do not have / do not want to build O2, have no access to alien for CCDB, etc.
- fast standalone tests without running O2 workflows and without the CCDB overhead.
- faster build times than rebuilding O2 for development.

# Compiling

The standalone benchmark is built as part of O2, and it can also be built standalone.

As part of O2, it is available from the normal O2 build as the executable `o2-gpu-standalone-benchmark`; GPU support is available for all GPU types supported by the O2 build.

Building it as a standalone benchmark requires several dependencies, and provides more control over which features to enable / disable.
The dependencies can be taken from the system, or we can use alidist to build O2 and take the dependencies from there.

In order to do the latter, please execute:
```
cd ~/alice # or your alice folder
aliBuild build --defaults o2 O2
source O2/GPU/GPUTracking/Standalone/cmake/prepare.sh
```

Then, in order to compile the standalone tool, assuming it is located in ~/standalone and built in ~/standalone/build, please run:
```
mkdir -p ~/standalone/build
cd ~/standalone/build
cmake -DCMAKE_INSTALL_PREFIX=../ ~/alice/O2/GPU/GPUTracking/Standalone/
nano config.cmake # edit config file to enable / disable dependencies as needed. In case cmake failed, and you disabled the dependency, just rerun the above command.
make install -j32
```

You can edit certain build settings in `config.cmake`. Some of them are identical to the GPU build settings for O2, as described in O2-786.
In addition, there are plenty of settings to enable/disable the event display, the QA, and the usage of ROOT, FMT, and other libraries.

This will create the `ca` binary in `~/standalone`, which is basically the same as the `o2-gpu-standalone-benchmark`, but built outside of O2.

# Running

The following command lines will use `./ca`; in case you use the executable from the O2 build, please replace it by `o2-gpu-standalone-benchmark`.

You can get a list of command line options via `./ca --help` and `./ca --helpall`.

In order to run, you need a dataset. See the next section for how to create a dataset. Datasets are stored in `~/standalone/events`, and are identified by their folder names. The following commands assume a test dataset of name `o2-pbpb-100`.

To run on that data, the simplest command is `./ca -e o2-pbpb-100`.
This will automatically use a GPU if available, trying all backends, and otherwise fall back to the CPU.
You can force using the GPU or the CPU with `-g` and `-c`.
You can select the backend via `--gpuType CUDA|HIP|OCL|OCL2`, and inside the backend you can select the device number, if multiple devices exist, via `--gpuDevice i`.

The flag `--debug` (-2 to 6) enables increasingly extensive debug output, and `--debug 6` stores full data dumps of all intermediate steps to files.
`--debug 1` or higher has a performance impact, since it adds serialization points for debugging. For timing individual kernels, `--debug 1` prints timing information for all kernels.
An example command line would e.g. be
```
./ca -e o2-pbpb-100 -g --gpuType CUDA --gpuDevice 0 --debug 1
```

Some other noteworthy options are `--display` to run the GPU event display, `--qa` to run a QA task on MC data, `--runs` and `--runs2` to run multiple iterations of the benchmark, `--printSettings` to print all the settings that were used, `--memoryStat` to print memory statistics, `--sync` to run with settings for online reco, `--syncAsync` to run online reco first and then offline reco on the produced TPC CTF data, `--setO2Settings` to use some defaults as they are in O2 rather than in the standalone version, `--PROCdoublePipeline` to enable the double-threaded pipeline for best performance (works only with multiple iterations, and not in async mode), and `--RTCenable` to enable the run time compilation improvements (check also `--RTCcacheOutput`).
An example for a benchmark in online mode would be:
```
./ca -e o2-pbpb-100 -g --sync --setO2Settings --PROCdoublePipeline --RTCenable --runs 10
```

# Generating a dataset

The standalone benchmark supports running on Run 2 data exported from AliRoot, or on Run 3 data from O2. This document covers only the O2 case.
In O2, `o2-tpc-reco-workflow` and `o2-gpu-reco-workflow` can dump event data with the `--configKeyValues` setting `GPU_global.dump=1;`.
This will dump the event data to the local folder; all dumped files have a `.dump` file extension. If multiple TFs/events are processed, there will be multiple `event.i.dump` files. In order to create a standalone dataset out of these, just copy all the `.dump` files to a subfolder in `~/standalone/events/[FOLDERNAME]`.

Data can be dumped from raw data or from MC data, e.g. generated by the Full System Test. In the case of MC data, the MC labels are dumped as well, so that they can be used in the `./ca --qa` mode.

To get a dump from simulated data, please run e.g. the FST simulation as described in O2-2633.
A simple run such as
```
DISABLE_PROCESSING=1 NEvents=5 NEventsQED=100 SHMSIZE=16000000000 $O2_ROOT/prodtests/full_system_test.sh
```
should be enough.

Afterwards run the following command to dump the data:
```
SYNCMODE=1 CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow="GPU_global.dump=1;" WORKFLOW_DETECTORS=TPC SHMSIZE=16000000000 $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh
```

To dump standalone data from CTF raw data in `myctf.root`, you can use the same script, e.g.:
```
CTFINPUT=1 INPUT_FILE_LIST=myctf.root CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow="GPU_global.dump=1;" WORKFLOW_DETECTORS=TPC SHMSIZE=16000000000 $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh
```
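Putting the dataset-creation steps together, a minimal sketch could look as follows (the dataset name `my-dataset` is just an example, and the dump files are assumed to be in the current directory):
```
# Collect the dumped files into a new standalone dataset
mkdir -p ~/standalone/events/my-dataset
cp *.dump ~/standalone/events/my-dataset/

# Run the benchmark on it (and the QA, if MC labels were dumped)
cd ~/standalone
./ca -e my-dataset -g --qa
```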