Description
My question is: is it possible to reduce the CPU time required by cudaStreamSynchronize?
Context:
My goal is to use the nvComp library in such a way that compression occupies as little CPU time as possible.
For that reason, I measured the CPU time with std::chrono, immediately before the data transfer from host to device
and immediately after retrieving the compressed data on the host. To retrieve the compressed data on the host, in order to
perform CPU-side operations (save the compressed data to disk, send it over the network, ...), I have to call cudaStreamSynchronize(_cudaStream), where _cudaStream is the stream used by all cudaMemcpyAsync operations and by the nvcompBatchedZstdCompressAsync (or LZ4, Deflate and so on) compression method. All operations, i.e. host-to-device transfer, compression, and device-to-host transfer, are issued with asynchronous methods on _cudaStream. The problem is that the time I measure after cudaStreamSynchronize returns seems high to me, and for my purposes it appears to be the bottleneck.
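The measurement described above can be sketched roughly as follows. This is a minimal, CUDA-free illustration and not the actual nvComp code: `fakeStreamWork` and `timeBlockingWait` are hypothetical names, a `std::async` task stands in for the work queued on `_cudaStream`, and `pending.wait()` stands in for `cudaStreamSynchronize`. It only shows that an interval taken with `std::chrono::steady_clock` around a blocking wait is elapsed wall-clock time, so it includes however long the queued work takes to finish.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the work queued on a stream
// (copy in, compress, copy out). It sleeps instead of touching a GPU.
void fakeStreamWork(std::chrono::milliseconds duration) {
    std::this_thread::sleep_for(duration);
}

// Starts the stand-in work asynchronously, blocks until it has finished,
// and returns the elapsed wall-clock time in milliseconds.
double timeBlockingWait(std::chrono::milliseconds workDuration) {
    const auto start = Clock::now();  // taken before the "host to device" step

    // Returns immediately, like the asynchronous calls issued on a stream.
    std::future<void> pending =
        std::async(std::launch::async, fakeStreamWork, workDuration);

    pending.wait();  // plays the role of cudaStreamSynchronize in this sketch

    const auto stop = Clock::now();  // taken once the results are available
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, `timeBlockingWait(std::chrono::milliseconds(20))` returns roughly 20 ms or more, even though the calling thread spends nearly all of that interval waiting rather than computing.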
Some details:
OS: Windows 11, 32 GB RAM
CPU: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
GPU: NVIDIA GeForce GTX 1650

The table above contains the results of compressing blocks of data (each consisting of a gradually increasing number of images: 1-frame, 10-frame, 20-frame and 90-frame blocks) and the respective CPU times. The size of each image is 3.9 MB (all images have the same dimensions).
Looking at the "CPU Comp ElapsedTime (ms)" column, you can see that for a 3.9 MB image, compression requires 20 to 30
ms (apart from the first case, which has a very high time that I couldn't explain). This time grows linearly as the file size increases, but the average CPU time per image (column "AvgCPU Compression Time per Image (ms)", obtained as CPU time / number of frames) varies between 12 and 17 ms.
Is it possible to reduce the CPU time required by cudaStreamSynchronize? I.e., is it possible to get the average CPU time per image
under 5 ms?
I tried creating k std::thread instances to compress k images asynchronously, using a different stream for each thread.
Most likely I'm doing something wrong, as the total time for compressing the k images turns out to be high (500 ms).
Could a multithreaded approach reduce the CPU time caused by cudaStreamSynchronize?
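The thread-per-image structure described above can be sketched as follows. This is only an illustration of the pattern, not the code that produced the 500 ms figure: `compressOneImage` and `compressConcurrently` are hypothetical names, and a short sleep stands in for the per-thread stream work, so the timings it produces say nothing about real GPU behavior.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for compressing one image on its own stream.
// In real code this is where a thread would create its stream, queue the
// asynchronous copies and the compression call, and then synchronize.
void compressOneImage(int index, std::vector<int>& done) {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    done[index] = 1;  // each thread writes only its own slot, so no lock is needed
}

// Runs k workers concurrently and returns the total wall-clock time in ms.
double compressConcurrently(int k, std::vector<int>& done) {
    done.assign(k, 0);
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(k);
    for (int i = 0; i < k; ++i) {
        workers.emplace_back(compressOneImage, i, std::ref(done));
    }
    for (auto& worker : workers) {
        worker.join();  // wait for every image before stopping the clock
    }

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing the whole batch from before the first thread starts until after the last `join()` measures the total for all k images, including thread creation, which is the quantity reported above.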
Thank you in advance
Andrea