[QST] Could the cudaStreamSynchronize method be a bottleneck?

My question is: Is it possible to reduce the CPU time required by cudaStreamSynchronize?
Context:
My goal is to use the nvCOMP library so that compression operations take as little CPU time as possible.
For that reason, I measured the CPU time with std::chrono, starting immediately before the host-to-device data transfer
and stopping immediately after the compressed data is available on the host. To use the compressed data on the host for
CPU-side operations (save it to disk, send it over the network, ...), I have to call cudaStreamSynchronize(_cudaStream), where _cudaStream is the stream used by all cudaMemcpyAsync operations and by the nvcompBatchedZstdCompressAsync compression method (or its LZ4, Deflate, and other equivalents). All operations, i.e. the host-to-device transfer, the compression, and the device-to-host transfer, are issued with asynchronous methods on _cudaStream. The problem is that the CPU time I measure after calling cudaStreamSynchronize seems high to me, and for my purposes it appears to be the bottleneck.
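
The measurement described above can be sketched as follows. This is a CPU-only model, not the actual nvCOMP code: a `std::future` stands in for the CUDA stream, `done.wait()` plays the role of cudaStreamSynchronize, and the names `Timing`, `enqueue_fake_gpu_work`, and `measure` are hypothetical. It splits the elapsed time into the part spent enqueuing the work and the part spent blocked in the synchronization call.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical timing record; the names are illustrative, not from nvCOMP.
struct Timing {
    double enqueue_ms;  // time the caller spends launching the work
    double wait_ms;     // time the caller spends blocked until the work is done
};

// Stand-in for asynchronous GPU work: sleeps for `work_ms` on another thread.
// In the real program this would be the cudaMemcpyAsync / compress /
// cudaMemcpyAsync sequence enqueued on the stream.
inline std::future<void> enqueue_fake_gpu_work(int work_ms) {
    return std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
}

// Times the enqueue step and the blocking wait separately.
inline Timing measure(int work_ms) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    const auto t0 = Clock::now();
    std::future<void> done = enqueue_fake_gpu_work(work_ms);  // returns quickly
    const auto t1 = Clock::now();
    done.wait();  // plays the role of cudaStreamSynchronize
    const auto t2 = Clock::now();

    return {Ms(t1 - t0).count(), Ms(t2 - t1).count()};
}
```

In this model the wait absorbs almost the whole duration of the asynchronous work, so a large `wait_ms` says the work itself is slow, not that the synchronization call is expensive. The same holds for a wall-clock measurement around cudaStreamSynchronize, which blocks until all work previously enqueued on the stream has completed.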

Some details:  
OS: Windows 11, 32 GB RAM
CPU: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
GPU: NVIDIA GeForce GTX 1650
![image](https://github.com/NVIDIA/nvcomp/assets/16136616/2f37ce59-7437-4bea-b3c9-d3f9bd09650b)

The table above contains the results of compressing data blocks made up of a gradually increasing number of images (1-frame, 10-frame, 20-frame, and 90-frame blocks) and the respective CPU times. Each image is 3.9 MB (all images have the same dimensions).
 
Looking at the "CPU Comp ElapsedTime (ms)" column, you can see that compressing a single 3.9 MB image takes 20 to 30
ms (apart from the first case, which has a very high time that I couldn't explain). This time grows linearly as the data size increases, but the average CPU time per image (column "AvgCPU Compression Time per Image (ms)", obtained as CPU time / number of frames) varies between 12 and 17 ms.

Is it possible to reduce the CPU time required by cudaStreamSynchronize? That is, is it possible to get the average CPU time per image
under 5 ms?
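
For reference, the 5 ms target can be restated as a throughput figure. This is plain arithmetic on the numbers quoted above (3.9 MB per image, 12 to 17 ms measured, 5 ms desired), not a new measurement:

$$
\text{throughput} = \frac{\text{image size}}{\text{time per image}}, \qquad
\frac{3.9\ \text{MB}}{17\ \text{ms}} \approx 229\ \text{MB/s}, \quad
\frac{3.9\ \text{MB}}{12\ \text{ms}} = 325\ \text{MB/s}, \quad
\frac{3.9\ \text{MB}}{5\ \text{ms}} = 780\ \text{MB/s}.
$$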

I tried creating k std::thread instances to compress k images asynchronously, using a different stream for each thread.
Most likely I'm doing something wrong, as the total time for compressing the k images turns out to be high (500 ms).
Could a multithreaded approach reduce the CPU time caused by cudaStreamSynchronize?
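
The structure of that multithreaded attempt can be sketched as below. Again this is a CPU-only model under stated assumptions, not the code that produced the 500 ms figure: each worker only sleeps where a real worker would enqueue work on its own stream and then synchronize, and `ParallelResult`, `fake_compress_and_sync`, and `compress_in_parallel` are hypothetical names. It records the wall time of each worker and of the whole batch, which is the pair of numbers worth comparing.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical result type; the names are illustrative only.
struct ParallelResult {
    std::vector<double> per_thread_ms;  // wall time each worker spent on its image
    double total_ms;                    // wall time from first launch to last join
};

// Stand-in for "enqueue work on this thread's stream, then synchronize".
// A real worker would use its own cudaStream_t; here it only sleeps.
inline void fake_compress_and_sync(int work_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
}

// Runs k workers concurrently, one image per worker, and times them.
inline ParallelResult compress_in_parallel(int k, int work_ms) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    ParallelResult result{std::vector<double>(k, 0.0), 0.0};
    std::vector<std::thread> workers;
    workers.reserve(k);

    const auto start = Clock::now();
    for (int i = 0; i < k; ++i) {
        // Each worker writes only its own slot, so no locking is needed.
        workers.emplace_back([&result, i, work_ms] {
            const auto t0 = Clock::now();
            fake_compress_and_sync(work_ms);
            result.per_thread_ms[i] = Ms(Clock::now() - t0).count();
        });
    }
    for (auto& w : workers) w.join();
    result.total_ms = Ms(Clock::now() - start).count();
    return result;
}
```

In this model the sleeps overlap perfectly, so `total_ms` stays close to `work_ms` regardless of k. On a real GPU that is not guaranteed: work on different streams only overlaps when the device has spare capacity, so a batch total well above the single-image time suggests the streams are being serialized rather than run concurrently.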

Thank you in advance
Andrea
