I am trying to do YOLO object detection inference, and copying the outputs from the GPU to the CPU takes 99% of the whole inference time. Any idea how to resolve this bottleneck?
Thanks
This is common. Because most GPU operations are asynchronous, the time attributed to the copy operation includes both the execution of the GPU kernels and the copy itself.
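For illustration, a naive timing loop along these lines will attribute almost everything to the copy, because the run call only enqueues the kernels. This is just a sketch assuming a TVM graph-executor style workflow; `module` is a placeholder for your already-built GPU module:

```python
import time

# `module` is a placeholder for an already-built GPU graph-executor module.
start = time.perf_counter()
module.run()                           # enqueues kernels, returns almost immediately
after_run = time.perf_counter()
out = module.get_output(0).asnumpy()   # blocks until the kernels finish, then copies
after_copy = time.perf_counter()

# The "copy" interval below actually contains most of the kernel execution time.
print("run : %.3f ms" % ((after_run - start) * 1e3))
print("copy: %.3f ms" % ((after_copy - after_run) * 1e3))
```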
Does that mean autotuning might accelerate this copying operation ?
It means we should use autotuning or other approaches to accelerate the compute, not the copy. If you place a gpu(0).sync() before the copy, you will find that most of the cost actually goes to the synchronization.
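For example, here is a minimal timing sketch with the sync in place, again assuming a TVM graph-executor style setup where `module` stands in for your own built module. With the device synchronization before the copy, the kernel execution time and the actual device-to-host copy time show up separately:

```python
import time
import tvm

dev = tvm.gpu(0)                       # the same device the module runs on

start = time.perf_counter()
module.run()                           # asynchronous kernel launches
dev.sync()                             # wait for all queued GPU work to complete
after_compute = time.perf_counter()

out = module.get_output(0).asnumpy()   # now this measures only the device-to-host copy
after_copy = time.perf_counter()

print("compute: %.3f ms" % ((after_compute - start) * 1e3))
print("copy   : %.3f ms" % ((after_copy - after_compute) * 1e3))
```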