TVMArrayCopyToBytes function is very slow on Jetson NX

I am running YOLO object detection inference, and copying the outputs from GPU to CPU takes 99% of the total inference time. Any ideas on how to resolve this bottleneck?

Thanks

This is common. Most GPU operations are asynchronous, so the time measured for the copy includes both the execution of the GPU kernels and the copy itself.
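
To make the effect concrete, here is a minimal sketch that uses only the Python standard library. It is not TVM code: the thread pool stands in for the GPU, `fake_kernel` and `KERNEL_SECONDS` are made-up names and values, and reading the result stands in for the device-to-host copy. The "launch" returns almost immediately, and the "copy" absorbs the whole compute time because it has to wait for the kernel to finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor

KERNEL_SECONDS = 0.2  # pretend GPU compute time (made-up value)


def fake_kernel():
    """Hypothetical stand-in for the GPU kernels of one inference."""
    time.sleep(KERNEL_SECONDS)
    return [1.0, 2.0, 3.0]


def measure():
    # A single worker thread plays the role of the asynchronous device.
    with ThreadPoolExecutor(max_workers=1) as device:
        t0 = time.perf_counter()
        pending = device.submit(fake_kernel)  # "launch": does not wait for the kernel
        t1 = time.perf_counter()
        host_data = list(pending.result())    # "copy": blocks until the kernel is done
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1, host_data


launch_s, copy_s, host_data = measure()
print(f"launch: {launch_s:.4f} s, copy: {copy_s:.4f} s")
```

Timing only the call that launches the work would suggest the compute is free and the copy is slow, which is the same picture as the 99% figure in the question.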

Does that mean autotuning might accelerate this copy operation?

It means we should use autotuning or other approaches to accelerate the compute, not the copy. If you place a gpu(0).sync() before the copy, you will find that most of the cost goes to the synchronization.
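
A rough way to check this is to time the launch, the synchronization, and the copy as three separate phases. The sketch below is an illustration under stated assumptions, not TVM's API: `time_phases` is a hypothetical helper, and `FakeDevice` is a made-up stand-in for an asynchronous GPU so the example runs without any hardware.

```python
import threading
import time


def time_phases(run, sync, copy):
    """Time three callables in order: launch, explicit sync, then copy."""
    t0 = time.perf_counter()
    run()
    t1 = time.perf_counter()
    sync()
    t2 = time.perf_counter()
    result = copy()
    t3 = time.perf_counter()
    return {"launch": t1 - t0, "sync": t2 - t1, "copy": t3 - t2}, result


class FakeDevice:
    """Hypothetical asynchronous device; not a TVM class."""

    def __init__(self):
        self._worker = None
        self.output = None

    def run(self):
        def work():
            time.sleep(0.2)  # pretend kernel time (made-up value)
            self.output = [0.5, 0.25]

        # Start the work in the background and return right away.
        self._worker = threading.Thread(target=work)
        self._worker.start()

    def sync(self):
        # Block until all queued work has finished.
        self._worker.join()

    def copy(self):
        self.sync()  # a copy implicitly waits for pending work as well
        return list(self.output)


dev = FakeDevice()
timings, result = time_phases(dev.run, dev.sync, dev.copy)
print(timings)
```

With the explicit sync in place, the wait shows up under "sync" and the copy itself becomes cheap. In TVM's Python runtime the three callables would plausibly be the graph module's `run`, the device's `sync`, and a lambda that calls the output array's `numpy()` (`asnumpy()` in older releases); treat those names as assumptions to verify against your TVM version.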
