I am trying to do YOLO object detection inference, and copying the outputs from the GPU to the CPU takes 99% of the whole inference time. Any idea how to resolve this bottleneck?
Thanks
This is common. Because most GPU operations are asynchronous, the time attributed to the copy operation includes both the execution of the GPU kernels and the copy itself.
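For illustration, a naive timing loop along these lines will attribute almost everything to the copy, because the run call only enqueues the kernels. This is just a sketch assuming a TVM graph-executor style workflow; `module` is a placeholder for your already-built GPU module:

```python
import time

# `module` is a placeholder for an already-built GPU graph-executor module.
start = time.perf_counter()
module.run()                           # enqueues kernels, returns almost immediately
after_run = time.perf_counter()
out = module.get_output(0).asnumpy()   # blocks until the kernels finish, then copies
after_copy = time.perf_counter()

# The "copy" interval below actually contains most of the kernel execution time.
print("run : %.3f ms" % ((after_run - start) * 1e3))
print("copy: %.3f ms" % ((after_copy - after_run) * 1e3))
```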
Does that mean autotuning might accelerate this copying operation ?
It means we should use autotuning or other approaches to accelerate the compute, not the copy. If you place a gpu(0).sync() before the copy, you will find that most of the cost actually goes to the synchronization.
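For example, here is a minimal timing sketch with the sync in place, again assuming a TVM graph-executor style setup where `module` stands in for your own built module. With the device synchronization before the copy, the kernel execution time and the actual device-to-host copy time show up separately:

```python
import time
import tvm

dev = tvm.gpu(0)                       # the same device the module runs on

start = time.perf_counter()
module.run()                           # asynchronous kernel launches
dev.sync()                             # wait for all queued GPU work to complete
after_compute = time.perf_counter()

out = module.get_output(0).asnumpy()   # now this measures only the device-to-host copy
after_copy = time.perf_counter()

print("compute: %.3f ms" % ((after_compute - start) * 1e3))
print("copy   : %.3f ms" % ((after_copy - after_compute) * 1e3))
```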