Any tips on the CPU-to-GPU data copy bottleneck?


Hello everyone,

While working on YOLO, I developed an OpenCL-based TVM model. I was able to reduce the inference time of this large model to 0.01s/image, but the GPU-to-CPU copy makes that worthless, as it takes ~4s. I tried tvm_opencl_sync(), but it did not help. I went through opencl_api.c and found that it already uses the best available method to copy data (the clenqueue and clbuffer methods, with support for opencl >1.2). I read that opencl_sync() makes a CPU core synchronize with the GPU, since the two run asynchronously. Can we ‘globally lock’ opencl_sync() to a separate CPU core using multiprocessing? Would that work, and would it be a better idea? Or should I manually use multiprocessing to set and get inputs so that the CPU is not left idle?
Any tips/suggestions?



0.01s/image -> how did you measure this? The copy waits until all execution has finished. The CPU-GPU copy time cannot be that high. I think the time is being spent running the model.


Timing OpenCL functions, especially ones that need to compile for a GPU target, is tricky.

Can you describe the details of how you perform timing? There are a lot of gotchas. For example, not using a time evaluator means the context is not synchronized, so the execution is lazy. Another issue is that the first run on the GPU will be much slower than subsequent runs, as OpenCL needs to compile the kernel from source on the fly.
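To illustrate the first-run gotcha, here is a minimal sketch in plain Python. It does not use TVM or OpenCL: FakeKernel and time_call are hypothetical names, and the sleep durations are made-up stand-ins for compilation and execution cost. It only shows why the first call should be discarded before averaging.

```python
import time


class FakeKernel:
    """Hypothetical stand-in for an OpenCL function.

    The first call pays a one-time simulated compile cost.
    """

    def __init__(self, compile_s=0.2, run_s=0.01):
        self.compile_s = compile_s
        self.run_s = run_s
        self.compiled = False

    def __call__(self):
        if not self.compiled:
            time.sleep(self.compile_s)  # simulated on-the-fly kernel compilation
            self.compiled = True
        time.sleep(self.run_s)  # simulated kernel execution


def time_call(fn, warmup=1, repeat=10):
    """Return mean seconds per call after discarding `warmup` calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat


# Time a single cold call, which includes the simulated compile step.
cold = FakeKernel()
t0 = time.perf_counter()
cold()
first_run = time.perf_counter() - t0

# Time the steady state on a fresh kernel, skipping one warm-up call.
steady = time_call(FakeKernel(), warmup=1, repeat=10)

print(f"first run: {first_run:.3f}s, steady state: {steady:.3f}s")
```

Reporting the cold number as the per-image time would overstate the cost of every later run, so a warm-up call followed by an average over several runs is usually the safer measurement.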


@siju-samuel I understand now. The time is spent running the model. Thanks!
@eqy I didn’t account for the first run. Yes, the OpenCL kernels execute lazily. Thanks!

I’ll post an update if I find anything tricky in solving this issue; perhaps it is structural. Also, I would like to get involved in the development of OpenCL-based TVM inference methods.

How about mini-batches?


Currently minibatch is supported on opencl.


What does “opencl kernel is lazy execution” mean? Thanks.


By default, if you simply record the time at which the run call to an OpenCL function returns, there is no guarantee that the work is complete at that point. To avoid this when timing, you can use a time_evaluator, or use another method for synchronization (e.g., reading back the result data).
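As a rough illustration of the point above, this sketch simulates an asynchronous device in plain Python. FakeDevice is a hypothetical toy class, not a TVM or OpenCL API, and the 0.3s kernel time is invented. It shows how an unsynchronized measurement makes run() look free and pushes the kernel's cost onto the read-back.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDevice:
    """Toy device with one in-order command queue (hypothetical, not a TVM class)."""

    def __init__(self, kernel_s=0.3):
        self.kernel_s = kernel_s
        self._queue = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def run(self):
        # Enqueue the kernel and return immediately, like an asynchronous launch.
        self._pending = self._queue.submit(time.sleep, self.kernel_s)

    def sync(self):
        # Block until the enqueued work has finished.
        if self._pending is not None:
            self._pending.result()
            self._pending = None

    def get_output(self):
        # Reading the result back has to wait for the kernel first.
        self.sync()
        return [0.0]  # placeholder result

    def close(self):
        self._queue.shutdown()


def elapsed(fn):
    """Return wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


dev = FakeDevice(kernel_s=0.3)

# Naive timing: the kernel's cost shows up in the read-back, not in run().
naive_run = elapsed(dev.run)
naive_copy = elapsed(dev.get_output)

# Synchronized timing: run() followed by sync() captures the kernel time.
synced_run = elapsed(lambda: (dev.run(), dev.sync()))
synced_copy = elapsed(dev.get_output)
dev.close()

print(f"naive:  run={naive_run:.3f}s copy={naive_copy:.3f}s")
print(f"synced: run={synced_run:.3f}s copy={synced_copy:.3f}s")
```

In the naive case the copy appears to be the bottleneck even though it is only waiting for the kernel. In real TVM code, a time_evaluator or reading back the result data would play the role that sync() plays in this toy.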


I understand. It sounds just like a non-blocking wait versus a blocking wait.


What if I don’t call get_output after calling run()? Might the GPU delay executing the OpenCL kernel?