I’m working on YOLO and have built an OpenCL-based TVM model. I was able to reduce the inference time of this large model to 0.01 s/image, but the GPU-to-CPU copy makes that gain worthless, since the copy takes ~4 s. I tried tvm_opencl_sync(), but it did not help. I went through opencl_api.c and found that it already seems to use the best available method to copy data (the clenqueue and clbuffer methods, with support for OpenCL > 1.2).

I read that opencl_sync() makes a CPU core wait for the GPU, since the two run asynchronously. Can we ‘globally lock’ opencl_sync() to a separate CPU core using multiprocessing? Would that work, and would it be a better idea? Or should I manually use multiprocessing to set inputs and get outputs, so that the CPU is not left idle?
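To make the second option concrete, here is a rough sketch of what I mean by handling the set/get manually. It uses a thread and a bounded queue rather than multiprocessing, just to keep it short. `fake_infer` and `fake_copy_to_host` are hypothetical stand-ins I made up (they only sleep and do arithmetic), not TVM or OpenCL calls, and the sketch assumes the real copy would release the GIL while it blocks, which I have not verified.

```python
import queue
import threading
import time

_DONE = object()  # sentinel that tells the copier thread to stop


def fake_infer(frame):
    """Hypothetical stand-in for launching the model on the GPU."""
    time.sleep(0.001)
    return frame * 2


def fake_copy_to_host(device_result):
    """Hypothetical stand-in for the blocking GPU-to-CPU copy."""
    time.sleep(0.005)
    return device_result + 1


def run_pipeline(frames, infer=fake_infer, copy_to_host=fake_copy_to_host, depth=2):
    """Overlap the copy of result i with the inference of frame i+1."""
    pending = queue.Queue(maxsize=depth)  # bounded so inference cannot run far ahead
    results = []

    def copier():
        # Drain device-side results in submission order until the sentinel arrives.
        while True:
            item = pending.get()
            if item is _DONE:
                break
            results.append(copy_to_host(item))

    worker = threading.Thread(target=copier)
    worker.start()
    for frame in frames:
        pending.put(infer(frame))  # blocks only when `depth` results are waiting
    pending.put(_DONE)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

If this pattern is sound, it would only improve throughput over a stream of images. The latency for a single image would still include the full copy time.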
Any tips or suggestions?