CUDA async cudaMemcpyAsync/cudaMallocHost

I recently autotuned a MobileNet-based model and was blown away by the inference speed: ~3 ms for a (1, 3, 720, 1280) input, as reported by the time_evaluator.

Because the raw inference speed is now so fast, I wanted to experiment with reducing the latency around inference, e.g. the host->GPU and GPU->host copies, before jumping into batching, which doesn't always make sense when your input is a live 30 fps camera feed.

This is my first time ever using CUDA, so I don’t want to assert any expertise here.
I wanted to try out cudaMemcpyAsync with pinned memory to hopefully squeeze out a little more of that latency.
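
For context, the underlying CUDA pattern I'm leaning on is roughly the following (a standalone sketch, not code from the diff; the function name, buffer names, and sizes are just for illustration). Host memory allocated with cudaMallocHost is page-locked, which is what lets cudaMemcpyAsync actually run asynchronously; with ordinary pageable memory the transfer generally can't overlap with host work in the same way.

#include <cuda_runtime.h>
#include <cstddef>

void async_readback_sketch(const float* device_buf, size_t count, cudaStream_t stream)
{
  /* pinned (page-locked) host allocation; a plain malloc'd buffer would not give a truly async copy */
  void* host_mem = nullptr;
  cudaMallocHost(&host_mem, count * sizeof(float));
  float* host_pinned = static_cast<float*>(host_mem);

  /* enqueue the device->host copy on the stream; this call returns immediately */
  cudaMemcpyAsync(host_pinned, device_buf, count * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  /* host_pinned is only safe to read after the stream has been synchronized */
  cudaStreamSynchronize(stream);

  cudaFreeHost(host_pinned);
}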

My model has one input and 9 outputs.

With the stock TVM runtime, I got an average of 7.7 ms for:

gpuInputNDArray.CopyFrom(cpu_tensor.get());
graphRuntime->Run();
for (int i = 0; i < output_count; ++i)
{
  const auto& net_output = graphRuntime->GetOutput(i);
  /* this CopyTo is synchronous */
  net_output.CopyTo(cpu_output_tensors[i].get());
}

With cudaMemcpyAsync + cudaMallocHost, I got an average of 5.15 ms.

I did have to include a TVMSynchronize call inside the time measurement, because things are async now.

/* this should be an async copy now */
gpuInputNDArray.CopyFrom(cpu_tensor.get());
graphRuntime->Run();
for (int i = 0; i < output_count; ++i)
{
  const auto& net_output = graphRuntime->GetOutput(i);
  /* this CopyTo is asynchronous now; cpu_output_tensors will not have values until TVMSynchronize() */
  net_output.CopyTo(cpu_output_tensors[i].get());
}
TVMSynchronize(_tvm_context.device_type, _tvm_context.device_id, nullptr);
/* cpu_output_tensors can be read now that the device is synced */
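
To be clear about what that 5.15 ms covers: the timed region has to end after TVMSynchronize, otherwise you only measure how long it takes to enqueue the work. A tiny helper along these lines is what I mean (illustrative only, not the actual benchmarking code):

#include <chrono>
#include <functional>

/* Times one invocation of a callable in milliseconds. The callable must end
   with TVMSynchronize, otherwise only the enqueue time is measured. */
double TimeMs(const std::function<void()>& run_once)
{
  const auto t0 = std::chrono::steady_clock::now();
  run_once();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}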

I know it's only an improvement of ~2.5 ms of latency, but given that the inference time was only around 3 ms, I took this as a win of sorts.

Here’s a link to the diff

Usage is simply:

TVMArrayAlloc(net_output_shape.data(),
              net_output_shape.size(),
              dtype_code,
              dtype_bits,
              dtype_lanes,
              kDLGPU,            /* kept kDLGPU so the allocation goes through cuda_device_api.cc */
              CPU_PINNED_DEVICE, /* 0xffff, a special device id looked for in cuda_device_api.cc */
              &p_dl_output);
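
For anyone who doesn't want to open the diff, the hack boils down to cuda_device_api.cc checking for that sentinel device id and allocating pinned host memory instead of device memory. A rough sketch of that branch (simplified and paraphrased, not the actual diff; error checking and the real AllocDataSpace's alignment/type_hint parameters are omitted):

#include <cuda_runtime.h>
#include <tvm/runtime/c_runtime_api.h>  /* for TVMContext */
#include <cstddef>

constexpr int CPU_PINNED_DEVICE = 0xffff;  /* sentinel device id from the snippet above */

void* AllocDataSpaceSketch(TVMContext ctx, size_t nbytes)
{
  void* ret = nullptr;
  if (ctx.device_id == CPU_PINNED_DEVICE) {
    cudaMallocHost(&ret, nbytes);  /* page-locked host memory, usable with async copies */
  } else {
    cudaSetDevice(ctx.device_id);
    cudaMalloc(&ret, nbytes);      /* regular device memory */
  }
  return ret;
}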

I know this is a bit of a hack, and I don't expect anyone to use this code, but I was ultimately wondering what people think: could using the CUDA async copies have benefits beyond saving ~2.5 ms of latency, or is this just a mostly useless micro-optimization?

This is an interesting proposal. We can certainly introduce a cpu_pinned device. Perhaps we can set a different device type, as in https://github.com/dmlc/dlpack/blob/0acb731e0e43d15deee27b66f10e4c5b4e667913/include/dlpack/dlpack.h#L47, and add a separate DeviceAPI for these types of memory.
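
For reference, the DLDeviceType enum at that revision already carries a pinned-host entry, which is presumably what the link points at; roughly (paraphrased, not a verbatim excerpt):

/* rough excerpt of dlpack's DLDeviceType enum */
typedef enum {
  kDLCPU = 1,
  kDLGPU = 2,
  /* kDLCPUPinned = kDLCPU | kDLGPU: host memory pinned via cudaMallocHost */
  kDLCPUPinned = 3,
  /* ... other device types ... */
} DLDeviceType;

A separate DeviceAPI registered for that device type could then route AllocDataSpace/FreeDataSpace through cudaMallocHost/cudaFreeHost, so the sentinel device id inside CUDADeviceAPI would no longer be needed.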