C++ API test on NVIDIA GPU: run time increases with batch_size and repeat count


I wanted to measure the run time of a model with different batch_size values using the C++ API.
First, I converted a Caffe model (VGG16) to MXNet and compiled it.
Then I varied the batch_size and the repeat count.
Finally, I found that with batch_size = 64 the average time per run increased with the repeat count.
Also, freeing the GPU DLTensor took more time.
Can anyone help me? Thanks!
Part of the code:
tvm::runtime::PackedFunc run = mod.GetFunction("run");
for(int i = 0; i < repeat; ++i)

batch_size=1 repeat=1 runtime=1 freetime(input_gpu)=4
batch_size=1 repeat=10 runtime=2 freetime(input_gpu)=43
batch_size=1 repeat=100 runtime=349 freetime(input_gpu)=78
batch_size=1 repeat=1000 runtime=4098 freetime(input_gpu)=79

batch_size=64 repeat=1 runtime=1 freetime(input_gpu)=364
batch_size=64 repeat=10 runtime=1 freetime(input_gpu)=3506
batch_size=64 repeat=100 runtime=22952 freetime(input_gpu)=11889
batch_size=64 repeat=1000 runtime=340881 freetime(input_gpu)=12023
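
For reference, here is a minimal sketch of how the timing loop could be structured so that the clock stops only after all queued GPU work has finished. It assumes (not confirmed here) that run() only enqueues work on the GPU and returns before the work completes; in that case the unfinished work may be what shows up later as time spent freeing the input tensor. AverageMillis and the sync callback are hypothetical names for illustration, not part of the TVM API.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not part of TVM): returns the average wall-clock
// time in milliseconds of one call to `run`, measured over `repeat` calls.
// `sync` must block until all work queued by `run` has finished. With TVM
// this could be a wrapper around TVMSynchronize(device_type, device_id,
// nullptr) from the C runtime API; that mapping is an assumption.
double AverageMillis(int repeat,
                     const std::function<void()>& run,
                     const std::function<void()>& sync) {
  if (repeat <= 0) return 0.0;

  // Warm-up call so one-time initialization cost is not measured.
  run();
  sync();

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeat; ++i) {
    run();
  }
  // Wait for every queued call to finish before stopping the clock.
  sync();
  const auto stop = std::chrono::steady_clock::now();

  const std::chrono::duration<double, std::milli> elapsed = stop - start;
  return elapsed.count() / repeat;
}
```

With this structure, the time to free the input tensor can be measured separately after the final sync, so the two numbers do not mix.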