TVM's detection speed is much slower than MXNet's on SSD-MobileNet


First, I ran the model directly with MXNet on an RK3399 for object detection.

Its detection speed is about 10 fps, counting only frame capture, data transform, and inference (no display).

However, when I use TVM to compile the model for detection, its speed is only about 2 fps!

Here is the compilation code:

#--------------------------------------------- Local PC --------------------------------------------------
from gluoncv import model_zoo
from tvm import relay

block = model_zoo.get_model("ssd_512_mobilenet1.0_voc", pretrained=True)

dshape = (1, 3, 240, 320)
net, params = relay.frontend.from_mxnet(block, {"data": dshape})

opt_level = 3
target = 'llvm -device=arm_cpu -target=aarch64-linux-gnu'
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build(net, target, params=params)

# export the compiled library (plus the graph json and params) for the board
lib.export_library("deploy_lib.tar")

#----------------------------------------------- RK3399 -------------------------------------------------
import time
import cv2
import mxnet as mx
import gluoncv as gcv
import tvm
from tvm.contrib import graph_runtime

# loaded_graph / loaded_lib are the graph json and compiled module built on the PC
ctx = tvm.cpu()
m = graph_runtime.create(loaded_graph, loaded_lib, ctx)

axes = None
cap = cv2.VideoCapture(0)
cap.set(3, 320)  # frame width
cap.set(4, 240)  # frame height

start = time.time()
# Capture frame-by-frame
ret, frame = cap.read()

# Image pre-processing
frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
rgb_nd, frame = gcv.data.transforms.presets.ssd.transform_test(frame, short=240, max_size=480)

# Do inference
tvm_input = tvm.nd.array(rgb_nd.asnumpy(), ctx=ctx)
m.set_input('data', tvm_input)

# execute
m.run()

# get outputs
class_IDs, scores, bounding_boxs = m.get_output(0), m.get_output(1), m.get_output(2)

# Compute the fps
end = time.time()
seconds = end - start
fps = 1 / seconds
print("Estimated frames per second : {0}".format(fps))

# Display result
axes = gcv.utils.viz.plot_bbox(frame, bounding_boxs.asnumpy()[0], scores.asnumpy()[0], class_IDs.asnumpy()[0], class_names=block.classes, ax=axes)
# plt.draw()
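A side note on the timing in the script: `start` is taken before a single frame's capture and preprocessing, so the printed FPS is one noisy sample of the whole pipeline rather than a stable measure of inference speed. A rolling average over the last N frames gives a steadier number. Below is a minimal, hypothetical helper (`FpsMeter` is illustrative, not part of the original script):

```python
import time
from collections import deque

class FpsMeter:
    """Rolling-average FPS over the last `window` frame timestamps."""
    def __init__(self, window=30):
        self.timestamps = deque(maxlen=window)

    def tick(self):
        """Record a frame completion; return smoothed FPS.

        Returns 0.0 until at least two frames have been recorded.
        """
        self.timestamps.append(time.time())
        if len(self.timestamps) < 2:
            return 0.0
        elapsed = self.timestamps[-1] - self.timestamps[0]
        return (len(self.timestamps) - 1) / elapsed
```

Calling `meter.tick()` once per loop iteration and printing its result replaces the per-frame `start`/`end` arithmetic.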


Can anyone help explain this issue?


I met a similar problem: building with only opt_level = 3 and no auto-tuning, the GluonCV SSD model on CUDA is 5x slower than MXNet. GluonCV should now have full support in TVM; is there a benchmark, test, or official speed-up ratio to share? And what might be the problem in our usage? Thanks a lot! @Laurawly


@kuonangzhe We haven't benchmarked the performance on servers yet; so far we have focused on embedded GPUs. I suggest auto-tuning the convolutions. We'll share the benchmarks once they are ready.


@zzw The default schedules in upstream TVM haven't been auto-tuned for object-detection workloads on ARM CPU, so the inference time is not optimized. I suggest you auto-tune first.
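For readers unfamiliar with what auto-tuning does: autotvm benchmarks many candidate schedules for each operator on the actual target device and keeps the fastest. The toy sketch below mimics only that pick-the-fastest idea in plain Python (`auto_tune` and `measure` are illustrative names, not TVM APIs; real tuning goes through autotvm's task extraction and tuners):

```python
import time

def measure(fn, repeats=10):
    """Time `fn` over several runs; return mean seconds per run."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

def auto_tune(candidates, repeats=10):
    """Benchmark every candidate implementation and return the fastest.

    `candidates` maps a config name to a zero-argument callable, loosely
    standing in for the schedule configurations autotvm explores.
    """
    timings = {name: measure(fn, repeats) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings

# Toy "schedules": two ways to compute the same sum of squares.
data = list(range(10000))
candidates = {
    "python_loop": lambda: sum(x * x for x in data),
    "builtin_map": lambda: sum(map(lambda x: x * x, data)),
}
best, timings = auto_tune(candidates)
print("fastest config:", best)
```

The real flow additionally caches the winning configs in a log file so `relay.build` can reuse them, which is why tuning is a one-time cost per model and device.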


Many thanks for the reply! The point is not about embedded GPU vs. server GPU: currently GluonCV SSD is much slower under TVM than MXNet on Intel CPU (llvm), ARM CPU, and NVIDIA GPU (cuda) alike, which is frustrating. I will try CUDA auto-tuning on my side to see if it works for the SSD model, if you think that is the core solution. I'll update when I get results. Thanks a lot! @Laurawly


I have tried auto-tuning the model (SSD-MobileNet) for object detection, but something went wrong:

The bug of auto-tuning [parameters setting]


Hi, sorry for the late reply. I used autotvm to tune the GluonCV SSD with resnet50_voc. On a 1080Ti, the inference times are as follows:
mxnet: 0.03 s
tvm: 0.8 s
tvm with autotvm: 0.6 s
So there is no obvious improvement on SSD, and the speed is still really slow. Are there any other suggestions?


There is a known issue regarding concatenation performance on GPU that causes this: Explore Optimizations for Concat
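For context on why concat matters for SSD in particular: the detection head emits one small prediction tensor per feature-map scale, and the final outputs are a single concatenation over all of them, which is exactly the many-small-inputs pattern that is hard to schedule well on GPU. The NumPy sketch below illustrates the shape pattern only (the branch sizes are made up, not the real ssd_512_mobilenet1.0_voc anchor counts):

```python
import numpy as np

# One prediction tensor per feature-map scale; each row is one anchor box.
# Shapes are illustrative: (batch, num_anchors_at_scale, id + score + 4 coords).
batch = 1
branch_shapes = [(batch, n, 6) for n in (6144, 1536, 384, 96, 24, 4)]
branches = [np.zeros(s, dtype=np.float32) for s in branch_shapes]

# The final detections are a single concat across every scale. A GPU concat
# kernel over many small inputs like this is easy to schedule badly, which is
# the bottleneck the "Explore Optimizations for Concat" issue discusses.
preds = np.concatenate(branches, axis=1)
print(preds.shape)
```

A single large tensor comes out, but the cost is dominated by how the per-branch copies are launched, not by the arithmetic.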


I have the same problem.


Have you tried updating the concat? One possible solution is making it opaque, as in this PR: