TVM x86 resnet50_v1 on a c5.9xlarge can reach a latency as low as 6.0 ms. However, the total execution time of conv2d in this network is around 3.5 ms, which is only 58% of the end-to-end time, meaning the other, non-compute-intensive ops take 2.5 ms. I dove a bit deeper into this and found that some fused conv2d ops are slower inside resnet50 than when the same fused op is run on its own. For example:
```python
import tvm
from tvm import relay
from tvm.relay import testing

# Shapes for the conv2d-batchnorm-relu block
n, ic, ih, iw = 1, 512, 7, 7
oc, _, kh, kw = 512, 512, 3, 3
sh, sw = 1, 1   # strides
ph, pw = 1, 1   # padding
oh = (ih + 2 * ph - kh) // sh + 1
ow = (iw + 2 * pw - kw) // sw + 1
dshape = (n, ic, ih, iw)
kshape0 = (oc, ic, kh, kw)

data = relay.var("data", shape=dshape)
kernel0 = relay.var("weight", shape=kshape0)
conv0 = relay.nn.conv2d(data, kernel0,
                        strides=(sh, sw), padding=(ph, pw),
                        dilation=(1, 1), kernel_size=(kh, kw))
bn = relay.testing.layers.batch_norm_infer(data=conv0, epsilon=2e-5, name="bn")
out = relay.nn.relu(bn)
```
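As an aside (my own back-of-the-envelope math, not from any profile): a plain-Python FLOP count for this conv2d, using the shapes above and counting a multiply-accumulate as 2 FLOPs, shows the op does about 0.23 GFLOP, so a 0.1 ms execution corresponds to roughly 2.3 TFLOPS. That is a large fraction of the FP32 AVX-512 peak of a c5.9xlarge, which suggests the 0.1 ms standalone time is close to the compute limit and any extra time in the full network is overhead rather than compute.

```python
# FLOP count for the conv2d above (shapes copied from the snippet;
# one multiply-accumulate counted as 2 FLOPs).
n, ic, ih, iw = 1, 512, 7, 7
oc, kh, kw = 512, 3, 3
sh, sw, ph, pw = 1, 1, 1, 1

oh = (ih + 2 * ph - kh) // sh + 1   # output height
ow = (iw + 2 * pw - kw) // sw + 1   # output width

flops = 2 * n * oc * oh * ow * ic * kh * kw
print(flops)                         # total FLOPs for one forward pass

# Implied throughput if the op really takes 0.1 ms
tflops_at_01ms = flops / 0.1e-3 / 1e12
print(f"{tflops_at_01ms:.2f} TFLOPS")
```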
This simple conv2d-batchnorm-relu pattern appears in resnet50. TVM generates a fused function "fused_nn_contrib_conv2d_NCHWc_add_nn_relu" for this pattern, which shows up as the second-to-last fused conv2d op in resnet50. If we compile just this small network and profile it, fused_nn_contrib_conv2d_NCHWc_add_nn_relu executes in 0.1 ms, almost the same as the conv2d op alone. However, in resnet50_v1 the execution time of this fused op grows to 0.2 ms. I can't see where the 0.1 ms gap comes from, since the lowered IR is identical for this fused op in both networks. Does anyone have any idea what the reason could be?