TVM x86 resnet50_v1 on a c5.9xlarge can reach a latency as low as 6.0 ms. However, the total execution time of conv2d in this network is around 3.5 ms, which is only 58% of the end-to-end time, meaning the other, non-compute-intensive ops take 2.5 ms. I dove a bit deeper into this and found that some fused conv2d ops are slower inside resnet50 than when the same fused op is run on its own. For example:
```python
import tvm
from tvm import relay
from tvm.relay import testing

# Shapes for the conv2d-batchnorm-relu block
n, ic, ih, iw = 1, 512, 7, 7
oc, _, kh, kw = 512, 512, 3, 3
sh, sw = 1, 1   # strides
ph, pw = 1, 1   # padding
oh = (ih + 2 * ph - kh) // sh + 1
ow = (iw + 2 * pw - kw) // sw + 1
dshape = (n, ic, ih, iw)
kshape0 = (oc, ic, kh, kw)

data = relay.var("data", shape=dshape)
kernel0 = relay.var("weight", shape=kshape0)
conv0 = relay.nn.conv2d(data, kernel0,
                        strides=(sh, sw), padding=(ph, pw),
                        dilation=(1, 1), kernel_size=(kh, kw))
bn = relay.testing.layers.batch_norm_infer(data=conv0, epsilon=2e-5, name="bn")
out = relay.nn.relu(bn)
```
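As an aside (my own back-of-the-envelope math, not from any profile): a plain-Python FLOP count for this conv2d, using the shapes above and counting a multiply-accumulate as 2 FLOPs, shows the op does about 0.23 GFLOP, so a 0.1 ms execution corresponds to roughly 2.3 TFLOPS. That is a large fraction of the FP32 AVX-512 peak of a c5.9xlarge, which suggests the 0.1 ms standalone time is close to the compute limit and any extra time in the full network is overhead rather than compute.

```python
# FLOP count for the conv2d above (shapes copied from the snippet;
# one multiply-accumulate counted as 2 FLOPs).
n, ic, ih, iw = 1, 512, 7, 7
oc, kh, kw = 512, 3, 3
sh, sw, ph, pw = 1, 1, 1, 1

oh = (ih + 2 * ph - kh) // sh + 1   # output height
ow = (iw + 2 * pw - kw) // sw + 1   # output width

flops = 2 * n * oc * oh * ow * ic * kh * kw
print(flops)                         # total FLOPs for one forward pass

# Implied throughput if the op really takes 0.1 ms
tflops_at_01ms = flops / 0.1e-3 / 1e12
print(f"{tflops_at_01ms:.2f} TFLOPS")
```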
This simple conv2d-batchnorm-relu pattern appears in resnet50. TVM generates a fused function "fused_nn_contrib_conv2d_NCHWc_add_nn_relu" for this pattern, which shows up as the second-to-last fused conv2d op in resnet50. If we compile just this small network and profile it, fused_nn_contrib_conv2d_NCHWc_add_nn_relu executes in 0.1 ms, almost the same as the conv2d op alone. However, in resnet50_v1 the execution time of this fused op grows to 0.2 ms. I can't see where the 0.1 ms gap comes from, since the lowered IR is identical for this fused op in both networks. Does anyone have any idea what the reason could be?