I was looking at the conv2d optimization tutorial at https://docs.tvm.ai/vta/tutorials/optimize/convolution_opt.html#sphx-glr-vta-tutorials-optimize-convolution-opt-py and running it on the cycle-accurate simulator (tsim). With a GEMM size of 16x16, tsim reports about 1.8M clock cycles. That is about 4x slower than it would be if the layer were compute bound. (tsim does not model DRAM bandwidth, so the run cannot be memory bound.)
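A rough sketch of how the compute-bound figure can be estimated. It assumes the tutorial's default layer shape (batch 1, 14x14 output, 256 input channels, 256 output channels, 3x3 kernel), which is not stated in this post, and it assumes the 16x16 GEMM core completes one multiply-accumulate per element per cycle. The function name is made up for illustration.

```python
def compute_bound_cycles(batch, out_h, out_w, in_ch, out_ch, k_h, k_w,
                         gemm_rows=16, gemm_cols=16):
    """Lower bound on cycles if the GEMM core were busy every cycle."""
    # Total multiply-accumulates in the convolution.
    macs = batch * out_h * out_w * in_ch * out_ch * k_h * k_w
    # Assumption: the core retires gemm_rows * gemm_cols MACs per cycle.
    macs_per_cycle = gemm_rows * gemm_cols
    # Round up so a partial final cycle still counts.
    return -(-macs // macs_per_cycle)


# Assumed layer shape (tutorial default, not given in the post).
ideal = compute_bound_cycles(1, 14, 14, 256, 256, 3, 3)
measured = 1_800_000  # approximate tsim cycle count quoted above
slowdown = measured / ideal
print(f"ideal cycles: {ideal}, slowdown: {slowdown:.2f}x")
```

Under these assumptions the ideal count is roughly 0.45M cycles, which is consistent with the roughly 4x gap described above.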
I am wondering whether this schedule is suboptimal. I experimented with different schedules using the compute_at and reorder primitives, but none of them seemed to improve performance.
If the schedule is indeed suboptimal, could you let me know where I can find the optimal schedule for this layer, or for ResNet-18 in general?