VTA Conv2d Optimization Schedule and Optimal Throughput

I was looking at the conv2d optimization tutorial https://docs.tvm.ai/vta/tutorials/optimize/convolution_opt.html#sphx-glr-vta-tutorials-optimize-convolution-opt-py while using the cycle-accurate simulator (tsim). With a GEMM size of 16x16, tsim runs for 1.8M clock cycles. This is 4x slower than it would be if the layer were compute bound. (tsim cannot model DRAM bandwidth, so it cannot be memory bound.)
I am wondering whether this schedule is not optimal. I played around with different schedules using the compute_at and reorder primitives, but that did not seem to improve the performance.
If the schedule is indeed suboptimal, could you let me know where I can find the optimal schedule for this layer/ResNet-18?
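For reference, the 4x figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the layer shape is 1x256x14x14 input, 256 output channels, 3x3 kernel, padding 1, stride 1 (my reading of the tutorial workload, not something stated in this thread), and that a 16x16 GEMM core retires 256 multiply-accumulates per cycle with no stalls.

```python
# Compute-bound cycle estimate for a single conv2d layer.
# ASSUMPTION: layer shape below is taken to match the tutorial workload.
batch, height, width = 1, 14, 14
in_channels, out_channels = 256, 256
kernel_h, kernel_w = 3, 3
pad, stride = 1, 1

# Output spatial size for a padded, strided convolution.
out_h = (height + 2 * pad - kernel_h) // stride + 1
out_w = (width + 2 * pad - kernel_w) // stride + 1

# Total multiply-accumulate operations in the layer.
macs = batch * out_h * out_w * out_channels * in_channels * kernel_h * kernel_w

# ASSUMPTION: a 16x16 GEMM core performs 16*16 MACs every cycle, never stalling.
macs_per_cycle = 16 * 16
ideal_cycles = macs // macs_per_cycle

# Cycle count reported from tsim in this post.
measured_cycles = 1_800_000
slowdown = measured_cycles / ideal_cycles

print(f"MACs: {macs}, ideal cycles: {ideal_cycles}, slowdown: {slowdown:.2f}x")
```

Under these assumptions the ideal count is roughly 0.45M cycles, so 1.8M measured cycles is about 4x off the compute bound.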

Thanks.

@joyliu37 thanks for looking into this. There are at the moment two VTA design sources. The first is the initial design (used in the TVM and VTA papers), which was generated with HLS; this is the design that one can test and deploy on the Pynq/Ultra96 boards to run workloads like ResNet-18. We’ve also run tuning on this design to obtain close to “compute bound” performance on the device (as shown by the roofline plots in the TVM paper). The reason it’s not 100% compute bound is that the GEMM and ALU share the same task-level pipeline stage.
The second design (which is specified in Chisel, and supports cycle accurate simulation) is a new addition and is under development/refinement.

Finally, on TSIM not modeling DRAM bandwidth: it is still bandwidth limited by the port width. It might not incorporate a latency model, but it should throttle DRAM accesses according to the memory interface width.
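To illustrate how a port-width limit would enter the picture, here is a rough sketch. The 64-bit port width is a hypothetical value chosen for illustration, not the actual TSIM configuration, and the byte counts assume int8 tensors for the same assumed layer shape (1x256x14x14 input, 256 output channels, 3x3 kernel), each moved over the port exactly once.

```python
def transfer_cycles(num_bytes, port_width_bits):
    """Minimum cycles to move num_bytes over a port of the given width."""
    bytes_per_cycle = port_width_bits // 8
    return -(-num_bytes // bytes_per_cycle)  # ceiling division

# ASSUMPTION: int8 data, each tensor transferred exactly once (no reloads).
input_bytes = 1 * 14 * 14 * 256
weight_bytes = 256 * 256 * 3 * 3
output_bytes = 1 * 14 * 14 * 256
total_bytes = input_bytes + weight_bytes + output_bytes

# HYPOTHETICAL port width; substitute the real memory interface width.
port_width_bits = 64
mem_cycles = transfer_cycles(total_bytes, port_width_bits)

# Compute-bound estimate for the same layer on a 16x16 GEMM core.
compute_cycles = (14 * 14 * 256 * 256 * 3 * 3) // (16 * 16)

bound = "memory" if mem_cycles > compute_cycles else "compute"
print(f"memory cycles: {mem_cycles}, compute cycles: {compute_cycles}, bound: {bound}")
```

With these numbers the single-pass transfer cost sits well below the compute cost, so port width alone would only dominate if the schedule reloads inputs or weights many times.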

Also, for reproducibility’s sake, can you give a pointer to the conv2d workload you ran on the Pynq? It might be worth looking into what’s happening inside the Chisel-based VTA design so we can improve it. Adding @vegaluis to the thread.

Thanks for the reply. I have not had a chance to run on the FPGA board; I only ran the Chisel-based cycle-accurate simulator. From your description, I think the reason is that the Chisel-based design is not 100% the same as the Vivado HLS one. I was running the conv2d layer from the tutorial: https://docs.tvm.ai/vta/tutorials/optimize/convolution_opt.html#sphx-glr-vta-tutorials-optimize-convolution-opt-py. I would expect a 10-20% throughput gap from the optimal case, but 4x slower suggests that something went wrong.