Hi @thierry,
I saw you mentioned "Address scalability issues for data center FPGA" in the "tvm-vta-architecture-scope-and-roadmap" topic. As I understand it, VTA is configurable: by setting different "BATCH" and "BLOCK" values it can provide different MAC/cycle compute capability and thus scale across different data center FPGAs. Could I ask which scalability issues you think need to be addressed? Are they related to any of the following topics?
Regards
Hua
1. Reduce the DDR access cost?
For the resnet18 sample, currently every conv2d has to write its
result out to DDR, and the next conv2d then loads that data back
from DDR for compute; the same DDR round trip happens at every
conv2d. If a data center FPGA has enough SRAM, we could keep this
data on the FPGA and avoid these DDR accesses.
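To make the potential saving concrete, here is a rough back-of-the-envelope sketch. The layer output shapes below are hypothetical resnet18-like values I picked for illustration (int8 activations assumed); a real analysis would read the shapes from the compiled graph.

```python
# Rough estimate of the DDR traffic avoidable by keeping conv2d
# intermediates on-chip. Shapes are assumed, not taken from VTA.

# (height, width, channels) of each conv2d output, int8 data
layer_outputs = [
    (56, 56, 64),
    (56, 56, 64),
    (28, 28, 128),
    (28, 28, 128),
    (14, 14, 256),
]

BYTES_PER_ELEM = 1  # int8 activations

def intermediate_ddr_bytes(outputs):
    """DDR traffic for intermediates: each tensor is written once by its
    producer conv2d and read once by the consumer. The final output is
    excluded since it has to leave the chip anyway."""
    total = 0
    for h, w, c in outputs[:-1]:
        total += 2 * h * w * c * BYTES_PER_ELEM  # one store + one load
    return total

saved = intermediate_ddr_bytes(layer_outputs)
print(f"DDR bytes avoidable with on-chip buffering: {saved}")

largest = max(h * w * c for h, w, c in layer_outputs[:-1])
print(f"SRAM needed for the largest intermediate: {largest} bytes")
```

The interesting comparison is the last line: if the FPGA's SRAM budget covers the largest intermediate tensor, the whole store/load round trip between layers disappears.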
2. Multiple compute core VTA
A data center FPGA has more LUTs/DSPs, so it may support a
256 * 256 PE array. But some CNN layers have fewer output
channels, for example a maximum of only 128. In such a case,
splitting one 256*256 compute core into four 128*128 compute
cores and pipelining multiple inputs across them seems like it
would give higher throughput.
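A quick utilization sketch of the split-core idea, under the assumption that each PE column maps to one output channel and that independent inputs can be pipelined across cores:

```python
# Back-of-the-envelope comparison: one 256x256 GEMM core vs four
# 128x128 cores, for a layer whose output-channel count limits how
# many PE columns do useful work. Mapping is an assumption for
# illustration, not VTA's actual scheduling.

def effective_macs_per_cycle(cores, rows, cols, out_channels):
    """Columns beyond out_channels sit idle; with multiple cores,
    each core works on its own input from the pipeline."""
    useful_cols = min(cols, out_channels)
    return cores * rows * useful_cols

mono  = effective_macs_per_cycle(cores=1, rows=256, cols=256, out_channels=128)
split = effective_macs_per_cycle(cores=4, rows=128, cols=128, out_channels=128)

print(mono)   # 1 * 256 * 128 = 32768 MACs/cycle (half the array idle)
print(split)  # 4 * 128 * 128 = 65536 MACs/cycle (fully utilized)
```

Under these assumptions the split configuration doubles effective throughput for the 128-output-channel layer, which is the intuition behind point 2.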
3. Broadcasting or systolic array
VTA compute does vector * matrix by default. If the vector could be
broadcast into multiple columns of the weight matrix, then V*M
could be computed in one cycle. I did not see any broadcast-related
information in the VTA paper or code, so I am not sure whether this
logic already exists.
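For point 3, here is a minimal software stand-in for the broadcast idea: the input vector is conceptually replicated across every weight column, the per-column products are formed in parallel, and each column reduces to one output. In hardware this would be combinational fan-out plus per-column adder trees; the Python below only illustrates the dataflow.

```python
# Sketch of broadcasting a vector across weight columns so each
# column's dot product could be formed in one parallel step.

def vector_matrix_broadcast(vec, mat):
    """vec: length-K input; mat: K x N weight matrix.
    Conceptually vec fans out to every column j, the K products
    vec[k] * mat[k][j] are computed in parallel, then each column
    is reduced to a single output."""
    n = len(mat[0])
    return [sum(vec[k] * mat[k][j] for k in range(len(vec)))
            for j in range(n)]

# Example: K=3 inputs, N=2 output channels
v = [1, 2, 3]
W = [[1, 4],
     [2, 5],
     [3, 6]]
print(vector_matrix_broadcast(v, W))  # [14, 32]
```

Whether VTA's GEMM intrinsic already realizes this fan-out internally, or serializes over columns, is exactly the question I am raising above.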