VTA for new FPGA

Dear community members:

I have read through the VTA paper (1807.04188v3) and would like to ask a few questions about customizing the hardware for a new FPGA.

----- quote -----
p.1: a microcode-ISA which implements a wide variety of operators with single-cycle tensor-tensor operations.

p.5: Architectural knobs include GEMM hardware intrinsic shape, data types, number of parallel arithmetic units in the tensor ALU, ALU operations, BRAM distribution between on-chip memories.
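To make the GEMM-related knobs concrete, here is a minimal Python sketch of how the intrinsic shape and data widths determine how many MACs the GEMM core issues per cycle. It assumes log2-encoded parameters in the spirit of VTA's hardware configuration (batch, input block, output block, data widths); the class and field names are hypothetical, and the real configuration may name or split them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmKnobs:
    """Hypothetical subset of VTA-style architectural knobs (log2 encoded)."""
    log_batch: int      # log2 of the batch dimension of one GEMM intrinsic
    log_block_in: int   # log2 of the reduction (input-channel) dimension
    log_block_out: int  # log2 of the output-channel dimension
    log_inp_width: int  # log2 of the input data width in bits
    log_wgt_width: int  # log2 of the weight data width in bits

    def intrinsic_shape(self):
        """Return (batch, block_in, block_out) of one GEMM intrinsic."""
        return (1 << self.log_batch, 1 << self.log_block_in, 1 << self.log_block_out)

    def macs_per_cycle(self):
        """Multiply-accumulates per cycle for a core with initiation interval 1."""
        batch, block_in, block_out = self.intrinsic_shape()
        return batch * block_in * block_out

    def input_tile_bits(self):
        """Bits of input data read by one intrinsic (batch x block_in)."""
        batch, block_in, _ = self.intrinsic_shape()
        return batch * block_in * (1 << self.log_inp_width)

    def weight_tile_bits(self):
        """Bits of weight data read by one intrinsic (block_out x block_in)."""
        _, block_in, block_out = self.intrinsic_shape()
        return block_out * block_in * (1 << self.log_wgt_width)


# Example: batch 1, 16x16 block, 8-bit inputs and weights
knobs = GemmKnobs(log_batch=0, log_block_in=4, log_block_out=4,
                  log_inp_width=3, log_wgt_width=3)
print(knobs.intrinsic_shape())   # (1, 16, 16)
print(knobs.macs_per_cycle())    # 256
print(knobs.input_tile_bits())   # 128
print(knobs.weight_tile_bits())  # 2048
```

Raising `log_block_in` or `log_block_out` by one doubles the MACs per cycle (and the hardware multipliers needed), which is one way to spend a larger FPGA's resources; whether the wider core still closes timing is a separate question.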

----- questions ----

  1. Can I make the GEMM / ALU cores multi-cycle, so that they can be pipelined internally to increase Fmax?

  2. It's said that the number of parallel ALUs can be customized. What about the number of parallel GEMM cores? Suppose I have a very large FPGA with one million LUT-6s and thousands of hardware MACs: how can I maximize the utilization of such resources by implementing multiple parallel GEMM cores?

  3. How can I change the source code to use a new type of off-chip memory, e.g. DDR3 -> GDDR6?
    How do I specify the DDR controller and the DDR memory characteristics, such as bus throughput and latency?

  4. Should I also change the TSIM code to reflect the changes for the new FPGA, so that the TSIM simulation accurately represents the actual hardware performance and function?

Thanks very much!

Kevin

To answer your questions, @kevinyuan:

  1. "Single-cycle" GEMM/ALU actually means an initiation interval of 1, i.e. the pipeline starts one GEMM per cycle. The cores are therefore already pipelined to make timing closure at higher clock frequencies easier (for instance, in VTA the pipeline iteration latency is around 10 cycles).
  2. The ALU customization refers to the width of the vector ALU. Say we have an output tensor of shape [1x16] in VTA's register file: we can evaluate a vector add in 2 cycles with an 8-wide ALU, or in 4 cycles with a 4-wide ALU, etc. There is also functionality to scale the GEMM core the same way: it takes more cycles to evaluate a GEMM so that it uses fewer resources, while keeping the interface the same.
  3. As for the memory: as long as the new memory controller speaks the AXI protocol, the interfaces should be easy to swap.
  4. The TSIM scaffolding should not have to change if you modify the hardware design, but you may need to change TVM depending on what you modify in hardware (in order to update the compiler).
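The cycle arithmetic in answers 1 and 2 can be written as a small model. This is a sketch under simplifying assumptions (a perfectly pipelined core with a fixed initiation interval and iteration latency, no stalls from memory or dependencies); the function names are hypothetical and not part of VTA or TVM, and `folded_gemm_cycles` models just one possible way of time-multiplexing a GEMM core, not necessarily how VTA implements it.

```python
import math


def pipeline_cycles(num_ops: int, latency: int, ii: int = 1) -> int:
    """Cycles to finish num_ops operations on a pipeline with the given
    iteration latency and initiation interval (II), assuming no stalls."""
    if num_ops == 0:
        return 0
    # The first result appears after `latency` cycles; each later op
    # starts `ii` cycles after the previous one.
    return latency + (num_ops - 1) * ii


def alu_cycles(num_elements: int, alu_width: int) -> int:
    """Cycles for an element-wise vector op when the ALU processes
    alu_width lanes per cycle (ceiling covers a partial last chunk)."""
    return math.ceil(num_elements / alu_width)


def folded_gemm_cycles(block_out: int, lanes: int) -> int:
    """Cycles per GEMM intrinsic if only `lanes` of the block_out output
    columns are computed per cycle (hypothetical time-multiplexed core)."""
    return math.ceil(block_out / lanes)


# Answer 1: II = 1 with ~10-cycle latency, 1000 back-to-back GEMMs
print(pipeline_cycles(1000, latency=10))     # 1009
# Answer 2: a [1x16] tensor on an 8-wide vs a 4-wide ALU
print(alu_cycles(16, 8), alu_cycles(16, 4))  # 2 4
# Folding a 16-column GEMM onto 4 lanes: 4 cycles per intrinsic
print(folded_gemm_cycles(16, 4))             # 4
```

The point of answer 1 shows up in the first call: with II = 1 the latency is paid once, so throughput stays at one GEMM per cycle regardless of how deep the pipeline is.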

Much appreciated, @thierry! I am very happy that I can learn things here that go beyond the papers :smile: