[VTA] Scalability for data center FPGAs

Hi @thierry,

I saw you mentioned “Address Scalability issue for data center FPGA” in the “tvm-vta-architecture-scope-and-roadmap” topic. As I understand it, VTA is configurable: by setting different BATCH and BLOCK numbers it can provide different MAC/cycle compute capability, and so should scale to different data center FPGAs (see the small sketch after the list below). Could I ask which scalability issues you think need to be addressed? Are they related to any of the following topics?

Regards

Hua

  1. Reduce the DDR access cost?
     For the resnet18 sample, every Conv2d currently writes its result to DDR and the next Conv2d loads the data back from DDR for compute, so the same DDR round trip happens at every conv2d. If a data center FPGA has enough SRAM, we could keep this data on the FPGA and avoid these DDR accesses.

  2. Multiple compute core VTA
     Data center FPGAs have more LUTs/DSPs, so they may support a 256*256 PE array, but some CNN networks have fewer output channels, for example a maximum of only 128. In such a case, splitting one 256*256 compute core into four 128*128 compute cores seems like it would give more throughput by pipelining multiple inputs.

  3. Broadcasting or systolic array
     VTA compute does Vector * Matrix by default. If the vector can be broadcast to multiple columns of the weight matrix, then V*M should be computable in one cycle. I did not see any broadcast-related information in the VTA paper or code, so I am not sure whether this logic already exists.
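
To make the BATCH/BLOCK scaling point above concrete, here is a tiny back-of-the-envelope sketch (the configurations are made up for illustration, not actual VTA presets):

```python
# VTA's GEMM core multiplies a (BATCH x BLOCK_IN) input tile by a
# (BLOCK_IN x BLOCK_OUT) weight tile every cycle, so peak MACs/cycle is just
# the product of the three knobs.
def peak_macs_per_cycle(batch, block_in, block_out):
    return batch * block_in * block_out

for batch, block_in, block_out in [(1, 16, 16), (2, 32, 32), (4, 64, 64)]:
    print((batch, block_in, block_out), "->",
          peak_macs_per_cycle(batch, block_in, block_out), "MACs/cycle")
# (1, 16, 16) ->   256 MACs/cycle (the 16x16 block case discussed below)
# (4, 64, 64) -> 16384 MACs/cycle (a hypothetical datacenter-class scale-up)
```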

Hi @hijiang,

This is an important topic: scaling VTA designs to large datacenter FPGAs like the F1.

  1. Keeping weights local to SRAM is an optimization that would require some graph-level support. I think it would be an interesting topic to work on, and I would be happy to help draft an RFC with whoever from the community wants to push this forward. A small example would actually be to run a small char-RNN on VTA with the weights pinned in SRAM. Right now, the default behavior is to always load from DDR, which is extremely inefficient.

  2. Scaling the design to multiple compute cores is an interesting direction as well, and perhaps would be a good fix to scale the HLS design (since HLS doesn’t like very large GEMMs). This one would require changes in the TVM IR passes for VTA to distribute work across threads; but we can extend the virtual threading to support physical threads.

  3. Correction: matrix-vector multiplication is done at a pipeline initiation interval of 1, meaning one matrix-vector multiply is issued per cycle, not that a matrix-vector multiply completes in 1 cycle (iteration latency). I do not understand the follow-up sentence: the vector does get broadcast against multiple weight columns.

Thierry

Hi @thierry,

Thanks for the detailed reply; below are some clarifications and questions.

  1. About keeping weights local to SRAM: an F1-class device like the VU9P has 75.9 MB of BRAM, which seems enough for the ResNet18 weights (I saw in your example that the resnet18_qt8.params file is 44.6 MB). Could I ask what the advantage of char-RNN is? Is it that char-RNN has a small parameter size, or that it is a “data bound” case that benefits more from the memory-movement optimization?

  2. About “HLS doesn’t like very large GEMMs”: I don’t fully understand this. Is it related to hardware resources like DSPs, to the pipelined modules, or something else?

  3. About “pipeline initiation interval of 1”: for a big matrix-vector multiply like A(1,256) x B(256,256), with hardware BLOCK size = 16 (256 MAC/cycle), we can reach one A(1,16) * B(16,16) block multiply per cycle. But for a small case, such as a single A(1,16) * B(16,16), the latency is still 4 cycles instead of 1. This issue may get worse on large cloud FPGAs; for example, Google’s TPU supports 64K MAC/cycle (256,256), so many matrix-vector multiplies smaller than 256*256 could not reach one-per-cycle throughput (see the toy cycle model after this list).

     The broadcast mentioned in this paper (http://users.ece.utexas.edu/~gerstl/publications/sbac12.cc.pdf) does not rely on pipelining; by using a broadcast bus it can reach one MV multiply per cycle. I am not sure whether it could benefit VTA performance.
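
To illustrate the II-versus-latency point in item 3, here is a toy cycle model (my own simplification with an assumed iteration latency of 4 cycles, not measured hardware behavior):

```python
# With a 16x16 GEMM core pipelined at initiation interval (II) = 1, a new
# (1,16) x (16,16) block multiply can start every cycle, but each individual
# block still needs a few cycles of latency before its result is ready.
def gemm_cycles(k, n, block=16, ii=1, iter_latency=4):
    n_block_ops = (k // block) * (n // block)      # block multiplies to issue
    return iter_latency + ii * (n_block_ops - 1)   # pipeline fill + steady state

print(gemm_cycles(256, 256))  # A(1,256) x B(256,256): 259 cycles for 256 blocks, ~1/cycle
print(gemm_cycles(16, 16))    # single A(1,16) x B(16,16): 4 cycles, latency-bound
```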

Regards

Hua

Hi @hjiang,

Good questions.

  1. Indeed, VU9P does have enough BRAM; I was suggesting using a smaller neural network to test the optimization on the pynq board for simplicity and availability.

  2. It has to do with the compiler trying to compile too large of a design; however, I haven’t tested the latest iteration 2019.2 which might have fixed some scalability issues.

  3. Thanks for the pointer to the paper. That’s an interesting way to scale the design - essentially the idea is to be able to broadcast a vector or matrix to multiple GEMM cores right?

Thierry

Hi @thierry,

Thanks for the reply. About the broadcasting, you are right: broadcasting should help with a multiple-GEMM-core solution. About the SRAM optimization, I can work on a draft RFC.

Regards

Hua

That would be fantastic; please @-me when you post the RFC.

Hi @thierry,

Thanks for the follow-up. Besides the three topics we discussed, do you think there are any other problems we need to address or think about for “cloud FPGA scalability”?

Regards

Hua

@hijiang - to address your question, these are the problems to address for “cloud FPGA scalability”:

  1. Add software support for PCI-E based hardware accelerator designs (heterogeneous graph partitioning to use the heterogeneous runtime support for VTA, explicit copy insertion, and a memory allocation mechanism for the PCI-E board memory); a toy sketch of this flow follows the list
  2. Support mechanisms for scaling the number of VTA cores in an FPGA design, and compiler support for multiple VTA cores (think physical threading)
  3. Add runtime controlled caching mechanisms for weights in order to keep weights local to large SRAM buffers
  4. Add path for activation reuse in VTA to minimize SRAM-DRAM memory transfers (e.g. mem move from output buffer to input buffer)
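
To make item 1 a bit more concrete, here is a toy, self-contained sketch of the intended flow, using hypothetical classes and helpers of my own rather than the actual TVM heterogeneous runtime API: dense-style ops are placed on the VTA device, everything else stays on the CPU, and explicit copies are inserted whenever a tensor lives on the wrong side of the PCI-E link.

```python
import numpy as np

class Device:
    """A stand-in for a CPU context or a PCI-E attached VTA card."""
    def __init__(self, name):
        self.name = name

class Tensor:
    """A value tagged with the device whose memory holds it."""
    def __init__(self, data, device):
        self.data, self.device = data, device

def copy_to(t, dev):
    # Explicit copy insertion: model the PCI-E DMA as a plain host copy.
    print(f"copy {t.data.shape} from {t.device.name} to {dev.name}")
    return Tensor(t.data.copy(), dev)

def run(graph, env, cpu, vta):
    for name, op, in_names in graph:
        dev = vta if op == "dense" else cpu        # heterogeneous partitioning
        ins = [env[n] if env[n].device is dev else copy_to(env[n], dev)
               for n in in_names]
        if op == "dense":
            out = ins[0].data @ ins[1].data        # offloaded to the VTA card
        else:                                      # "relu" stays on the CPU
            out = np.maximum(ins[0].data, 0.0)
        env[name] = Tensor(out, dev)               # allocated in dev's memory
    return env

cpu, vta = Device("cpu"), Device("vta")
env = {"x": Tensor(np.ones((1, 4)), cpu), "w": Tensor(np.ones((4, 4)), cpu)}
env = run([("y", "dense", ["x", "w"]), ("z", "relu", ["y"])], env, cpu, vta)
print(env["z"].device.name, env["z"].data)
```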

These are some ideas I can think of off the top of my head; @liangfu might be interested in this conversation too.

I recommend reading A Domain-Specific Architecture for Deep Neural Networks, and I agree with the authors. Here is a notable comment from the article:

  • Compared to CPUs and GPUs, the single-threaded TPU has none of the sophisticated microarchitectural features that consume transistors and energy to improve the average case, but not the 99th percentile case; that is, there are no caches, branch prediction, out-of-order execution, multiprocessing, speculative prefetching, address coalescing, multithreading, context switching, and so forth. Minimalism is a virtue of domain-specific processors.

In addition, I would agree with @thierry that we would need runtime support and compiler support for cloud FPGA scalability.

That’s a good quote; keeping the design simple and “minimalist” is definitely a virtue. There are limitations to this single GEMM in cases where a single operator doesn’t expose enough parallelism to max out the FLOPs of the GEMM, in which case it’s preferable to break it into multiple GEMM cores to enable inter-operator parallelism.

I’m not quite sure about the exact meaning of that statement. Did you mean:

  1. the operator hits some extreme case when performing GEMM (e.g. performing GEMM with extremely large weights and small inputs, or the other way around), or
  2. the operator isn’t designed to be accelerated well by GEMM (like the NMS layer in object detection)?

As we are only considering scalability for accelerating DNNs on a domain-specific architecture (DSA), the second case I mentioned above should not be considered, IMHO.

For the first case, to make GEMM more scalable:

  • when we have small inputs for extremely large weights, we can pack multiple inputs into a single GEMM when these inputs share the same weights;
  • when we have a large GEMM core but multiple inputs don’t share the same weights, we can pack the inputs and weights on the diagonal of the GEMM core (see the sketch after this list);
  • when we have an extremely large number of inputs and the inputs share the same weights, we should scale up the BATCH size in the GEMM core design. (Meanwhile, the current GEMM core design is still scalable in my observation.)
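
A small NumPy illustration of the second bullet (my own sketch of the idea, not VTA code): two independent small GEMMs with different weights can share one larger GEMM by packing the weights on the block diagonal and concatenating the inputs.

```python
import numpy as np

blk = 4                                        # toy "small" GEMM size
x1, w1 = np.random.rand(1, blk), np.random.rand(blk, blk)
x2, w2 = np.random.rand(1, blk), np.random.rand(blk, blk)

# Pack the two weight matrices on the diagonal of a (2*blk, 2*blk) "core".
W = np.zeros((2 * blk, 2 * blk))
W[:blk, :blk], W[blk:, blk:] = w1, w2
X = np.concatenate([x1, x2], axis=1)           # pack the two inputs side by side

packed = X @ W                                 # one large GEMM
assert np.allclose(packed[:, :blk], x1 @ w1)   # first operator's result
assert np.allclose(packed[:, blk:], x2 @ w2)   # second operator's result
```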

If we can max out the FLOPS this way, I think “inter-operator parallelism” is not really necessary, as it breaks the rule of “simplicity”.

@liangfu @thierry, this discussion is awesome. Following are some of my thoughts about “single operator doesn’t expose enough parallelism”.

First, I agree with this point. In my understanding, this is about how we block/parallelize an operator like Conv2d, and what we can parallelize beyond the current in/out channel scaling solution on VTA.

For example, for a conv2d with 64 input channels and 64 output channels, hardware with 4096 MACs can fully parallelize the current VTA GEMM compute. But if the hardware has more resources, for example TPUv1 with its 64K int8 MACs, then to fully use the hardware we may need to scale a single GEMM instruction across multiple cores, or in other words parallelize the compute along more dimensions, for example height/width.

If we can split the GEMM along more dimensions and run the pieces on multiple cores in parallel, VTA should get better performance on cloud FPGAs and fully use the hardware resources. The challenge is that doing so may require shared memory in some scenarios, which would eat into the performance gain, and in other scenarios may require huge data bandwidth that the hardware cannot provide.
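
To make the height/width splitting idea concrete, here is a toy NumPy sketch (core count and sizes are arbitrary): the flattened H*W pixels are tiled across several “cores”, each core multiplies its tile by the shared weights, and the partial outputs are stitched back together.

```python
import numpy as np

n_cores, hw, c_in, c_out = 4, 64, 64, 64
act = np.random.rand(hw, c_in)            # flattened H*W pixels x input channels
wgt = np.random.rand(c_in, c_out)         # weights shared by every core

tiles = np.split(act, n_cores, axis=0)    # split the spatial dimension across cores
partials = [t @ wgt for t in tiles]       # each "core" computes its own tile
out = np.concatenate(partials, axis=0)

assert np.allclose(out, act @ wgt)        # same result as one big GEMM
# Note: every core needs its own copy of wgt (or a broadcast path to it), which
# is exactly the shared-memory / bandwidth concern raised above.
```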

About the pack solution, as far as I understand the current VTA already works this way, and the uop and iter_out fields of the GEMM instruction are used for this kind of compression/packing.
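
For reference, this is my reading of how the GEMM instruction’s uop and iteration fields enable that packing, paraphrased in Python; treat the exact field names and indexing as an approximation rather than a spec.

```python
def execute_gemm(insn, uops, acc, inp, wgt):
    """Each innermost step is one (BATCH x BLOCK_IN) x (BLOCK_IN x BLOCK_OUT) block GEMM."""
    for i0 in range(insn["iter_out"]):
        for i1 in range(insn["iter_in"]):
            for u in range(insn["uop_bgn"], insn["uop_end"]):
                acc_idx, inp_idx, wgt_idx = uops[u]    # a micro-op names one tile triple
                a = acc_idx + i0 * insn["dst_factor_out"] + i1 * insn["dst_factor_in"]
                x = inp_idx + i0 * insn["src_factor_out"] + i1 * insn["src_factor_in"]
                w = wgt_idx + i0 * insn["wgt_factor_out"] + i1 * insn["wgt_factor_in"]
                acc[a] += inp[x] @ wgt[w].T            # accumulate one output tile
```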