How does VTA handle tiling's spatial dependency?


Does anybody know how VTA handles spatial dependencies per tile?
In a simple case such as a 3x3 conv layer, how does VTA handle the spatial dependency when DMA-ing input data? Do we always do an overlapped transfer of one extra line at the top/bottom of each tile (assuming 1D slicing, so we can ignore left/right to simplify the problem)?


Looking at the code here for how the transfer size is calculated:

it seems TVM will always do overlapped transfers to satisfy the spatial dependency of stencil operations, instead of taking advantage of an on-chip circular buffer?
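To make the overlap cost concrete, here is a small back-of-the-envelope sketch (not actual VTA/TVM code; the function names are hypothetical) of how many input lines an overlapped transfer scheme moves for a KxK stencil under 1D row slicing, compared to keeping the halo rows on chip:

```python
# Hypothetical sketch (not VTA code): count input lines moved per tile
# for a KxK stencil with a given stride, using 1D slicing over rows.
def input_lines_per_tile(out_lines, kernel=3, stride=1):
    # Each output line reads `kernel` input lines; consecutive output
    # lines advance by `stride`, so a tile of `out_lines` outputs needs:
    return (out_lines - 1) * stride + kernel

# With overlapped transfers, every tile re-fetches the halo rows it
# shares with its neighbours instead of reusing them on-chip.
def total_lines_transferred(out_height, tile_out_lines, kernel=3, stride=1):
    n_tiles = -(-out_height // tile_out_lines)  # ceiling division
    return n_tiles * input_lines_per_tile(tile_out_lines, kernel, stride)

# Example: 32 output lines, tiles of 8 output lines, 3x3 conv, stride 1:
# each tile transfers 10 input lines, so 4 tiles move 40 lines total,
# versus (32 - 1) + 3 = 34 lines if the 2-line halo stayed on chip.
print(total_lines_transferred(32, 8))  # -> 40
```

The gap grows as tiles get smaller relative to the kernel, which is why a circular buffer matters more for deep pipelines of small tiles.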

Questions about memory latency hiding on DSP


Yes, I also think the VTA example retransmits the neighbouring rows/columns of each tile.
I thought this was mostly because the VTA architecture does not have any kind of ring buffer.

Where are you getting the information that it does?


From nowhere.
I was just intuitively thinking that for an accelerator, like a DSP with SW-controlled DMA, a circular buffer is the way to save bandwidth, and I expected an AI compiler to properly model and optimize this part, but it seems TVM doesn't yet.


Which DSP are you referring to? (I’m just curious because I don’t really know much about DMA-controlled ring buffers.)

The VTA doesn’t have them, so even if TVM were able to support non-overlapping tiling, the VTA example would not give you evidence that it is unsupported.

At the TVM conference, Qualcomm showed efforts to use TVM to compile for the Hexagon.
Since that’s a DSP, you can probably look into what TVM supports there.


DMA is for data movement only and won’t be aware of ring buffers. Ring buffers are purely managed by SW, which tells the DMA where to write data. So an AI compiler (like TVM) needs to be aware of, and model, the access pattern so that the ring buffer can satisfy dependencies across iteration instances (like filtering the input line by line), and it needs to properly control when ring buffer slots are released and reused. This is what I am looking into TVM for: whether this is supported, how this information is represented in its IR, and what transformations are available.
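The SW-managed pattern described above can be sketched roughly as follows (a hypothetical illustration, not TVM or any real DSP runtime: the class and method names are made up). SW decides which slot the DMA fills next, and a slot is implicitly free for reuse once the stencil window has slid past it:

```python
# Hypothetical sketch of a SW-managed circular line buffer feeding a
# 3-line stencil. The "DMA" only writes where SW points it; a compiler
# or runtime must track which slots can be released and reused.
class LineRingBuffer:
    def __init__(self, n_slots):
        self.slots = [None] * n_slots   # on-chip line slots
        self.next_write = 0             # slot index the DMA fills next

    def dma_fill(self, line):
        # SW tells the DMA which slot to write into, then advances.
        self.slots[self.next_write % len(self.slots)] = line
        self.next_write += 1

    def window(self, first_line, k):
        # Return k consecutive lines for the stencil; slots holding
        # lines older than `first_line` are implicitly free for reuse.
        return [self.slots[(first_line + i) % len(self.slots)]
                for i in range(k)]

# 3-line stencil over rows: keep only 3 (+1 prefetch) lines on chip.
ring = LineRingBuffer(n_slots=4)
for row in range(3):                      # prologue: fill first window
    ring.dma_fill(f"line{row}")
for out_row in range(4):                  # steady state: 1 new line/row
    win = ring.window(out_row, 3)         # lines out_row .. out_row+2
    # ... compute stencil on `win` ...
    ring.dma_fill(f"line{out_row + 3}")   # overwrite the oldest slot
```

The hard part for a compiler is exactly what the post says: proving when a slot may be overwritten (the cross-iteration dependency) and emitting the prologue/steady-state structure automatically.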

Hexagon, thanks for the information, but I think it is cache-based (is that right? I hope my memory is correct) and shares global memory with the host, which is a different situation.


I'm facing the same problem. Check out Questions about memory latency hiding on DSP.
It seems that the memory latency hiding method in TVM is currently not suitable for DSPs.
Looking forward to more guidance.


I have one question: when you guys talk about a DSP, which processor are you actually referring to?


We are talking about different chips, and which particular one is not important. If you want an example, this could be one, I think.


Thanks for that!
This paper really clarifies the specific problems on DSPs. But it seems the SW implementation of double buffering or circular buffering isn't mentioned.
Do you have any ideas, or do you know of similar work on that?