How does VTA handle tiling's spatial dependency?


Does anybody know how VTA handles spatial dependencies per tile?
In a simple case such as a 3x3 conv layer, how does VTA handle the spatial dependency when DMA-ing input data? Do we always do an overlapped transfer of one extra line at the top/bottom of each tile (assuming 1D slicing, so we can ignore left/right to simplify the problem)?


Looking at the code here for how the transfer size is calculated:

it seems TVM will always do overlapped transfers to satisfy the spatial dependency of stencil operations, instead of taking advantage of an on-chip circular buffer?
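To make the overlap cost concrete, here is a small back-of-the-envelope sketch (not actual VTA/TVM code; the function names are hypothetical) of how many input lines an overlapped transfer scheme moves for a KxK stencil under 1D row slicing, compared to keeping the halo rows on chip:

```python
# Hypothetical sketch (not VTA code): count input lines moved per tile
# for a KxK stencil with a given stride, using 1D slicing over rows.
def input_lines_per_tile(out_lines, kernel=3, stride=1):
    # Each output line reads `kernel` input lines; consecutive output
    # lines advance by `stride`, so a tile of `out_lines` outputs needs:
    return (out_lines - 1) * stride + kernel

# With overlapped transfers, every tile re-fetches the halo rows it
# shares with its neighbours instead of reusing them on-chip.
def total_lines_transferred(out_height, tile_out_lines, kernel=3, stride=1):
    n_tiles = -(-out_height // tile_out_lines)  # ceiling division
    return n_tiles * input_lines_per_tile(tile_out_lines, kernel, stride)

# Example: 32 output lines, tiles of 8 output lines, 3x3 conv, stride 1:
# each tile transfers 10 input lines, so 4 tiles move 40 lines total,
# versus (32 - 1) + 3 = 34 lines if the 2-line halo stayed on chip.
print(total_lines_transferred(32, 8))  # -> 40
```

The gap grows as tiles get smaller relative to the kernel, which is why a circular buffer matters more for deep pipelines of small tiles.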

Questions about memory latency hiding on DSP


Yes, I also think the VTA example retransmits the neighbouring rows/columns of each tile.
I thought this was mostly because the VTA architecture does not have any kind of ring buffer.

Where are you getting the information that it does?


From nowhere.
I was just intuitively thinking that for an accelerator, like a DSP with SW-controlled DMA, a circular buffer is the way to save bandwidth, and I expected an AI compiler to properly model and optimize this part, but it seems TVM doesn't yet.


Which DSP are you referring to? (I’m just curious because I don’t really know much about DMA-controlled ring buffers.)

The VTA doesn’t have them, so even if TVM were able to support non-overlapping tiling, the VTA example would not give you evidence that it is unsupported.

At the TVM conference, Qualcomm showed efforts to use TVM to compile for the Hexagon.
Since that’s a DSP, you can probably look into what TVM supports there.


DMA is for data movement only and won’t be aware of ring buffers. Ring buffers are purely managed by SW, which tells the DMA where to write data. So an AI compiler (like TVM) needs to be aware of, and model, the access pattern so that the ring buffer can satisfy dependencies across iteration instances (like filtering the input line by line), and it needs to properly control when ring buffer slots are released and reused. This is what I am looking into TVM for: whether this is supported, how this information is represented in its IR, and what transformations are available.
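The SW-managed pattern described above can be sketched roughly as follows (a hypothetical illustration, not TVM or any real DSP runtime: the class and method names are made up). SW decides which slot the DMA fills next, and a slot is implicitly free for reuse once the stencil window has slid past it:

```python
# Hypothetical sketch of a SW-managed circular line buffer feeding a
# 3-line stencil. The "DMA" only writes where SW points it; a compiler
# or runtime must track which slots can be released and reused.
class LineRingBuffer:
    def __init__(self, n_slots):
        self.slots = [None] * n_slots   # on-chip line slots
        self.next_write = 0             # slot index the DMA fills next

    def dma_fill(self, line):
        # SW tells the DMA which slot to write into, then advances.
        self.slots[self.next_write % len(self.slots)] = line
        self.next_write += 1

    def window(self, first_line, k):
        # Return k consecutive lines for the stencil; slots holding
        # lines older than `first_line` are implicitly free for reuse.
        return [self.slots[(first_line + i) % len(self.slots)]
                for i in range(k)]

# 3-line stencil over rows: keep only 3 (+1 prefetch) lines on chip.
ring = LineRingBuffer(n_slots=4)
for row in range(3):                      # prologue: fill first window
    ring.dma_fill(f"line{row}")
for out_row in range(4):                  # steady state: 1 new line/row
    win = ring.window(out_row, 3)         # lines out_row .. out_row+2
    # ... compute stencil on `win` ...
    ring.dma_fill(f"line{out_row + 3}")   # overwrite the oldest slot
```

The hard part for a compiler is exactly what the post says: proving when a slot may be overwritten (the cross-iteration dependency) and emitting the prologue/steady-state structure automatically.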

Hexagon, thanks for the information, but I think it is cache-based (is that right? I hope my memory is correct) and shares global memory with the host, which is a different situation.


I'm facing the same problem. Check out Questions about memory latency hiding on DSP.
It seems that the memory latency hiding method in TVM is currently not suitable for DSPs.
Looking forward to more guidance.


I have one question: when you guys talk about a DSP, which processor are you actually referring to?


We are talking about different chips, and which particular one is not important. If you want an example, this could be one, I think.


Thanks for that!
This paper really clarifies the specific problems on DSPs. But it seems the SW implementation of double buffering or circular buffering isn't mentioned.
Do you have any ideas, or do you know of similar work on that?