Hello, I have a question about VTA hardware.
While trying to understand how the VTA hardware works, I am having trouble understanding how the dependence queues work.
Dumping the instructions with debug_flag=0x6 gives the following output:
INSTRUCTION 0: NOP-STORE-STAGE
dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
l2g_queue = 0, g2l_queue = 0
s2g_queue = 1, g2s_queue = 0
INSTRUCTION 1: NOP-STORE-STAGE
dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
l2g_queue = 0, g2l_queue = 0
s2g_queue = 2, g2s_queue = 0
INSTRUCTION 2: LOAD UOP
dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
DRAM: 0x18600000, SRAM:0x0000
y: size=1, pad=[0, 0]
x: size=8, stride=8, pad=[0, 0]
l2g_queue = 0, g2l_queue = 0
s2g_queue = 1, g2s_queue = 0
INSTRUCTION 3: GEMM
dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
reset_out: 1
range (0, 8)
outer loop - iter: 56, wgt: 0, inp: 0, acc: 1
inner loop - iter: 2, wgt: 0, inp: 0, acc: 448
l2g_queue = 0, g2l_queue = 1
s2g_queue = 1, g2s_queue = 0
INSTRUCTION 4: LOAD UOP
dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
DRAM: 0x18600008, SRAM:0x0008
y: size=1, pad=[0, 0]
x: size=8, stride=8, pad=[0, 0]
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 5: GEMM
dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
reset_out: 1
range (8, 16)
outer loop - iter: 56, wgt: 0, inp: 0, acc: 1
inner loop - iter: 2, wgt: 0, inp: 0, acc: 448
l2g_queue = 0, g2l_queue = 2
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 6: LOAD INP
dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
DRAM: 0x06040000, SRAM:0x0000
y: size=9, pad=[1, 0]
x: size=56, stride=56, pad=[1, 1]
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 7: LOAD WGT
dep - pop prev: 0, pop next: 0, push prev: 0, push next: 1
DRAM: 0x00600d00, SRAM:0x0000
y: size=2, pad=[0, 0]
x: size=9, stride=36, pad=[0, 0]
l2g_queue = 1, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 8: LOAD INP
dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
DRAM: 0x06040000, SRAM:0x0244
y: size=9, pad=[1, 0]
x: size=56, stride=56, pad=[1, 1]
l2g_queue = 1, g2l_queue = 0
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 9: LOAD WGT
dep - pop prev: 0, pop next: 0, push prev: 0, push next: 1
DRAM: 0x00600d48, SRAM:0x0012
y: size=2, pad=[0, 0]
x: size=9, stride=36, pad=[0, 0]
l2g_queue = 2, g2l_queue = 0
s2g_queue = 0, g2s_queue = 0
My guess is that instructions 0 through 5 are an initialization phase, since the GEMM instructions there have reset_out set, and LOAD INP and LOAD WGT would normally have to appear before LOAD UOP and GEMM.
So I listed what is happening in the dependence queues after each instruction.
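
To make my reading of the counters concrete, here is a minimal Python sketch that replays the dep flags from the dump and recomputes the four queue counters. The stage assignment (LOAD INP/WGT on the load stage, LOAD UOP/GEMM on the compute stage, NOP-STORE-STAGE on the store stage) and the flag-to-queue mapping are my assumptions, inferred from the dump rather than taken from the VTA source. The names EFFECTS, TRACE and replay are made up for illustration.

```python
# Assumed mapping from (stage, dep flag) to (queue, change in token count).
# "prev"/"next" are relative to the pipeline order load -> compute -> store.
EFFECTS = {
    "load": {"pop_next": ("g2l", -1), "push_next": ("l2g", +1)},
    "compute": {
        "pop_prev": ("l2g", -1), "push_prev": ("g2l", +1),
        "pop_next": ("s2g", -1), "push_next": ("g2s", +1),
    },
    "store": {"pop_prev": ("g2s", -1), "push_prev": ("s2g", +1)},
}

# (stage, pop_prev, pop_next, push_prev, push_next) for instructions 0..9.
TRACE = [
    ("store",   0, 0, 1, 0),  # 0: NOP-STORE-STAGE
    ("store",   0, 0, 1, 0),  # 1: NOP-STORE-STAGE
    ("compute", 0, 1, 0, 0),  # 2: LOAD UOP
    ("compute", 0, 0, 1, 0),  # 3: GEMM
    ("compute", 0, 1, 0, 0),  # 4: LOAD UOP
    ("compute", 0, 0, 1, 0),  # 5: GEMM
    ("load",    0, 1, 0, 0),  # 6: LOAD INP
    ("load",    0, 0, 0, 1),  # 7: LOAD WGT
    ("load",    0, 1, 0, 0),  # 8: LOAD INP
    ("load",    0, 0, 0, 1),  # 9: LOAD WGT
]


def replay(trace):
    """Return (l2g, g2l, s2g, g2s) after each instruction, in program order."""
    queues = {"l2g": 0, "g2l": 0, "s2g": 0, "g2s": 0}
    history = []
    for stage, pop_prev, pop_next, push_prev, push_next in trace:
        flags = {
            "pop_prev": pop_prev, "pop_next": pop_next,
            "push_prev": push_prev, "push_next": push_next,
        }
        for name, is_set in flags.items():
            if is_set:
                queue, delta = EFFECTS[stage][name]
                queues[queue] += delta
        history.append(
            (queues["l2g"], queues["g2l"], queues["s2g"], queues["g2s"])
        )
    return history


if __name__ == "__main__":
    for index, state in enumerate(replay(TRACE)):
        print(index, dict(zip(("l2g", "g2l", "s2g", "g2s"), state)))
```

With this mapping the replayed counters match every l2g/g2l/s2g/g2s line in the dump, so the bookkeeping itself seems consistent. What the sketch does not show is how the three stages overlap in time, which is the part I am asking about.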
However, even after looking at these queues, I'm not sure how task-level parallelism is achieved.
Can anyone tell me how this mechanism works?
Thank you for your help.