Dear @thierry,
Thanks to your detailed replies, I now understand much better what is going on in VTA.
I’m trying to measure how much overlap VTA actually achieves between its tasks.
In doing so, I have a question about the ALU in TLPP.
I have observed how task parallelism between LD INP, LD WGT and GEMM is achieved using virtual threads.
It is also possible to measure the overlap between the loads and GEMM by using the dependence queues.
For instance, GEMM reads the l2g_valid signal twice and writes the g2l_valid signal twice. (Correct me if I’m wrong.)
However, the ALU behavior seems quite different, which confuses me.
The following is part of the debug output for quantized ResNet-18:
INSTRUCTION 31: GEMM
dep - pop prev: 1, pop next: 0, push prev: 1, push next: 0
reset_out: 0
range (64, 112)
outer loop - iter: 56, wgt: 0, inp: 1, acc: 1
inner loop - iter: 3, wgt: 1, inp: 1, acc: 0
l2g_queue = 0, g2l_queue = 2
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 32: NOP-MEMORY-STAGE
dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 33: LOAD ACC
dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
DRAM: 0x017fb3c0, SRAM:0x0380
y: size=1, pad=[0, 0]
x: size=2, stride=2, pad=[0, 0]
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 34: LOAD UOP
dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
DRAM: 0x18600070, SRAM:0x0070
y: size=1, pad=[0, 0]
x: size=1, stride=1, pad=[0, 0]
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
INSTRUCTION 35: ALU - add
dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
reset_out: 0
range (112, 113)
outer loop - iter: 2, dst: 448, src: 1
inner loop - iter: 448, dst: 1, src: 0
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 0
...
INSTRUCTION 58: ALU - max imm
dep - pop prev: 0, pop next: 0, push prev: 0, push next: 1
reset_out: 0
range (123, 124)
outer loop - iter: 1, dst: 0, src: 0
inner loop - iter: 896, dst: 1, src: 1
l2g_queue = 0, g2l_queue = 1
s2g_queue = 0, g2s_queue = 2
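To make my reading of the queue counters in this dump concrete, here is a minimal Python sketch of how I think the four counters are updated from the dep flags. The function `step`, the stage names, and the starting state before instruction 31 are my own assumptions for illustration, not taken from runtime.cc; in particular I assume NOP-MEMORY-STAGE belongs to the load stage and that LOAD ACC, LOAD UOP and ALU run in the compute stage.

```python
# Sketch (assumed model): how the dep flags move tokens between the
# load, compute and store stages. Counter names follow the dump.
def step(queues, stage, pop_prev, pop_next, push_prev, push_next):
    """Return the counters after one instruction of the given stage."""
    q = dict(queues)
    if stage == "load":
        # Load only has a "next" neighbour, which is compute.
        q["g2l_queue"] -= pop_next
        q["l2g_queue"] += push_next
    elif stage == "compute":
        # Compute sits between load ("prev") and store ("next").
        q["l2g_queue"] -= pop_prev
        q["s2g_queue"] -= pop_next
        q["g2l_queue"] += push_prev
        q["g2s_queue"] += push_next
    elif stage == "store":
        # Store only has a "prev" neighbour, which is compute.
        q["g2s_queue"] -= pop_prev
        q["s2g_queue"] += push_prev
    else:
        raise ValueError(f"unknown stage: {stage}")
    return q


# Assumed state right before instruction 31, chosen to match the dump.
state = {"l2g_queue": 1, "g2l_queue": 1, "s2g_queue": 0, "g2s_queue": 0}

# (index, stage, pop prev, pop next, push prev, push next) from the dump.
trace = [
    (31, "compute", 1, 0, 1, 0),  # GEMM
    (32, "load",    0, 1, 0, 0),  # NOP-MEMORY-STAGE (assumed load stage)
    (33, "compute", 0, 0, 0, 0),  # LOAD ACC
    (34, "compute", 0, 0, 0, 0),  # LOAD UOP
    (35, "compute", 0, 0, 0, 0),  # ALU - add
]

history = {}
for idx, stage, *flags in trace:
    state = step(state, stage, *flags)
    history[idx] = state
    print(idx, state)
```

Under these assumptions the sketch reproduces the l2g_queue/g2l_queue values printed for instructions 31 to 35, and instructions 33 to 35 touch no queue at all, which is exactly why I cannot see where the ALU work starts.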
First, I can’t really understand NOP-MEMORY-STAGE.
Guessing from runtime.cc, it seems to be a no-operation inserted to handle pending pops, but I have no clue what actually happens.
Second, I guess LOAD ACC loads the bias values into accumulator memory, and the ALU, driven by the UOP, adds the loaded bias to the accumulated results and stores the sum back to accumulator memory.
I can detect the end of the ALU operation from the g2s valid signal, which passes the dependence on to STORE, but I’m not sure how to detect the start of the operation.
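To spell out my guess about the bias add, here is a small Python sketch of how I read the loop parameters of instruction 35 (outer loop: iter 2, dst 448, src 1; inner loop: iter 448, dst 1, src 0). The function `alu_add`, the micro-op contents (dst base 0, src base 0x0380) and the scalar values are assumptions for illustration only; the real accumulator entries are vectors, and I have not checked the actual micro-op.

```python
# Sketch (assumed semantics): ALU add with two nested loops whose
# factors scale the dst/src accumulator indices given by each micro-op.
def alu_add(acc, uops, outer, inner):
    """outer/inner are (iterations, dst_factor, src_factor) tuples."""
    o_iter, o_dst, o_src = outer
    i_iter, i_dst, i_src = inner
    for o in range(o_iter):
        for i in range(i_iter):
            for u_dst, u_src in uops:
                dst = u_dst + o * o_dst + i * i_dst
                src = u_src + o * o_src + i * i_src
                acc[dst] += acc[src]
    return acc


ACC_BIAS_BASE = 0x0380  # SRAM address printed by LOAD ACC (instruction 33)

# Stand-in accumulator: 896 GEMM results (all 1) followed by 2 bias entries.
acc = [1] * ACC_BIAS_BASE + [10, 20]

# Assumed single micro-op: dst base 0, src base at the loaded bias.
uops = [(0, ACC_BIAS_BASE)]

alu_add(acc, uops, outer=(2, 448, 1), inner=(448, 1, 0))
print(acc[0], acc[447], acc[448], acc[895])
```

If this reading is right, each of the two loaded entries is broadcast over 448 consecutive accumulator entries (src factor 0 in the inner loop), which is what I would expect from a per-channel bias add.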
Thank you for your help.
- jwlee.