[VTA] Question about ALU in quantized ResNet18

Dear @thierry,

Hello Thierry, I’ve come to understand a lot about what’s going on in VTA thanks to your detailed replies.

I’m trying to measure how much overlapping is done in VTA.

In doing so, I have a question about the ALU in TLPP (task-level pipeline parallelism).

I observed how task parallelism is achieved between LD INP, LD WGT, and GEMM using virtual threads.

It is also possible to measure the overlap between the two using the dependence queues.

For instance, GEMM will read the l2g_valid signal twice and write the g2l_valid signal twice. (Correct me if I’m wrong.)

However, the ALU’s behavior seems quite different, which confuses me.

The following is part of the debug information of quantized ResNet-18.

INSTRUCTION 31: GEMM
	dep - pop prev: 1, pop next: 0, push prev: 1, push next: 0
	reset_out: 0
	range (64, 112)
	outer loop - iter: 56, wgt: 0, inp: 1, acc: 1
	inner loop - iter: 3, wgt: 1, inp: 1, acc: 0
	l2g_queue = 0, g2l_queue = 2
	s2g_queue = 0, g2s_queue = 0

INSTRUCTION 32: NOP-MEMORY-STAGE
	dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
	l2g_queue = 0, g2l_queue = 1
	s2g_queue = 0, g2s_queue = 0

INSTRUCTION 33: LOAD ACC
	dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
	DRAM: 0x017fb3c0, SRAM:0x0380
	y: size=1, pad=[0, 0]
	x: size=2, stride=2, pad=[0, 0]
	l2g_queue = 0, g2l_queue = 1
	s2g_queue = 0, g2s_queue = 0

INSTRUCTION 34: LOAD UOP
	dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
	DRAM: 0x18600070, SRAM:0x0070
	y: size=1, pad=[0, 0]
	x: size=1, stride=1, pad=[0, 0]
	l2g_queue = 0, g2l_queue = 1
	s2g_queue = 0, g2s_queue = 0

INSTRUCTION 35: ALU - add
	dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
	reset_out: 0
	range (112, 113)
	outer loop - iter: 2, dst: 448, src: 1
	inner loop - iter: 448, dst: 1, src: 0
	l2g_queue = 0, g2l_queue = 1
	s2g_queue = 0, g2s_queue = 0

...

INSTRUCTION 58: ALU - max imm
	dep - pop prev: 0, pop next: 0, push prev: 0, push next: 1
	reset_out: 0
	range (123, 124)
	outer loop - iter: 1, dst: 0, src: 0
	inner loop - iter: 896, dst: 1, src: 1
	l2g_queue = 0, g2l_queue = 1
	s2g_queue = 0, g2s_queue = 2

First, I can’t really understand NOP-MEMORY-STAGE.

Judging from runtime.cc, it seems to be a no-operation that handles pending pops, but I have no clue what is actually happening.

Second, I guess LOAD ACC loads the bias information into accumulator memory, and then, using the UOP, the loaded ACC values are added to and stored back into accumulator memory.
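In code, my guess looks roughly like this (tiny hypothetical 2x2 tiles, plain Python just to check my own understanding, not anything from the VTA runtime):

```python
# Sketch of my guess: LOAD ACC preloads the accumulator with the bias,
# then GEMM accumulates the matrix product on top of it, so the result
# in accumulator memory already includes the bias.
inp = [[1, 2], [3, 4]]          # 2x2 input tile (hypothetical values)
wgt = [[5, 6], [7, 8]]          # 2x2 weight tile
bias = [[10, 10], [10, 10]]     # bias tile

acc = [row[:] for row in bias]  # LOAD ACC: bias into accumulator memory
for i in range(2):              # GEMM: acc += inp @ wgt
    for j in range(2):
        for k in range(2):
            acc[i][j] += inp[i][k] * wgt[k][j]

print(acc)  # [[29, 32], [53, 60]]
```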

I can detect the end of the ALU operation via the g2s_valid signal, which creates a dependence on STORE, but I’m not sure how to detect the start of the operation.

Thank you for your help.

  • jwlee.

Hi @jwlee,

I have worked on the related logic in VTA before, so I think I can provide some relevant information. @thierry, please correct me if my answer is wrong. Here are my comments.

Regards
Hua

Q1. “GEMM will read l2g_valid signal twice and write g2l_valid signal twice”
A. One GEMM only reads l2g once and writes g2l once. LD INP and
LD WGT run serialized on the same load core, and compute can only start
after both INP and WGT are ready, hence no second wait happens; g2l follows the same logic.
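To make this concrete, here is a small Python sketch of the token flow (queue names mirror the debug output; the functions are illustrative, not VTA runtime APIs):

```python
from collections import deque

# Illustrative model only: each dependence queue carries single-bit tokens.
l2g = deque()  # load core    -> compute core (GEMM)
g2l = deque()  # compute core -> load core

def ld_inp_and_wgt():
    # LD INP and LD WGT run serialized on the same load core;
    # only after BOTH finish does the load core push ONE token.
    l2g.append(1)          # push next (toward GEMM)

def gemm():
    l2g.popleft()          # pop prev: one wait covers the whole load pair
    # ... matrix multiply into the accumulator would happen here ...
    g2l.append(1)          # push prev: free the input/weight buffer slot

ld_inp_and_wgt()
gemm()
print(len(l2g), len(g2l))  # 0 1
```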

Q2. “can’t really understand NOP-MEMORY-STAGE”
A. The purpose of the NOP-* instructions is to make queues like l2g, g2s, etc.
empty after all instructions are done. Without these NOP* instructions — for
example, in the case where STORE is the last instruction before FINISH — the
s2g queue size would be 1 instead of 0 after all logic is done.
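A tiny sketch of that drain behavior (illustrative Python, not the runtime code):

```python
from collections import deque

s2g = deque()  # store core -> compute core dependence queue

# STORE is the last real instruction: it pushes a token toward compute.
s2g.append(1)

# Without a trailing NOP on the compute stage, that token would still be
# sitting in s2g when FINISH retires. The NOP does no work; it only pops.
def nop_compute_stage():
    s2g.popleft()  # pop next: drain the leftover token

nop_compute_stage()
print(len(s2g))  # 0 -- queue empty before FINISH
```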

Q3. “I guess LOAD ACC is loading bias informations…”
A. Yes. LOAD ACC loads the bias into the accumulator buffer, and GEMM stores
its matrix product into the same accumulator buffer, which is how the bias
gets incorporated.

Q4. “how to detect the start of the operation”
A. I guess you probably mean “why is there no pop dependency for the ALU,
and how does the ALU make sure its data is ready”. The reason is that the
ALU runs on the same IP core as GEMM and always runs after GEMM, so it is
always serialized with GEMM and needs no pop of its own — GEMM already did
that. This happens because, in the TVM/VTA upper layer, there is a fused
operation like “fuse_conv2d___rshift_scalar___clip_cast*”, hence conv2d and
the ALU ops always execute serially on the compute core.
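The fused schedule on the compute core can be sketched like this (a hypothetical, simplified instruction list; the dep fields mimic the debug output, and the ALU op names are just patterned after the fused operator name):

```python
# GEMM and the ALU ops share one compute core, so program order alone
# serializes them: only the first and last instructions of the fused
# group need dependence tokens to synchronize with the other cores.
compute_stream = [
    ("GEMM",        dict(pop_prev=1, push_prev=1)),  # waits on LD INP/WGT
    ("ALU add",     dict()),                         # bias add, no sync needed
    ("ALU shr",     dict()),                         # rshift_scalar (requantize)
    ("ALU max imm", dict(push_next=1)),              # clip; signals STORE
]

sync_insns = [name for name, deps in compute_stream if deps]
print(sync_insns)  # ['GEMM', 'ALU max imm']
```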


Wow thank you for such a detailed answer.

I didn’t know that the ALU always follows GEMM.

I’ll study based on your reply.

Regards,

jwlee

@hjiang thanks for the detailed answer to @jwlee, you are correct on all of those points! To add to what he said, one way to think about NOP-MEMORY-STAGE is that it’s simply a barrier to ensure that two concurrent threads are done performing computation.
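As a rough analogy in Python (purely illustrative, not how the hardware is implemented):

```python
import threading

# Rough analogy: NOP-MEMORY-STAGE behaves like one side of a barrier --
# it performs no memory work, it only synchronizes with the other
# virtual thread before execution moves on.
barrier = threading.Barrier(2)
finished = []

def virtual_thread(name):
    # ... the thread's real loads/computes would happen here ...
    barrier.wait()        # the NOP-like synchronization point
    finished.append(name)

t0 = threading.Thread(target=virtual_thread, args=("thread-0",))
t1 = threading.Thread(target=virtual_thread, args=("thread-1",))
t0.start(); t1.start()
t0.join(); t1.join()
print(sorted(finished))  # ['thread-0', 'thread-1']
```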


Ah-ha! Now I got the idea of NOP-MEMORY-STAGE. Thank you @thierry.

Dear @hjiang and @thierry,

I have one more question about quantized ResNet18.

I observed that VTA is invoked several times when deploying quantized ResNet18.

Each invocation starts at an instruction fetch and ends with a FINISH instruction.

I’m curious about the unit of work (the granularity?) at which VTA gets invoked each time.

Thank you,

jwlee

Hi @jwlee,

For the ResNet-18 example, every “fetch-to-FINISH” instruction stream (one VTADeviceRun) corresponds to one fused conv2d.
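Roughly like this (illustrative pseudo-driver, not the real graph runtime code; the second op name with the `_1` suffix is hypothetical, patterned after the fused operator name mentioned above):

```python
# Each fused conv2d in the compiled graph becomes one fetch-to-FINISH
# instruction stream, i.e. one VTADeviceRun invocation.
fused_ops = [
    "fuse_conv2d___rshift_scalar___clip_cast",
    "fuse_conv2d___rshift_scalar___clip_cast_1",  # hypothetical second layer
]

invocations = 0
for op in fused_ops:
    # ... VTADeviceRun issued here for this op's instruction stream ...
    invocations += 1       # one device run per fused conv2d

print(invocations)  # 2
```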

Regards

Hua