How to read TVM code?

ricann · November 21, 2018, 9:47am

I have been studying TVM for about two months and am still very confused by the details:

The data structures are very complicated
Python and C++ are used together, which is also hard to follow

Take Simple Matrix Multiply as an example: I don’t know how the VTA code is generated from TVM.

I would really appreciate some help …

thierry · November 22, 2018, 7:52am

The simple matrix multiply example should provide a walkthrough of applying schedule transformations to a TVM schedule.

One easy way to see how the schedule is transformed is to print the lowered code between each step along the way: print(tvm.lower(s, [A, B, C], simple_mode=True))

This should give you a good understanding of what’s happening internally between TVM schedule transformations.

Once you have the lowered code (which calls into the VTA runtime), it helps to understand the VTA design itself in order to see how these runtime API calls program the accelerator.

ricann · November 22, 2018, 9:09am

I have printed the execution procedure, even VTA’s instructions:

INSTRUCTION 0: LOAD ACC
	dep - pop prev: 0, pop next: 0, push prev: 0, push next: 0
	DRAM: 0x00000040, SRAM:0x0000
	y: size=1, pad=[0, 0]
	x: size=64, stride=64, pad=[0, 0]
	l2g_queue = 0, g2l_queue = 0
	s2g_queue = 0, g2s_queue = 0

I don’t know how a computation description is transformed into VTA code. Can you give me some clues?

thierry · November 22, 2018, 6:59pm

I see that you can look into the VTA instructions, which means you’ve looked into the design quite a bit.

I assume you’ve read the technical references, and the tech report on VTA as well?

The key to understanding how TVM produces VTA code is to start from the TVM schedule (see the matrix multiplication example that you linked) and work out how you end up with a lowered schedule like this one:

// attr [C_buf] storage_scope = "local.acc_buffer"
// attr [A_buf] storage_scope = "local.inp_buffer"
// attr [B_buf] storage_scope = "local.wgt_buffer"
produce C_buf {
  // attr [iter_var(vta, , vta)] coproc_scope = 2
  // attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushGEMMOp"
  VTAUopLoopBegin(16, 1, 0, 0)
  VTAUopPush(0, 1, 0, 0, 0, 0, 0, 0)
  VTAUopLoopEnd()
  vta.coproc_dep_push(2, 1)
  for (ko, 0, 16) {
    // attr [iter_var(vta, , vta)] coproc_scope = 1
    vta.coproc_dep_pop(2, 1)
    produce A_buf {
      VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), A, ko, 1, 1, 1, 0, 0, 0, 0, 0, 2)
    }
    produce B_buf {
      VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), B, ko, 1, 16, 16, 0, 0, 0, 0, 0, 1)
    }
    vta.coproc_dep_push(1, 2)
    // attr [iter_var(vta, , vta)] coproc_scope = 2
    vta.coproc_dep_pop(1, 2)
    // attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushGEMMOp"
    VTAUopLoopBegin(16, 1, 0, 1)
    VTAUopPush(0, 0, 0, 0, 0, 0, 0, 0)
    VTAUopLoopEnd()
    vta.coproc_dep_push(2, 1)
  }
  vta.coproc_dep_push(2, 3)
  vta.coproc_dep_pop(2, 1)
}
// attr [iter_var(vta, , vta)] coproc_scope = 3
vta.coproc_dep_pop(2, 3)
produce C {
  VTAStoreBuffer2D(tvm_thread_context(VTATLSCommandHandle()), 0, 4, C, 0, 16, 1, 16)
}
vta.coproc_sync()
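
One way to read the vta.coproc_dep_push / vta.coproc_dep_pop calls in the listing above is as tokens passed between the three stages. The following is a minimal Python sketch, not VTA code: it assumes the arguments mean (from_stage, to_stage) and that coproc_scope 1, 2 and 3 are the load, compute and store stages, which is how the listing reads. It replays the calls in program order and checks that no stage ever pops a token that was not pushed first. The real hardware runs the stages concurrently; this only checks the bookkeeping.

```python
from collections import defaultdict

# Assumption: stage ids match the coproc_scope values in the lowered schedule.
LOAD, COMPUTE, STORE = 1, 2, 3


def replay_dependency_tokens(num_iters=16):
    """Replay the push/pop calls of the lowered schedule in program order."""
    tokens = defaultdict(int)  # (from_stage, to_stage) -> tokens currently queued

    def push(src, dst):
        tokens[(src, dst)] += 1

    def pop(src, dst):
        # A pop without a pending token would stall the consuming stage forever.
        if tokens[(src, dst)] == 0:
            raise RuntimeError("pop(%d, %d) has no matching push" % (src, dst))
        tokens[(src, dst)] -= 1

    push(COMPUTE, LOAD)          # after the accumulator reset micro-op
    for _ in range(num_iters):   # for (ko, 0, 16)
        pop(COMPUTE, LOAD)       # load waits until compute is done with the buffers
        push(LOAD, COMPUTE)      # A_buf and B_buf are loaded
        pop(LOAD, COMPUTE)       # compute waits for the loads
        push(COMPUTE, LOAD)      # GEMM done, buffers may be overwritten
    push(COMPUTE, STORE)         # results are ready to be stored
    pop(COMPUTE, LOAD)           # drain the last compute-to-load token
    pop(COMPUTE, STORE)          # store waits for compute
    return dict(tokens)


if __name__ == "__main__":
    print(replay_dependency_tokens())
```

Every queue ends up empty, which is why the extra vta.coproc_dep_pop(2, 1) after the loop is needed: the last loop iteration pushes one compute-to-load token that no further load consumes.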

thierry · November 22, 2018, 7:02pm

The next step is to understand what exactly these calls do: for instance what does VTALoadBuffer2D() do to produce VTA instructions?

This is specified inside of the VTA runtime: https://github.com/dmlc/tvm/blob/master/vta/src/runtime.cc

The runtime does JIT compilation. In other words, it runs on the ARM core of the Zynq SoC and produces the VTA binary that you printed out. Understanding how the runtime does JIT compilation is key to understanding the gap between the lowered TVM code and the code that is executed on VTA.

ricann · November 23, 2018, 8:15am

Once execution is inside VTALoadBuffer2D, I can trace every executed line in the source code:

1. first:
void VTALoadBuffer2D(...) {
  static_cast<vta::CommandQueue*>(cmd)->LoadBuffer2D(...);
}

2. and then trace LoadBuffer2D:
  void LoadBuffer2D(void* src_dram_addr,
                    uint32_t src_elem_offset,
                    uint32_t x_size,
                    uint32_t y_size,
                    uint32_t x_stride,
                    uint32_t x_pad_before,
                    uint32_t y_pad_before,
                    uint32_t x_pad_after,
                    uint32_t y_pad_after,
                    uint32_t dst_sram_index,
                    uint32_t dst_memory_type) {
    VTAMemInsn* insn = insn_queue_.CreateMemInsn(dst_memory_type);
    insn->opcode = VTA_OPCODE_LOAD;
    insn->memory_type = dst_memory_type;
    insn->sram_base = dst_sram_index;
    DataBuffer* src = DataBuffer::FromHandle(src_dram_addr);
    insn->dram_base = src->phy_addr() / GetElemBytes(dst_memory_type) + src_elem_offset;
    insn->y_size = y_size;
    insn->x_size = x_size;
    insn->x_stride = x_stride;
    insn->y_pad_0 = y_pad_before;
    insn->y_pad_1 = y_pad_after;
    insn->x_pad_0 = x_pad_before;
    insn->x_pad_1 = x_pad_after;
    this->CheckInsnOverFlow();
  }
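
To make the field mapping above easier to follow, here is a small Python model of the same function. It is only a sketch: MemInsn and load_buffer_2d are hypothetical stand-ins for the C++ types, the physical address is passed in directly instead of going through DataBuffer::FromHandle, and the element sizes in ELEM_BYTES are assumed values that in reality depend on the VTA hardware configuration. The main point is that dram_base is expressed in units of buffer elements, not bytes.

```python
from dataclasses import dataclass

# Assumed element sizes in bytes for each on-chip buffer (hypothetical values;
# the real ones come from the VTA hardware configuration).
ELEM_BYTES = {"inp": 16, "wgt": 256, "acc": 64}


@dataclass
class MemInsn:
    """Hypothetical Python stand-in for the fields set on VTAMemInsn."""
    opcode: str
    memory_type: str
    sram_base: int
    dram_base: int
    y_size: int
    x_size: int
    x_stride: int
    y_pad_0: int
    y_pad_1: int
    x_pad_0: int
    x_pad_1: int


def load_buffer_2d(phy_addr, src_elem_offset, x_size, y_size, x_stride,
                   x_pad_before, y_pad_before, x_pad_after, y_pad_after,
                   dst_sram_index, dst_memory_type):
    """Build a LOAD instruction the same way the C++ listing fills its fields."""
    elem_bytes = ELEM_BYTES[dst_memory_type]
    return MemInsn(
        opcode="LOAD",
        memory_type=dst_memory_type,
        sram_base=dst_sram_index,
        # Byte address converted to an element index, plus the element offset.
        dram_base=phy_addr // elem_bytes + src_elem_offset,
        y_size=y_size,
        x_size=x_size,
        x_stride=x_stride,
        y_pad_0=y_pad_before,
        y_pad_1=y_pad_after,
        x_pad_0=x_pad_before,
        x_pad_1=x_pad_after,
    )


if __name__ == "__main__":
    # Made-up address and offset, only to show the arithmetic.
    print(load_buffer_2d(0x2000, 3, 1, 1, 1, 0, 0, 0, 0, 0, "inp"))
```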

and then ...blah blah

But my question is: I don’t know how the code above the VTA runtime ends up calling VTALoadBuffer2D.

thierry · November 23, 2018, 8:39am

You’ll have to dig into the TVM IR passes written for VTA, such as inject_dma_intrin, which pattern-matches a 2D strided access pattern (think of it as a matrix tile load) and inserts a DMA load call: https://github.com/dmlc/tvm/blob/1022ad7c204127ca5581505f5888929c6116790f/vta/python/vta/ir_pass.py#L313

Specifically look at the _inject_copy() helper function implementation.

There is a call to TVM IR builder that inserts the call in the lowered schedule:

            irb.emit(tvm.call_extern(
                "int32", "VTALoadBuffer2D",
                env.dev.command_handle,
                src.data, offset, x_size, y_size, x_stride,
                x_pad_before, y_pad_before,
                x_pad_after, y_pad_after,
                dst.access_ptr("r", "int32"), mem_type))

See in the IR pass implementation how the arguments to the runtime API call get derived.
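
As a rough intuition for what "a 2D strided access pattern" means here, the sketch below describes a tile of a row-major matrix with the same kind of parameters that the call takes (offset, x_size, y_size, x_stride). It is a simplified illustration and not the logic of _inject_copy: the function names are hypothetical, padding is ignored, and everything is counted in plain elements, whereas the real pass works on TVM buffer shapes and strides.

```python
def describe_tile_load(row_start, col_start, rows, cols, row_pitch):
    """Describe a rows x cols tile of a row-major matrix as a 2D strided load.

    row_pitch is the number of elements between the starts of two
    consecutive matrix rows.
    """
    return {
        "offset": row_start * row_pitch + col_start,  # first element of the tile
        "x_size": cols,         # contiguous elements per tile row
        "y_size": rows,         # number of tile rows
        "x_stride": row_pitch,  # element distance between tile rows
    }


def tile_addresses(desc):
    """Expand the descriptor into the element indices a DMA engine would read."""
    return [
        desc["offset"] + y * desc["x_stride"] + x
        for y in range(desc["y_size"])
        for x in range(desc["x_size"])
    ]


if __name__ == "__main__":
    # A 3 x 8 tile starting at row 2, column 4 of a matrix with 64 columns.
    desc = describe_tile_load(row_start=2, col_start=4, rows=3, cols=8, row_pitch=64)
    print(desc)
    print(tile_addresses(desc)[:8])
```

The IR pass has to recognise that a loop nest touches exactly such a set of addresses before it can replace the loops with a single runtime call.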

Hope these pointers will help.

ricann · November 23, 2018, 8:55am

Yes, that’s the key point! But I don’t understand the related Python code yet.

Next I will study the higher-level Python code and post the specific points I get stuck on to ask for your help. Thank you very much ~

renxuanle · July 29, 2019, 3:57pm

Hi Ricann, I am also confused about how the lowered TVM schedule is transformed into VTA code. Have you figured it out? BTW, where did you find these VTA instructions? Many thanks!

thierry · July 30, 2019, 1:05am

@renxuanle the lowered TVM schedule will call into the VTA JIT runtime API https://github.com/dmlc/tvm/blob/master/vta/include/vta/runtime.h

The runtime then assembles instructions that will program VTA.

superfu · September 25, 2019, 9:31am

Hi Thierry, I’ve been learning VTA for a few days, and I have some questions I’d like to ask you.

1. Will the lowered schedule be compiled via LLVM before being deployed on VTA, or does it call the VTA runtime directly?
2. Can VTA currently do non-linear calculations such as sigmoid?
Thanks!

thierry · October 1, 2019, 9:43pm

@superfu

1. The lowered schedule is compiled through LLVM, and the compiled code contains calls into the VTA JIT runtime. It is executed on the Pynq’s ARM CPU and calls the VTA runtime, which offloads the work onto the FPGA.
2. Right now, no, but we can envision some architectural/ISA extensions that could include sigmoid function calls.