Modifying VTA instruction

ignite · April 30, 2019, 6:12pm

Dear developers, I am thinking of changing the VTA instruction set a little for my experiment. Currently, the convolution operator is translated into a sequence of matrix multiplications using the GEMM instruction.

Depending on the order of computation (that is, how the computation axes are reordered), the bit fields of the VTA GEMM instruction are generated differently.

For example, consider the convolution of

input : (batch, channel, height, width) = (1,128,56,56)
kernel : (number of kernels, channel, height, width) = (64,128,3,3)

The following shows part of the “loop structure” expression (returned from some VTA custom IR pass function) that defines the above convolution.

...
for (j, 0, 56) {         ----> iter_out = 56
  for (d_j, 0, 3) {      ----> iter_in = 3
    for (d_i, 0, 3) {
      for (c_o, 0, 2) {  ----> uop_end = uop_begin + (3*2*8) = uop_begin + 48
        for (i, 0, 8) {
          // attr [[buffer(local.inp_buffer, 0x560a73fa6e50), Tensor(shape=[1, 8, 58, 58, 1, 16], op.name=pad_data)]] buffer_bind_scope = tvm_tuple(0, 1, k_o.outer, 1, ((i + (ax2.outer*8)) + d_i), 1, (j + d_j), 1, 0, 1, 0, 16)
          // attr [[buffer(local.wgt_buffer, 0x560a73fa6ab0), Tensor(shape=[4, 8, 3, 3, 16, 16], op.name=input1.local.wgt_buffer)]] buffer_bind_scope = tvm_tuple((c_o + (cthread*2)), 1, k_o.outer, 1, d_i, 1, d_j, 1, 0, 16, 0, 16)
          // attr [[buffer(local.acc_buffer, 0x560a73f5d560), Tensor(shape=[1, 4, 56, 56, 1, 16], op.name=res)]] buffer_bind_scope = tvm_tuple(0, 1, (c_o + (cthread*2)), 1, (i + (ax2.outer*8)), 1, j, 1, 0, 1, 0, 16)
          // attr [iter_var(vta, , vta)] coproc_scope = 2
          // attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushGEMMOp"
          VTAUopPush(0, 0, tvm_access_ptr(type_annotation(), local.acc_buffer, local.acc_buffer_elem_offset, 16, 3), tvm_access_ptr(type_annotation(), local.inp_buffer, local.inp_buffer_elem_offset, 16, 1), tvm_access_ptr(type_annotation(), local.wgt_buffer, local.wgt_buffer_elem_offset, 256, 1), 0, 0, 0)
        }
      }
    }
  }
}
...

Here, the bit fields inside the VTA GEMM instructions are:

iter_out = 56
iter_in = 3
uop_begin = 0
uop_end = 3x2x8 = 48

The problem here is that we lose information about the loop structure: the iteration counts of the three inner loops (3, 2, and 8) get collapsed into the single number 48.
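To make the loss concrete, here is a minimal Python sketch. It is a toy model, not VTA code: `record_uops` is a hypothetical helper, and it assumes one micro-op is recorded per innermost iteration, which is how the 3*2*8 = 48 annotation in the dump reads.

```python
from itertools import product

# Extents of the three inner loops from the dump above: d_i, c_o, i.
INNER_EXTENTS = (3, 2, 8)


def record_uops(extents):
    """Toy stand-in for micro-op recording (hypothetical helper).

    Appends one entry per innermost iteration. A real uop would hold
    acc/inp/wgt buffer indices; here we keep only the loop coordinates
    so it is visible what the flat sequence forgets.
    """
    uops = []
    for coords in product(*(range(n) for n in extents)):
        uops.append(coords)
    return uops


uops = record_uops(INNER_EXTENTS)
uop_begin = 0
uop_end = uop_begin + len(uops)
print(uop_begin, uop_end)  # 0 48

# A differently shaped nest produces the same (uop_begin, uop_end) pair,
# so the pair alone cannot tell the two loop structures apart.
assert len(record_uops((4, 3, 4))) == len(uops)
```

Only the length of the recorded sequence reaches the instruction fields in this model, which is why the extents (3, 2, 8) and (4, 3, 4) are indistinguishable afterwards.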

Question
So I would like to preserve this information in the VTA GEMM instruction that gets generated. For example, instead of having uop_begin = 0 and uop_end = 48, we might have uop_x = 3, uop_y = 2, and uop_z = 8.
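As a sketch of what such fields would buy, the Python snippet below models the proposed triple. Everything in it is hypothetical: `UopExtents`, `total`, and `unflatten` are illustrative names, not part of VTA, and the row-major ordering of the three loops is an assumption taken from the nesting order in the dump.

```python
from typing import NamedTuple, Tuple


class UopExtents(NamedTuple):
    """Hypothetical replacement for the (uop_begin, uop_end) pair."""

    uop_x: int  # extent of the d_i loop in the example
    uop_y: int  # extent of the c_o loop
    uop_z: int  # extent of the i loop (innermost)

    def total(self) -> int:
        """Number of micro-ops, i.e. what uop_end - uop_begin holds today."""
        return self.uop_x * self.uop_y * self.uop_z

    def unflatten(self, flat: int) -> Tuple[int, int, int]:
        """Recover (x, y, z) loop coordinates from a flat uop index."""
        if not 0 <= flat < self.total():
            raise IndexError(f"uop index {flat} out of range")
        xy, z = divmod(flat, self.uop_z)
        x, y = divmod(xy, self.uop_y)
        return x, y, z


extents = UopExtents(uop_x=3, uop_y=2, uop_z=8)
print(extents.total())        # 48
print(extents.unflatten(16))  # (1, 0, 0)
```

With the three extents kept separately, the flat count is still derivable as their product, while the reverse direction (48 back to 3, 2, 8) is not possible from the flat count alone.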

I guess the current implementation has a compilation stage that detects the above loop structure (which defines the computation of the conv2d operator). It then somehow multiplies the iteration counts of the three inner loops, and therefore ends up with uop_end = 3x2x8 = 48. I cannot pinpoint this step in the code repository (runtime.cc, vta/ir_pass.py, etc.?). Could you point me to the right files or the right steps to achieve this?

thierry · May 1, 2019, 1:20am

Hi @ignite, the code here may answer the questions you had about the uop loop bound derivation:

github.com/dmlc/tvm/blob/master/vta/python/vta/ir_pass.py#L42


    stmt : Stmt
        The AttrStmt

    key : str
        The pragma key
    """
    return ((stmt.attr_key == "pragma_" + key) or
            (stmt.attr_key == "pragma_scope" and stmt.value.value == key))


def fold_uop_loop(stmt_in):
    """Detect and fold uop loop.

    VTA support uop programming model
    that recognizes loop structure.
    This pass detect the loop structure
    and extract that into uop loop AST.

    Parameters
    ----------
    stmt_in : Stmt

Hope this helps.

Thierry
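For readers following along, the relationship between the loop nest and the instruction fields discussed in this thread can be sketched in plain Python. This is a toy model of the loop counts only, not the actual `fold_uop_loop` pass, which works on TVM IR statements. The helper `fold_loop_counts` is hypothetical, and it assumes that the two outermost loops become iter_out and iter_in while the remaining loops are expanded into micro-ops, as the annotations in the dump suggest.

```python
from math import prod

# Loop nest from the dump in the question, outermost first: (name, extent).
LOOP_NEST = [("j", 56), ("d_j", 3), ("d_i", 3), ("c_o", 2), ("i", 8)]

# Assumption: two loop levels are kept as hardware loops (iter_out, iter_in).
HW_LEVELS = 2


def fold_loop_counts(loops):
    """Split a loop nest into hardware loop counts and a flat uop range.

    Hypothetical helper: it looks only at extents and ignores the buffer
    access patterns that a real folding pass would have to analyze.
    """
    outer = [extent for _, extent in loops[:HW_LEVELS]]
    outer += [1] * (HW_LEVELS - len(outer))  # pad when the nest is shallow
    inner = [extent for _, extent in loops[HW_LEVELS:]]
    return {
        "iter_out": outer[0],
        "iter_in": outer[1],
        "uop_begin": 0,
        "uop_end": prod(inner),          # inner extents collapse here
        "inner_extents": tuple(inner),   # the information the question wants kept
    }


fields = fold_loop_counts(LOOP_NEST)
print(fields["iter_out"], fields["iter_in"], fields["uop_end"])  # 56 3 48
```

In this model, the `inner_extents` entry is exactly what is dropped when only `uop_end` is written into the instruction.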