Modifying VTA instruction


#1

Dear developers, I am thinking of changing the VTA instruction set a little for my experiment. Currently, the convolution operator is translated into a sequence of matrix multiplications using the GEMM instruction.


Depending on the order of computation (that is, how the computation axes are reordered), the bit fields of the VTA GEMM instruction are generated differently.

For example, consider the convolution of

  • input : (batch, channel, height, width) = (1,128,56,56)
  • kernel : (number of kernels, channel, height, width) = (64,128,3,3)

The following shows part of the “loop structure” expression (returned from some VTA custom IR pass function) that defines the above convolution.

...
for (j, 0, 56) {         ----> iter_out = 56
  for (d_j, 0, 3) {      ----> iter_in = 3
    for (d_i, 0, 3) {
      for (c_o, 0, 2) {  ----> uop_end = uop_begin + (3*2*8) = uop_begin + 48
        for (i, 0, 8) {
          // attr [[buffer(local.inp_buffer, 0x560a73fa6e50), Tensor(shape=[1, 8, 58, 58, 1, 16], op.name=pad_data)]] buffer_bind_scope = tvm_tuple(0, 1, k_o.outer, 1, ((i + (ax2.outer*8)) + d_i), 1, (j + d_j), 1, 0, 1, 0, 16)
          // attr [[buffer(local.wgt_buffer, 0x560a73fa6ab0), Tensor(shape=[4, 8, 3, 3, 16, 16], op.name=input1.local.wgt_buffer)]] buffer_bind_scope = tvm_tuple((c_o + (cthread*2)), 1, k_o.outer, 1, d_i, 1, d_j, 1, 0, 16, 0, 16)
          // attr [[buffer(local.acc_buffer, 0x560a73f5d560), Tensor(shape=[1, 4, 56, 56, 1, 16], op.name=res)]] buffer_bind_scope = tvm_tuple(0, 1, (c_o + (cthread*2)), 1, (i + (ax2.outer*8)), 1, j, 1, 0, 1, 0, 16)
          // attr [iter_var(vta, , vta)] coproc_scope = 2
          // attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushGEMMOp"
          VTAUopPush(0, 0, tvm_access_ptr(type_annotation(), local.acc_buffer, local.acc_buffer_elem_offset, 16, 3), tvm_access_ptr(type_annotation(), local.inp_buffer, local.inp_buffer_elem_offset, 16, 1), tvm_access_ptr(type_annotation(), local.wgt_buffer, local.wgt_buffer_elem_offset, 256, 1), 0, 0, 0)
        }
      }
    }
  }
}
...

Here, the bit fields inside the VTA GEMM instructions are:

  • iter_out = 56
  • iter_in = 3
  • uop_begin = 0
  • uop_end = 3x2x8 = 48

The problem here is that we lose information about the loop structure of the computation: the trip counts of the 3 inner loops (3, 2, and 8) get folded into a single number, 48.
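To make the information loss concrete, here is a minimal sketch, assuming each iteration of the inner loops becomes one micro-op. The helper name flatten_inner_loops is hypothetical and is not taken from the VTA code base; it only models the counting described above.

```python
from itertools import product


def flatten_inner_loops(extents, uop_begin=0):
    """Hypothetical model: unroll a loop nest into a flat micro-op range."""
    # One micro-op per point in the iteration space of the inner loops.
    uops = list(product(*(range(e) for e in extents)))
    uop_end = uop_begin + len(uops)
    return uop_begin, uop_end, uops


# Inner loops from the example: d_i (3), c_o (2), i (8).
uop_begin, uop_end, uops = flatten_inner_loops((3, 2, 8))
print(uop_begin, uop_end)  # 0 48

# A different loop nest with the same product gives the same range,
# so the extents cannot be recovered from (uop_begin, uop_end) alone.
assert flatten_inner_loops((6, 8))[:2] == (uop_begin, uop_end)
```

In this model, (3, 2, 8) and (6, 8) are indistinguishable once flattened, which is exactly the information I would like to keep.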

Question
So I would like to preserve this information in the VTA GEMM instruction that gets generated. For example, instead of having uop_begin = 0 and uop_end = 48, we might have uop_x = 3, uop_y = 2, and uop_z = 8.
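As a rough illustration of what I have in mind, the sketch below compares the two sets of fields. Both classes are hypothetical: they do not reflect the actual bit layout or field widths of the VTA GEMM instruction, and uop_x, uop_y, and uop_z are only the names proposed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmFieldsFlat:
    """Fields as listed above (illustrative only, no bit widths)."""
    iter_out: int
    iter_in: int
    uop_begin: int
    uop_end: int


@dataclass(frozen=True)
class GemmFieldsNested:
    """Proposed variant that keeps the three inner trip counts separate."""
    iter_out: int
    iter_in: int
    uop_x: int
    uop_y: int
    uop_z: int

    def to_flat(self, uop_begin=0):
        # The flat form is still derivable: the micro-op count is the product.
        count = self.uop_x * self.uop_y * self.uop_z
        return GemmFieldsFlat(self.iter_out, self.iter_in,
                              uop_begin, uop_begin + count)


nested = GemmFieldsNested(iter_out=56, iter_in=3, uop_x=3, uop_y=2, uop_z=8)
flat = nested.to_flat()
print(flat.uop_begin, flat.uop_end)  # 0 48
```

The nested form can always be converted to the flat one, but not the other way around, which is why the change would have to happen where the instruction fields are generated.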

I guess the current implementation has some compilation stage that detects the above loop structure (which defines the computation of the conv2d operator). It then multiplies the trip counts of the 3 inner loops and therefore ends up with uop_end = 3x2x8 = 48. I cannot pinpoint this step in the code repository (runtime.cc, vta/ir_pass.py, etc.?). Could you point me to the correct files or the correct steps in order to achieve this?


#2

Hi @ignite, the code here may answer the questions you had about the uop loop bound derivation:

Hope this helps.

Thierry