Reordering the `conv2d_stage` stage in vta_conv2d.py

ignite · May 4, 2019, 8:15am

Dear Developers,

I was looking into the schedule_packed_conv2d function in the vta_conv2d.py file and was trying to reorder the axes of computation for the VTA hardware. Basically, in order to compute 1 output value in the output of shape (C, H, W), we have 5 loops as follows:

for output_column:          // W
  for output_row:           // H
    for output_channel:     // C
      for weight_x:
        for weight_y:
          Multiply()

I would like to compute the conv2d operator in the above order, so I tried changing this line:

#                       output_column   output_channel   output_row
#                                   |               |    |
s[conv2d_stage].reorder(x_bo, k_o, x_j, d_j, d_i, x_co, x_i, x_bi, x_ci, k_i)
#                                        |    |
#                                 weight_x   weight_y

to the following:

#                       output_column   row   output_channel
#                                   |    |     |
s[conv2d_stage].reorder(x_bo, k_o, x_j, x_i, x_co, d_j, d_i, x_bi, x_ci, k_i)
#                                                   |    |
#                                            weight_x   weight_y

Question
However, changing the line above gave me a runtime error at this line. What might be the cause of the error? Is there any restriction on changing the order of computation?

Traceback (most recent call last):
  File "~/test_conv2d.py", line 236, in <module>
    tcost = timer()
  File "~/tvm/python/tvm/module.py", line 178, in evaluator
    blob = feval(*args)
  File "~/tvm/python/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "~/tvm/python/tvm/_ffi/base.py", line 71, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [17:07:54] ~/tvm/vta/src/runtime.cc:251: Check failed: seq_[i].dst_idx != dst_index

github.com

dmlc/tvm/blob/48c92376fb463114209fb0a6414e278d510ce02e/vta/python/vta/top/vta_conv2d.py#L444


    s[output].bind(v_t, tvm.thread_axis("cthread"))


# virtual threading along spatial rows
if plan.h_nthread > 1:
    _, v_t = s[output].split(x_i0, factor=plan.h_nthread)
    s[output].reorder(v_t, x_bo)
    s[output].bind(v_t, tvm.thread_axis("cthread"))


x_bo, x_co, x_i, x_j, x_bi, x_ci = s[conv2d_stage].op.axis
k_o, d_i, d_j, k_i = s[conv2d_stage].op.reduce_axis
s[conv2d_stage].reorder(x_bo, k_o, x_j, d_j, d_i, x_co, x_i, x_bi, x_ci, k_i)


if plan.ic_factor:
    k_o, _ = s[conv2d_stage].split(k_o, factor=plan.ic_factor)
    s[cdata].compute_at(s[conv2d_stage], k_o)
    s[ckernel].compute_at(s[conv2d_stage], k_o)


# Use VTA instructions
s[cdata].pragma(s[cdata].op.axis[0], load_inp)
s[ckernel].pragma(s[ckernel].op.axis[0], load_wgt)
s[conv2d_stage].tensorize(x_bi, gemm)

aca88 · May 6, 2019, 9:16am

Hi,

I am not 100% sure of this response: on the master branch, line 251 of runtime.cc contains a code block that does not match your error message.
I’ll paste it here just so you are aware.

github.com

dmlc/tvm/blob/master/vta/src/runtime.cc#L251


   }
 }
 /*! \brief Dump kernel micro ops to stdout. */
 void Dump() {
   uint32_t size = seq_.size();
   printf("There are %u uops\n", size);
   for (uint32_t i = 0; i < size; ++i) {
     printf("[%04u]\t acc=%u, inp=%u, wgt=%u\n",
            i,
            seq_[i].dst_idx,
            seq_[i].src_idx,
            seq_[i].wgt_idx);
   }
   printf("\n");
 }


public:
 // The kernel's mode, opcode, immediate setting and value
 uint32_t mode_{0xFFFFFFFF};  // UOP type: 0xFFFFFFFF - unset, 0 - GEMM, 1 - ALU
 uint32_t opcode_{0xFFFFFFFF};
 uint32_t reset_out_{0xFFFFFFFF};

Reading the error message

I am almost certain that the error is coming from here

github.com

dmlc/tvm/blob/master/vta/src/runtime.cc#L270


 uint32_t opcode_{0xFFFFFFFF};
 uint32_t reset_out_{0xFFFFFFFF};
 bool use_imm_{false};
 int16_t imm_val_{0};


private:
 // Verify that we don't write to the same acc_mem index two cycles in a row
 void VerifyDep(uint32_t dst_index) {
   size_t step = std::min(static_cast<size_t>(2U), seq_.size());
   for (size_t i = seq_.size() - step; i < seq_.size(); ++i) {
     CHECK(seq_[i].dst_idx != dst_index);
   }
 }
 // The uop buffer
 template<int, bool, bool>
 friend class UopQueue;
 friend class CommandQueue;
 // SRAM location if begin != end.
 uint32_t sram_begin_{0};
 uint32_t sram_end_{0};
 // The signature used for verification

(Line 270 is also not far from 251, so this seems plausible.)
I think the comment line is self-explanatory: // Verify that we don't write to the same acc_mem index two cycles in a row (but maybe it wasn’t part of the branch you are currently working with).
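
For readers who do not want to dig through the C++, here is a rough Python port of the VerifyDep check quoted above. It is only an illustration: verify_dep and push_uop are hypothetical names, and the list of destination indices stands in for the runtime's seq_ buffer. Note that the check compares the new destination against the last two recorded entries.

```python
def verify_dep(dst_history, dst_index):
    """Raise if dst_index matches either of the last two recorded destinations."""
    step = min(2, len(dst_history))
    for prev in dst_history[len(dst_history) - step:]:
        if prev == dst_index:
            raise ValueError(f"acc index {dst_index} written again within two uops")


def push_uop(dst_history, dst_index):
    """Hypothetical helper: check the dependency, then record the write."""
    verify_dep(dst_history, dst_index)
    dst_history.append(dst_index)


history = []
for dst in (0, 1, 2, 0):    # each write differs from the previous two
    push_uop(history, dst)

try:
    push_uop(history, 2)    # 2 is among the last two recorded writes
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the first four writes are accepted and the fifth is rejected, which is the same kind of failure as the Check failed: seq_[i].dst_idx != dst_index message in the traceback.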

Now the question is: Why can’t you write into the same acc_mem index two cycles in a row?
Easy answer: VTA developers defined this as such and you just have to accept it
Harder answers (would require a VTA dev to respond or you to look at the VTA HW and find it out for yourself):

Address generators do not support this
Pipeline structure does not support this
Some other reason

Hope this helps

thierry · May 9, 2019, 4:45am

@aca88 thank you for taking the time to reply to this question. The long answer about why you can’t write to the same address two cycles in a row has to do with how FPGA BRAMs work.

Essentially, after a write to a BRAM at address x, it takes two complete cycles for the new value to appear at address x. Consider the following example:

at cycle 0: we read 0xDEADBEEF at address 0x100
at cycle 1: we write 0xCAFECAFE at address 0x100 and still read 0xDEADBEEF at address 0x100
at cycle 2: we still read 0xDEADBEEF at address 0x100
at cycle 3: we finally read 0xCAFECAFE at address 0x100

What I’m getting at here is that it takes 2 cycles for us to see the change take effect. This is a fundamental limitation of FPGA BRAM, and we cannot circumvent it.
Consequently in HLS when we have a loop that looks like:
acc[idx] += ...
the compiler will make the worst-case assumption that idx can be the same two cycles in a row; if that were the case, it would need to ensure that updates are not lost. For example, if the loop always added 1 to acc[0] with an initiation interval of 1, we would lose half of the updates: after n cycles the value in the register (assuming it’s initialized to 0) would be n/2 instead of n. The HLS compiler therefore increases the loop initiation interval (II) of the pipeline from 1 to 2. This really hurts performance, since it means that we’re operating at 1/2 the throughput that we should achieve.

If that previous paragraph did not make much sense, refer to the HLS manual here, page 137, for the Removing False Dependencies to Improve Loop Pipelining example.
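
The n/2 arithmetic can be checked with a small Python simulation. This is a deliberately simplified sketch, not a cycle-accurate BRAM model: it assumes one read-modify-write is issued per cycle and that a write issued at cycle t only becomes visible to reads at cycle t + 2. The function name run_accumulate and its parameters are made up for illustration.

```python
from collections import deque


def run_accumulate(indices, write_latency=2, size=4):
    """Issue acc[idx] += 1 once per cycle; writes become visible write_latency cycles later."""
    mem = [0] * size
    pending = deque()  # entries are (cycle_when_visible, addr, value)
    for cycle, idx in enumerate(indices):
        # Commit every write whose latency has elapsed.
        while pending and pending[0][0] <= cycle:
            _, addr, value = pending.popleft()
            mem[addr] = value
        stale = mem[idx]  # may miss writes that are still in flight
        pending.append((cycle + write_latency, idx, stale + 1))
    # Drain the writes still in flight after the last cycle.
    for _, addr, value in pending:
        mem[addr] = value
    return mem


same_address = run_accumulate([0] * 8)         # 8 updates, all to acc[0]
rotating = run_accumulate([0, 1, 2] * 3)       # 9 updates, 3 per address
```

Under these assumptions, eight back-to-back updates to acc[0] leave it at 4, so half of them are lost, while rotating over three addresses keeps all three updates per address.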

So in order to circumvent that limitation, we tell the compiler to trust us and assume that idx won’t be the same two cycles in a row. That means that we now need to enforce that restriction in software. When implementing VTA, we chose to make the software runtime enforce the restriction, hence the error that @ignite was getting.
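
One plausible reading of why the reordered schedule trips this check, offered as an assumption and not confirmed in this thread: once the reduction axes d_j and d_i become the innermost loops, consecutive operations accumulate into the same output location. The toy Python model below only enumerates loop nests and is not how VTA actually lowers a schedule to micro-ops. It ignores x_bo, k_o and the tensorized axes, uses made-up extents, and uses the axis meanings from the comments in the original post (x_j = output column, x_i = output row, x_co = output channel, d_j and d_i = weight x and y).

```python
from itertools import product

# Hypothetical extents, chosen only for illustration.
EXTENTS = {"x_j": 2, "x_i": 3, "x_co": 2, "d_j": 3, "d_i": 3}


def dst_sequence(order, extents=EXTENTS):
    """Accumulator index written at each step; order lists axes from outermost to innermost."""
    seq = []
    for values in product(*(range(extents[axis]) for axis in order)):
        p = dict(zip(order, values))
        # Only the spatial axes select the output location; d_j and d_i do not.
        seq.append((p["x_co"] * extents["x_i"] + p["x_i"]) * extents["x_j"] + p["x_j"])
    return seq


def violates(seq, window=2):
    """True if any write targets an index also written in the previous `window` steps."""
    return any(seq[i] in seq[max(0, i - window):i] for i in range(len(seq)))


original = dst_sequence(["x_j", "d_j", "d_i", "x_co", "x_i"])
reordered = dst_sequence(["x_j", "x_i", "x_co", "d_j", "d_i"])
original_ok = not violates(original)
reordered_bad = violates(reordered)
```

In this toy model the original order never repeats a destination within two steps, whereas the reordered nest writes the same index nine times in a row (once per d_j, d_i pair).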

As an exercise, to convince you that this is real funky hardware behavior and not some hardware guy imposing arbitrary restrictions on how to use the hardware, you can comment that runtime check out and run the program on the FPGA to see what happens. You’ll get incorrect results, because some of the accumulator register updates are being lost!

Hope that helped clarify things on top of @aca88's great explanation!

ignite · May 9, 2019, 5:39am

Thank you all for taking the time to write the answers. I was able to comment out that runtime “assertion” and now I understand much more.