Reordering the `conv2d_stage` stage in vta_conv2d.py


#1

Dear Developers,

I was looking into the schedule_packed_conv2d function in the vta_conv2d.py file and was trying to reorder the axes of computation for the VTA hardware. Basically, in order to compute the output values of an output of shape (C, H, W), we have 5 loops as follows:

for output_column:          // W
  for output_row:           // H
    for output_channel:     // C
      for weight_x:
        for weight_y:
          Multiply()

I would like to compute the conv2d operator in the above order, so I tried changing this line:

#                       output_column   output_channel   output_row
#                                   |               |    |
s[conv2d_stage].reorder(x_bo, k_o, x_j, d_j, d_i, x_co, x_i, x_bi, x_ci, k_i)
#                                        |    |
#                                 weight_x   weight_y

to the following:

#                       output_column   row   output_channel
#                                   |    |     |
s[conv2d_stage].reorder(x_bo, k_o, x_j, x_i, x_co, d_j, d_i, x_bi, x_ci, k_i)
#                                                   |    |
#                                            weight_x   weight_y

Question
However, changing the line above gave me a runtime error at this line. What might be the cause of the error? Is there any restriction in changing the order of computation?

Traceback (most recent call last):
  File "~/test_conv2d.py", line 236, in <module>
    tcost = timer()
  File "~/tvm/python/tvm/module.py", line 178, in evaluator
    blob = feval(*args)
  File "~/tvm/python/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "~/tvm/python/tvm/_ffi/base.py", line 71, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [17:07:54] ~/tvm/vta/src/runtime.cc:251: Check failed: seq_[i].dst_idx != dst_index


#2

Hi,

I am not 100% sure of this response, since the master branch has an unexpected (based on your error message) code block at line 251 of the runtime.cc
I’ll paste it here just so you are aware

Reading the error message

I am almost certain that the error is coming from here

(also, line 270 is not so far from 251, which seems reasonable)
I think the comment line is self-explanatory: `// Verify that we don't write to the same acc_mem index two cycles in a row` (but maybe it wasn't part of the branch you are currently working with).

Now the question is: Why can’t you write into the same acc_mem index two cycles in a row?
Easy answer: VTA developers defined this as such and you just have to accept it
Harder answers (would require a VTA dev to respond or you to look at the VTA HW and find it out for yourself):

  • Address generators do not support this
  • Pipeline structure does not support this
  • Some other reason

Hope this helps :slight_smile:


#3

@aca88 thank you for taking the time to reply to this question. The long answer about why you can’t write to the same address two cycles in a row has to do with how FPGA BRAMs work.

Essentially, after a write to a BRAM at address x, it takes two complete cycles for the new value to appear at address x. Consider the following example:

  • at cycle 0: we read 0xDEADBEEF at address 0x100
  • at cycle 1: we write 0xCAFECAFE at address 0x100 and still read 0xDEADBEEF at address 0x100
  • at cycle 2: we still read 0xDEADBEEF at address 0x100
  • at cycle 3: we finally read 0xCAFECAFE at address 0x100

What I’m getting at here is that it takes 2 cycles for us to see the change take effect. This is a fundamental limitation of FPGA BRAM, and we cannot circumvent it.
Consequently, in HLS, when we have a loop that looks like:
acc[idx] += ...
the compiler makes the worst-case assumption that idx can be the same two cycles in a row. If that were the case, it would have to ensure that no updates are lost: if the loop always added 1 to acc[0] at one iteration per cycle, we would lose half of the updates, and after n cycles the value in the register (assuming it’s initialized to 0) would be n/2 instead of n. To stay safe, the HLS compiler therefore increases the initiation interval (II) of the pipelined loop from 1 to 2. This really hurts performance, since it means that we’re operating at 1/2 the throughput that we should achieve.

If that previous paragraph did not make much sense, refer to the HLS manual here, page 137, for the Removing False Dependencies to Improve Loop Pipelining example.
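The lost-update effect described above can be reproduced with a toy software model. This is only a sketch under stated assumptions: `simulate_accumulate`, its `latency` parameter, and the memory size are hypothetical names chosen for illustration, and the model is not cycle-accurate for any real BRAM. It assumes one `acc[idx] += 1` is issued per cycle and that a write issued at cycle t becomes visible to reads at cycle t + 2.

```python
def simulate_accumulate(indices, latency=2, size=4):
    """Toy model of `acc[idx] += 1` issued once per cycle.

    Assumption: a write issued at cycle t only becomes visible to
    reads at cycle t + latency (not a cycle-accurate BRAM model).
    """
    mem = [0] * size
    pending = []  # (visible_at_cycle, index, value), in issue order
    for cycle, idx in enumerate(indices):
        # Commit every write whose latency has elapsed.
        still_pending = []
        for visible_at, w_idx, value in pending:
            if visible_at <= cycle:
                mem[w_idx] = value
            else:
                still_pending.append((visible_at, w_idx, value))
        pending = still_pending
        # Read the (possibly stale) value and issue the updated one.
        pending.append((cycle + latency, idx, mem[idx] + 1))
    # Drain the writes still in flight after the last cycle.
    for _, w_idx, value in pending:
        mem[w_idx] = value
    return mem


same_index = simulate_accumulate([0] * 8)       # 8 updates to acc[0]
alternating = simulate_accumulate([0, 1] * 4)   # 4 updates each to acc[0], acc[1]
print(same_index)    # [4, 0, 0, 0]: half of the 8 updates are lost
print(alternating)   # [4, 4, 0, 0]: no update is lost
```

In this model, writing to the same index on consecutive cycles yields n/2 instead of n, while alternating between two indices keeps every update, which matches the restriction discussed in this thread.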

So in order to circumvent that limitation, we tell the compiler to trust us and assume that idx won’t be the same two cycles in a row. That means that we now need to enforce that restriction in software. When implementing VTA, we chose to make the software runtime enforce the restriction, hence the error that @ignite was getting.
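To connect this restriction back to the original question, here is a simplified sketch of why one loop order passes such a check and the other does not. Everything in it is an assumption made for illustration: the extents are made up, the axis names (`w`, `h`, `c`, `kx`, `ky`) only loosely follow the loop nest in the first post, and `dst_indices` and `first_back_to_back_write` are hypothetical helpers, not the VTA runtime's code. It also ignores tensorization and how VTA actually generates micro-ops.

```python
from itertools import product

# Hypothetical small extents, chosen only for illustration.
EXTENTS = {"w": 2, "h": 2, "c": 2, "kx": 3, "ky": 3}
OUTPUT_AXES = ("c", "h", "w")  # axes that select an accumulator entry


def dst_indices(loop_order):
    """Return the accumulator index written by each iteration.

    `loop_order` lists the axes from outermost to innermost loop.
    """
    seq = []
    for point in product(*(range(EXTENTS[a]) for a in loop_order)):
        pos = dict(zip(loop_order, point))
        idx = 0
        for axis in OUTPUT_AXES:
            idx = idx * EXTENTS[axis] + pos[axis]  # row-major flattening
        seq.append(idx)
    return seq


def first_back_to_back_write(seq):
    """Return the position of the first repeated consecutive index, or None."""
    for i in range(1, len(seq)):
        if seq[i] == seq[i - 1]:
            return i
    return None


# Reduction axes (kx, ky) outside the output axes c and h, as in the original order.
reduction_outside = dst_indices(("w", "kx", "ky", "c", "h"))
# Reduction axes innermost, as in the modified order.
reduction_inside = dst_indices(("w", "h", "c", "kx", "ky"))

print(first_back_to_back_write(reduction_outside))  # None
print(first_back_to_back_write(reduction_inside))   # 1
```

With the reduction axes innermost, consecutive iterations accumulate into the same output element, which is the pattern the runtime check rejects. With an output axis innermost, the destination index changes on every iteration in this model.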

As an exercise, to convince you that this is real funky hardware behavior and not some hardware guy imposing arbitrary restrictions on how to use the hardware, you can comment that runtime check out and run the program on the FPGA to see what happens. You’ll get incorrect results, because some of the accumulator register updates are being lost!

Hope that helped clarify things on top of @aca88's great explanation!


#4

Thank you all for taking the time to write the answers. I was able to comment out that runtime “assertion” and now I understand much more.