[VTA] Questions about VTA packed format

Hello.
I am investigating the VTA implementation because I want to run neural network architectures other than ResNet, and eventually make VTA usable from Relay.

There are a few things about the pack processing mechanism that I do not understand, so let me ask here.
I would appreciate corrections if anything below is mistaken.

  1. Packing starts at the node specified by start_name, or otherwise when a max_pool2d is encountered. If start_name is not specified, what is the intention behind starting the pack at max_pool2d rather than at some other operator?

  2. If start_pack is already set (not False), a max_pool2d triggers an assert error.
    In other words, apart from the first max_pool2d, does the graph have to be structured so that a global_avg_pool2d always comes before any later max_pool2d?

  3. I tried running a YOLO v2 graph after removing ‘assert not start_pack’, but when a transpose operator appears after _pack_batch_channel has been called, it cannot be processed because the data shape and the transpose axes no longer match.
    Is it necessary to implement a rewrite of the transpose operator in the pack pass?

  4. As for the conv2d handling, counter starts at 0 and the only place it is incremented is inside this if statement, so can the condition at L282 - L291 ever be satisfied? What is this counter intended for?

Thanks,

I think most of the VTA examples should be applied very carefully to nets other than ResNet.

As far as I know, ResNet architectures start with one conv2d-act-max_pool chain, then a chain of strided and unstrided conv2ds with residual connections, and finally a global_avg_pool before the fully connected layer.

  1. The intention is to offload everything that comes before the first max_pool to the ARM processor (they mention that those layers do not have enough input channel dimensions to be offloaded to the FPGA part).

  2. Again, ResNet architectures have by design a max_pool at the beginning and a global_avg_pool at the end, and only in that order (see the sketch after this list).

  3. Maybe you are packing an already packed tensor? This might just be a problem of using YOLO instead of ResNet with code that expects a ResNet architecture.

  4. Good question. It seems that that part of the code is unreachable for an initial counter=0
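
To make answers 1 and 2 concrete, here is a rough sketch of the operator chain the pack pass implicitly assumes (a Python list with illustrative op names, not an exact graph dump):

```python
# Illustrative ResNet-style operator chain and where graph packing starts/stops.
resnet_like = [
    "conv2d", "relu",               # stays on the ARM CPU (before the first max_pool2d)
    "max_pool2d",                   # <- pack starts here when no start_name is given
    "conv2d", "add", "relu",        # packed residual blocks, offloaded to the FPGA
    # ... more residual blocks ...
    "global_avg_pool2d",            # <- pack stops here
    "flatten", "dense", "softmax",  # back on the ARM CPU
]
```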


Thank you for the comments.

In the network shown below, transpose is used right after max_pool2d.
Since _pack_batch_channel is called for max_pool2d, the input to transpose becomes 6-dimensional data.
I think it is necessary either to call _unpack_batch_channel after max_pool2d or to handle the transpose axes according to the 6-dimensional input. Is this correct?

  %42 = max_pool2d(%41, strides='(2, 2)', pool_size='(2, 2)', padding='[0, 0]', layout='NCHW', ceil_mode='False')
  %43 = reshape(%42, shape='(1, 1, 16, 1, 208, 208)')
  %44 = transpose(%43, axes='(0, 2, 4, 5, 1, 3)')
**HERE**  %45 = transpose(%44, axes='(0, 2, 3, 1)') **HERE**
  %46 = transpose(%45, axes='(0, 3, 1, 2)')
  %47 = pad(%46, pad_width='((0, 0), (0, 0), (1, 1), (1, 1))')
  %48 = cast(%47, dtype='int8')
  %50 = transpose(%Variable_2, axes='(3, 2, 0, 1)')
  %51 = cast(%50, dtype='int8')
  %52 = reshape(%51, shape='(32, 1, 16, 1, 3, 3)')
  %53 = transpose(%52, axes='(0, 2, 4, 5, 1, 3)')
  %54 = conv2d(%48, %53, padding='[0, 0]', strides='(1, 1)', out_dtype='int32', layout='NCHW1n1c', dilation='(1, 1)', kernel_size='(3, 3)', kernel_layout='OIHW1o1i', use_bias='False', channels='32')
  nnvm._base.NNVMError: Error in operator transpose4: [11:55:39] /tvm/nnvm/src/top/tensor/transform.cc:790: Check failed: shp.ndim() == param.axes.ndim() (6 vs. 4)

  Stack trace returned 9 entries:
[bt] (0) 0   libnnvm_compiler.dylib              0x000000012dd470a0 dmlc::StackTrace(unsigned long) + 464
[bt] (1) 1   libnnvm_compiler.dylib              0x000000012dd46d84 dmlc::LogMessageFatal::~LogMessageFatal() + 52
[bt] (2) 2   libnnvm_compiler.dylib              0x000000012df4a6fc nnvm::top::TransposeShape(nnvm::NodeAttrs const&, std::__1::vector<nnvm::TShape, std::__1::allocator<nnvm::TShape> >*, std::__1::vector<nnvm::TShape, std::__1::allocator<nnvm::TShape> >*) + 860
[bt] (3) 3   libnnvm_compiler.dylib              0x000000012dde50b0 nnvm::Graph nnvm::pass::(anonymous namespace)::InferAttr<nnvm::TShape, nnvm::pass::(anonymous namespace)::$_0::operator()(nnvm::Graph) const::'lambda'(nnvm::TShape const&), std::nullptr_t>(nnvm::Graph&&, nnvm::TShape, char const*, char const*, char const*, char const*, char const*, nnvm::pass::(anonymous namespace)::$_0::operator()(nnvm::Graph) const::'lambda'(nnvm::TShape const&), std::nullptr_t)::'lambda'(unsigned int, bool)::operator()(unsigned int, bool) const + 2720
[bt] (4) 4   libnnvm_compiler.dylib              0x000000012dde31aa std::__1::__function::__func<nnvm::pass::(anonymous namespace)::$_0, std::__1::allocator<nnvm::pass::(anonymous namespace)::$_0>, nnvm::Graph (nnvm::Graph)>::operator()(nnvm::Graph&&) + 3466
[bt] (5) 5   libnnvm_compiler.dylib              0x000000012ddba21f nnvm::ApplyPasses(nnvm::Graph, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) + 1407
[bt] (6) 6   libnnvm_compiler.dylib              0x000000012dd42e56 NNGraphApplyPasses + 566
[bt] (7) 7   _ctypes.cpython-36m-darwin.so       0x000000010f0f3e5f ffi_call_unix64 + 79
[bt] (8) 8   ???                                 0x00007ffee7d74850 0x0 + 140732788066384
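
For reference, the pack injected at max_pool2d is essentially the reshape + transpose pair seen at %43/%44, and undoing it would be the inverse pair. Below is a NumPy sketch of both; the helper names and factor values are assumptions (this dump uses bfactor = cfactor = 1, judging from the NCHW1n1c layout), not the actual VTA code.

```python
import numpy as np

def pack_nchw(x, bfactor, cfactor):
    """Sketch of the injected pack: NCHW -> (N//b, C//c, H, W, b, c),
    i.e. the reshape + transpose(0, 2, 4, 5, 1, 3) pair at %43/%44."""
    n, c, h, w = x.shape
    x = x.reshape(n // bfactor, bfactor, c // cfactor, cfactor, h, w)
    return x.transpose(0, 2, 4, 5, 1, 3)

def unpack_nchw(x):
    """Inverse sketch: (N//b, C//c, H, W, b, c) -> NCHW, i.e. what an
    _unpack_batch_channel-style call after max_pool2d would have to do."""
    nb, cc, h, w, b, cf = x.shape
    x = x.transpose(0, 4, 1, 5, 2, 3)  # -> (N//b, b, C//c, c, H, W)
    return x.reshape(nb * b, cc * cf, h, w)

# With bfactor = cfactor = 1, a (1, 16, 208, 208) tensor packs to
# (1, 16, 208, 208, 1, 1) and unpacks back to (1, 16, 208, 208).
```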

I assume that is correct. Basically you have to handle the mismatch in tensor shape between the nodes injected by the VTA code and those that were native to the network graph.

Thanks. I’ll give it a try.

In TinyYolo v2, the last conv2d has 125 output channels, but _pack_weight assumes the channel size is divisible by cfactor, so this case has to be handled somehow.
As ways to cope, I can think of:

  1. Increase the number of output channels so that they become divisible (see the sketch below)
  2. Avoid packing this conv2d

Is my understanding correct?
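
For option 1, I imagine something like zero-padding the weights of the last conv2d along the output-channel axis up to the next multiple of cfactor, and slicing the extra channels off again after the convolution. A rough NumPy sketch (pad_output_channels is a hypothetical helper, not part of the VTA code):

```python
import numpy as np

def pad_output_channels(weight, cfactor=16):
    """Hypothetical helper: zero-pad an OIHW weight along the output-channel
    axis so its size becomes divisible by cfactor (e.g. 125 -> 128)."""
    o, i, kh, kw = weight.shape
    pad = (-o) % cfactor
    if pad:
        weight = np.concatenate(
            [weight, np.zeros((pad, i, kh, kw), dtype=weight.dtype)], axis=0)
    return weight
```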

This would be the easiest way, yes, but then you would basically be saying “compute this on the ARM”.

You could also implement the case where dshape[1] % cfactor != 0, i.e. make the pack handle non-divisible channel counts, but that would require (IMO) way more work.

I also got an error at the assert dshape[1] % cfactor == 0 line of _pack_weight with a MobileNetV2-based network built in PyTorch. When playing with start_name and stop_name and their indexes, it appears that some layers have tensor shapes of [a, 1, b, c], causing the assert to fail. Those layers seem to be the middle convolution layers of the bottleneck blocks…

Would handling the dshape[1] % cfactor != 0 case really be the only way to fix this issue?

Has anyone else tried MobileNetV2 with a VTA graph_pack by any chance?

Thanks.

Hi, has your problem been solved?

The problem raised by @KZamudio is solved. The 1 in the tensor shape [a, 1, b, c] corresponds to depthwise convolution (the convolution is computed on only one channel at a time). The solution was to transform the depthwise convolution into a grouped convolution with groups of size 16, since VTA only supports convolutions whose channel count is a multiple of 16.
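
In case it helps others, the weight rewrite amounts to something like the sketch below (depthwise_to_grouped is a hypothetical helper; the group size of 16 matches VTA's channel block):

```python
import numpy as np

def depthwise_to_grouped(w_dw, group_size=16):
    """Expand a depthwise-conv weight (C, 1, kh, kw) into a grouped-conv
    weight (C, group_size, kh, kw) for use with groups = C // group_size.
    Each output channel keeps its original kernel at its own slot inside
    the group and zeros elsewhere, so the result is numerically identical."""
    c, _, kh, kw = w_dw.shape
    assert c % group_size == 0, "channel count must be a multiple of group_size"
    w_grp = np.zeros((c, group_size, kh, kw), dtype=w_dw.dtype)
    for j in range(c):
        w_grp[j, j % group_size] = w_dw[j, 0]
    return w_grp
```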