[VTA] Bringing depthwise convolution support

In order to bring MobileNet inference support, computation schedules for both group conv2d and depthwise conv2d need to be implemented. Currently we have group conv2d support in #4421; however, we don't have depthwise conv2d support.

Take the first depthwise conv2d layer in MobileNetV1-1.0 as an example: it takes input_shape=(1, 32, 112, 112) and weight_shape=(32, 1, 3, 3), and expects output_shape=(1, 32, 112, 112).
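For reference, such a layer can be expressed in Relay roughly as follows (a minimal sketch; the int8 dtypes are illustrative, chosen to match VTA's quantized pipeline):

from tvm import relay

data = relay.var("data", shape=(1, 32, 112, 112), dtype="int8")
weight = relay.var("weight", shape=(32, 1, 3, 3), dtype="int8")
# groups == channels is what makes conv2d a depthwise convolution
dw = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1),
                     strides=(1, 1), groups=32, channels=32)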

Challenges

First, in depthwise conv2d, the accumulation ONLY takes place along the spatial axes. The GEMM instruction performs 16 fused multiply-add (FMA) operations, while the workload requires only 9 such operations per output. An easy workaround is to transform the input with im2col, so that the depthwise conv2d operator can be turned into a matrix multiplication problem. A drawback of this approach is that it consumes a large amount of memory if we leave the im2col operation completely in a separate operator.
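To make the workaround concrete, here is a minimal NumPy sketch of my own (assuming a single batch, stride 1, padding 1, and 3x3 kernels): im2col turns the per-channel 9-tap accumulation into a matrix-vector product whose reduction axis could then be zero-padded from 9 to 16 for the GEMM unit.

import numpy as np

def depthwise_conv2d_im2col(x, w, pad=1):
    # x: (C, H, W) input, w: (C, 3, 3) depthwise kernels; illustrative only
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty((C, H, W), dtype=np.int32)
    for c in range(C):
        # im2col: gather the 9 taps of every 3x3 window into a (H*W, 9) matrix
        cols = np.stack([xp[c, i:i + H, j:j + W].ravel()
                         for i in range(3) for j in range(3)], axis=1)
        # the 9-wide reduction becomes a matvec; zero-padding it to 16 would
        # let the 16-wide FMA consume it, at 9/16 utilization
        out[c] = (cols.astype(np.int32)
                  @ w[c].ravel().astype(np.int32)).reshape(H, W)
    return out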

Second, to maximize the utilization of the GEMM compute unit, it's better to load inputs into local.wgt_buffer and weights into local.inp_buffer. Specifically, if we could load the 3x3 spatial kernel as a 1x16 vector into local.inp_buffer and load 16x16 inputs into local.wgt_buffer, the spatial kernel data could be reused to multiply with all of the inputs. However, exchanging the input and weight buffers might cause additional problems, and it's unclear whether it is worth doing so.
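A rough NumPy illustration of that layout (a hypothetical packing of my own, not an actual VTA intrinsic): the 3x3 kernel of one channel is zero-padded into a 1x16 vector and reused against a 16x16 tile of im2col'ed input patches, so a single 16-wide reduction produces 16 output pixels of that channel.

import numpy as np

kernel = np.arange(9, dtype=np.int8).reshape(3, 3)   # one 3x3 depthwise kernel
k_vec = np.zeros(16, dtype=np.int8)
k_vec[:9] = kernel.ravel()                           # 1x16 vector, 7 lanes idle

# 16 im2col'ed input patches for this channel, each padded from 9 to 16 taps
patches = np.random.randint(-8, 8, size=(16, 9), dtype=np.int8)
tile = np.zeros((16, 16), dtype=np.int8)
tile[:, :9] = patches

# one 16x16-by-16 reduction yields 16 output pixels, reusing the same kernel
out = tile.astype(np.int32) @ k_vec.astype(np.int32)
ref = patches.astype(np.int32) @ kernel.ravel().astype(np.int32)
assert np.array_equal(out, ref)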

Please share your thoughts on supporting tensorized computation in depthwise convolution operators.

Hi @liangfu, thank you for this important topic. So, if I understand correctly, the present infrastructure does support group conv2d (as I have seen), but not depthwise conv2d. For this reason, we can run the benchmark tests on MobileNet layers, but not the full MobileNet network for inference on VTA, as depthwise convolution is not yet supported. What is the present status of this support? What additional features do you think are required from the compiler to support the second option that you mentioned?

I failed to fully utilize the hardware resources when performing the depthwise conv2d op. If we could treat the hardware just as a vector processor unit, it would be a lot easier. I think, as a long-term goal of this project, we might need support for an AutoTensorize feature.

@liangfu @suvadeep I was recently trying to deploy a network on an FPGA using VTA, but I ran into this problem when deploying MobileNetV2. When executing nn.conv2d, I set groups in the following way:

def conv3x3(in_planes, out_planes, stride=1):  # hypothetical signature; the original snippet showed only the body
    groups = int(in_planes / 16)
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, groups=groups,
                     stride=stride, padding=1, bias=False)

Although such a setting can be made to run, the final result is inconsistent with the result obtained from software training, and I don't know why. Have you tried to implement MobileNet and verify the correctness of the result?
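For context, and as an assumption rather than a confirmed diagnosis: MobileNetV2's depthwise layers use groups equal to in_planes, so groups = in_planes / 16 defines a grouped convolution with 16 channels per group, a different operator with 16x more weights per filter; pretrained depthwise weights would not be expected to reproduce the original outputs. A minimal PyTorch sketch of the difference, using illustrative 32-channel shapes:

import torch.nn as nn

# MobileNetV2's published depthwise layer: groups == in_planes,
# weight shape (32, 1, 3, 3)
depthwise = nn.Conv2d(32, 32, kernel_size=3, groups=32,
                      stride=1, padding=1, bias=False)

# groups = in_planes / 16 instead yields a grouped conv2d with 16 channels
# per group, weight shape (32, 16, 3, 3) -- a different operator
grouped = nn.Conv2d(32, 32, kernel_size=3, groups=2,
                    stride=1, padding=1, bias=False)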