VTA First Conv Layer Optimization

Hi There,

The VTA first conv layer currently runs on the CPU and does not get offloaded to the FPGA. In most cases that is a performance bottleneck and needs optimization. Following are some ideas about the optimization; please kindly comment.

Regards, Hua

  1. Train the network so that the first conv layer supports int8 input and weights, and add a feature to VTA so that the 16x16 MAC unit can be used for the 3-input-channel computation.

  2. When running on the Arm CPU, it seems only one core is used to compute the first conv; we could parallelize the first conv across multiple CPU cores to accelerate it.
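
To make idea 2 more concrete, here is a minimal sketch of how an operator can be spread over the TVM CPU thread pool. The workload, split factor, and thread count are placeholders for illustration, not the actual first-conv schedule:

```python
import os
os.environ["TVM_NUM_THREADS"] = "4"   # assumed quad-core Arm board; set before the first run

import numpy as np
import tvm
from tvm import te

# Toy elementwise workload standing in for the first conv; the point is the
# explicit `parallel` on the outer axis so the op uses the whole thread pool.
n = 1 << 20
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
outer, inner = s[B].split(B.op.axis[0], factor=1024)
s[B].parallel(outer)

f = tvm.build(s, [A, B], target="llvm")   # swap in an aarch64 target for the board
dev = tvm.cpu()
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
f(a, b)
```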

Hi @hjiang,

I’m working on deploying a pre-quantized ResNet network with VTA in which the first conv layer supports int8 input/weights. I think it would be an interesting feature even though most quantization works avoid quantizing the first layer. Both ideas are valid, and it would be interesting to add this feature into VTA. Please let me know how I can help you to work on this.

Regards, Augusto

Hi @acapone13,

Thanks for following up on this post, and it is nice to know you are interested in VTA performance optimization. About the ResNet-18 pretrained model: which framework did you use to generate the model, and how much accuracy is lost after quantization?

Regards

Hua

Hi @hjiang,

I use Sony’s framework NNabla to train the networks, but I then convert them to ONNX or TensorFlow in order to use them with TVM. The accuracy loss is around 4%.
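
For reference, importing the ONNX export into TVM looks roughly like the sketch below; the file name and input name are placeholders, not the actual model details:

```python
import onnx
import tvm
from tvm import relay

# Placeholder file/input names for the NNabla-exported ResNet-18.
model = onnx.load("resnet18_int8.onnx")
shape_dict = {"input0": (1, 3, 224, 224)}

# Import the ONNX graph into Relay; the resulting module is what would
# later be compiled for the VTA target.
mod, params = relay.frontend.from_onnx(model, shape_dict)
print(mod["main"])
```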

Regards

Augusto

Hi @acapone13,

To run the first conv2d layer on VTA, there are two solutions/steps. The first is to pad the first conv2d from 3 input channels up to the channel count the VTA hardware expects, for example 16; after that, the first quantized conv2d layer can run on VTA. The padding certainly increases the number of compute ops and impacts performance, but it provides a baseline for the next level of performance optimization.
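
As a quick sanity check of the channel-padding idea, the NumPy sketch below (shapes and block factor are illustrative assumptions) zero-pads both the input and the weight from 3 channels up to 16 and confirms the conv result is unchanged, while the MAC count per output pixel grows from 3x3x3 to 16x3x3:

```python
import numpy as np

# Assumed shapes: NCHW input with C=3, OIHW weight with I=3.
N, C, H, W = 1, 3, 8, 8
O, KH, KW = 16, 3, 3
BLOCK_IN = 16   # assumed VTA input-channel blocking factor

x = np.random.randint(-128, 128, size=(N, C, H, W)).astype("int8")
w = np.random.randint(-128, 128, size=(O, C, KH, KW)).astype("int8")

def conv2d_nchw(x, w):
    # Naive int32-accumulating conv2d, stride 1, no spatial padding.
    N, C, H, W = x.shape
    O, _, KH, KW = w.shape
    out = np.zeros((N, O, H - KH + 1, W - KW + 1), dtype="int32")
    for n in range(N):
        for o in range(O):
            for i in range(H - KH + 1):
                for j in range(W - KW + 1):
                    out[n, o, i, j] = np.sum(
                        x[n, :, i:i+KH, j:j+KW].astype("int32")
                        * w[o].astype("int32"))
    return out

# Zero-pad the channel axis of both input and weight from 3 to 16.
pad_c = BLOCK_IN - C
x_pad = np.pad(x, ((0, 0), (0, pad_c), (0, 0), (0, 0)))
w_pad = np.pad(w, ((0, 0), (0, pad_c), (0, 0), (0, 0)))

# The padded channels contribute zeros, so the result is unchanged.
assert np.array_equal(conv2d_nchw(x, w), conv2d_nchw(x_pad, w_pad))
```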

The second solution is a special optimization for non-1x1 kernels, for example 3x3 kernels: instead of doing traditional im2col blocking on the padded channels, we can use every 3x3x3 (27) block of data as the input and do the related padding. This reduces the compute increase and can improve performance.
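
The sketch below illustrates this second idea under the same assumed shapes: gather each 3x3x3 = 27-element patch (im2col style), pad the 27-wide reduction axis up to the next multiple of the assumed 16-wide block (32), and run a GEMM. Compared with padding the channels first (reduction axis of 16x3x3 = 144), the reduction axis shrinks to 32:

```python
import numpy as np

N, C, H, W = 1, 3, 8, 8
O, KH, KW = 16, 3, 3
BLOCK_IN = 16   # assumed VTA input-channel blocking factor

x = np.random.randint(-128, 128, size=(N, C, H, W)).astype("int8")
w = np.random.randint(-128, 128, size=(O, C, KH, KW)).astype("int8")

OH, OW = H - KH + 1, W - KW + 1

# im2col: each output pixel reads a 3x3x3 = 27-element patch.
patches = np.zeros((N * OH * OW, C * KH * KW), dtype="int8")
idx = 0
for n in range(N):
    for i in range(OH):
        for j in range(OW):
            patches[idx] = x[n, :, i:i+KH, j:j+KW].ravel()
            idx += 1

# Pad the 27-element reduction axis up to the next multiple of BLOCK_IN (32),
# instead of padding channels to 16 first (which would give 16*3*3 = 144).
K = C * KH * KW                                        # 27
K_pad = ((K + BLOCK_IN - 1) // BLOCK_IN) * BLOCK_IN    # 32
patches_pad = np.pad(patches, ((0, 0), (0, K_pad - K)))
w_mat = np.pad(w.reshape(O, K), ((0, 0), (0, K_pad - K)))

# GEMM on the padded 32-wide reduction axis reproduces the convolution.
out = patches_pad.astype("int32") @ w_mat.astype("int32").T
out = out.T.reshape(1, O, OH, OW)

# Sanity check: the zero padding does not change the result.
ref = patches.astype("int32") @ w.reshape(O, K).astype("int32").T
assert np.array_equal(out, ref.T.reshape(1, O, OH, OW))
```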

For the first solution, the proposal is to pad the input data layer from 3 channels to 16*n to match the VTA hardware resources. The padding part would look like this PR: https://github.com/apache/incubator-tvm/pull/4887; _const_shape_match has similar logic, but it only does that for factor_out. If you are interested, you can try a patch based on that logic.
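
At the Relay level, the padding from the first solution could look roughly like the sketch below. This is only a simplified illustration in the spirit of _const_shape_match from PR 4887, not the PR's actual code; the layer shapes and the BLOCK_IN factor are assumptions:

```python
import tvm
from tvm import relay

BLOCK_IN = 16   # assumed VTA input-channel blocking factor

# Hypothetical first layer: 3-channel int8 input, 3x3 kernel, 64 filters.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="int8")
weight = relay.var("weight", shape=(64, 3, 3, 3), dtype="int8")

# Zero-pad the channel axis of both data and weight from 3 up to BLOCK_IN
# so the layer matches the hardware channel factor.
data_pad = relay.nn.pad(data, pad_width=((0, 0), (0, BLOCK_IN - 3), (0, 0), (0, 0)))
weight_pad = relay.nn.pad(weight, pad_width=((0, 0), (0, BLOCK_IN - 3), (0, 0), (0, 0)))

conv = relay.nn.conv2d(data_pad, weight_pad,
                       channels=64, kernel_size=(3, 3),
                       padding=(1, 1), out_dtype="int32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))
print(mod)
```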

Please kindly let me know if you have any better ideas or any questions about the possible solutions.

Regards

Hua

Hi @hjiang,

Sorry for the late response, I’ve had some other work to do. Thanks for the proposed solutions; I’ll try these implementations with my model and keep you updated.

Regards

Augusto