VTA First Conv Layer Optimization

Hi there,

With VTA, the first conv layer runs on the CPU and does not get offloaded to the FPGA. In most cases this is a performance bottleneck and needs optimization. Below are some ideas for optimizing it; comments are welcome.

Regards,
Hua

  1. Train the network so that the first conv layer supports int8 input and weights, and add a feature to VTA so that the 16*16 MAC array can be used to compute the 3-input-channel convolution.

  2. When running on arm-cpu, it seems that only one CPU core is used for the first conv computation. We could parallelize the first conv across multiple CPU cores to speed it up.
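
To make both ideas concrete, here is a minimal pure-Python sketch. It is not VTA or TVM code. For idea 1 it assumes that one way to map a 3-channel layer onto a 16-wide MAC tile is to zero-pad the input-channel axis of both data and weights up to 16; the padded channels only add zero products, so the result is unchanged. For idea 2 it splits the output channels across worker threads. The names BLOCK_IN, pad_channels, conv_one_filter, conv2d and make_example are hypothetical, and the threads only illustrate how the work could be partitioned; plain Python threads will not give a real speedup, and Python integers do not model the int32 accumulator of real hardware.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_IN = 16  # assumed input-channel tile width of the MAC array (hypothetical constant)


def pad_channels(data, weight, block_in=BLOCK_IN):
    """Zero-pad the channel axis of CHW data and OIHW weights to a multiple of block_in."""
    c, h, w = len(data), len(data[0]), len(data[0][0])
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    pad = (-c) % block_in  # number of zero channels to append
    zero_plane = [[0] * w for _ in range(h)]
    zero_kernel = [[0] * kw for _ in range(kh)]
    data_p = data + [zero_plane] * pad
    weight_p = [filt + [zero_kernel] * pad for filt in weight]
    return data_p, weight_p


def conv_one_filter(data, filt):
    """Stride-1, unpadded convolution of CHW data with a single IHW filter."""
    c, h, w = len(data), len(data[0]), len(data[0][0])
    kh, kw = len(filt[0]), len(filt[0][0])
    oh, ow = h - kh + 1, w - kw + 1
    return [
        [
            sum(
                data[ch][y + dy][x + dx] * filt[ch][dy][dx]
                for ch in range(c)
                for dy in range(kh)
                for dx in range(kw)
            )
            for x in range(ow)
        ]
        for y in range(oh)
    ]


def conv2d(data, weight, workers=1):
    """Convolve with every filter; output channels are distributed over worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda filt: conv_one_filter(data, filt), weight))


def make_example():
    """Deterministic int8-range test data: 3x6x6 input, four 3x3x3 filters."""
    data = [
        [[(7 * ch + 3 * y + x) % 256 - 128 for x in range(6)] for y in range(6)]
        for ch in range(3)
    ]
    weight = [
        [
            [[(5 * o + 3 * ch + 2 * dy + dx) % 256 - 128 for dx in range(3)]
             for dy in range(3)]
            for ch in range(3)
        ]
        for o in range(4)
    ]
    return data, weight


if __name__ == "__main__":
    data, weight = make_example()
    reference = conv2d(data, weight)
    data_p, weight_p = pad_channels(data, weight)
    # Padding 3 -> 16 input channels must not change the result.
    print(conv2d(data_p, weight_p) == reference)
    # Splitting output channels across 4 workers must not change it either.
    print(conv2d(data, weight, workers=4) == reference)
```

The cost of the padding approach is that 13 of the 16 input-channel lanes do no useful work, which is a reason to consider a dedicated feature in VTA rather than plain padding. For the CPU path, the same output-channel split would presumably be expressed in the TVM schedule by marking an outer loop axis as parallel rather than by hand-written threads.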