End-to-end (TVM+VTA) flow tutorial with Yolo v3


#1

Dear tvm community members,

I want to learn the end-to-end flow with Yolo v3, which means not only porting the darknet yolov3 model with tvm/relay, but also compiling the model into VTA micro-op instructions, running the model on the VTA RTL simulation with a given image, and finally getting an output image with labeled bounding boxes.

I am aware of this tutorial:
https://docs.tvm.ai/tutorials/frontend/from_darknet.html

But it looks like it stops at TVM and doesn’t cover the flow for VTA.

Would it be possible to complement this tutorial with an end-to-end example flow?

Thanks very much!

Kevin


#2

Hi Kevin,

Having a YOLOv3 out of the box demo would be nice. It would require a bit of footwork in terms of applying quantization correctly and inserting the right stop_fusion nodes to pattern match the operators that VTA supports (until we have a good pattern matcher). Once we have a set of operators offloadable to VTA, we can run autoTVM scheduling to have optimized inference on a variant of VTA.
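As a rough illustration of the symmetric per-tensor scheme that integer quantization applies before operators can run on VTA’s integer core, here is a pure-Python toy (assuming symmetric int8 with a max-abs scale; the helper names are made up, and this is not TVM’s actual quantization pass):

```python
# Toy symmetric int8 quantization: one scale per tensor, derived from
# the maximum absolute value, so zero maps exactly to zero.

def quantize_int8(values):
    """Map float values to int8 with a symmetric max-abs scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# round-trip error is bounded by half a quantization step
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, approx))
```

The real pass additionally has to place the (de)quantize and stop_fusion annotations so that the int8 regions line up with operators VTA can execute.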

If you want to give it a shot to have it run ASAP, I’d be happy to guide you through it.


#3

@kevinyuan, @thierry,

I would like to join you too.

  • I’d add the small variant yolov3-tiny to the scope too.
  • It’s small enough (e.g. ~100ms on a Mali GPU with float16), so it could be more VTA friendly.

Starting from the current tutorial, I can help implement/extend it to use video frames (opencv) + quantization (tvm/int8) + tuning (autotvm) (I’ll be back with a GIST proposal).

But then, I’d challenge @thierry @vegaluis to target it on the ICE40 (5k) (with only yosys + nextpnr) :slight_smile: :blush: :smile:

Let me prepare a GIST with the flow (demo on CPU); I’ll stop at the VTA part, then I’ll be back :wink: .


#4

Nice! I think that tiny YOLO will give us a good starting point for real time object detection :slight_smile:

Let me know how the compilation goes. I expect the quantization pass might break; just let me know which roadblocks you run into! Also, list the convolution operators that need tuning, and I can spin up some tuning jobs and update TOPHUB accordingly.


#5

@thierry, @cbalint13,

I very much appreciate your quick response and actions on this request.

I would like to contribute to this effort as well, but my experience is mostly in FPGA/ASIC front-end design, and I am quite new to Python/TVM/VTA/Chisel.

I think I can try to add some missing functions (e.g. operator offloading) into VTA with Verilog or Chisel, though I will certainly need your guidance on how to define the missing functions and the detailed specification of a function design that best fits the TVM/VTA architecture.

Please don’t hesitate to let me know what and how I can contribute :slightly_smiling_face:

Best regards.


#6

@thierry , @kevinyuan,

Allow 1-2 days to test/elaborate; I’ve already done it partially and just need to wrap things together. I’ll ask for help (a quick look from the quantization folks); there are also pending PRs for quantization.

I may have exaggerated with the ice40 targets, but one day we could conquer it; see e.g. MARLANN. It would also be interesting in the future to have low-bit bitpacked operators for VTA; then it really could fit on a small FPGA.


#7

Working on bitpacked operators for VTA would be super interesting; this is a direction I’m looking into enabling in hardware/software, but a lot of work will be needed on training/quantization to enable it.
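To make the “bitpacked” idea concrete, here is a minimal pure-Python sketch: pack {-1,+1} values one bit per lane, and a dot product then reduces to XNOR plus popcount. The names (`pack_bits`, `binary_dot`) are illustrative only, not a VTA or TVM API:

```python
# Bitpacked binary dot product: with weights/activations constrained to
# {-1,+1} and encoded as single bits (+1 -> 1, -1 -> 0), elementwise
# multiply becomes XNOR and the reduction becomes a popcount.

def pack_bits(signs):
    """Pack a list of +1/-1 values into one integer, one bit each (+1 -> 1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a, b, n):
    """Dot product of two packed {-1,+1} vectors of length n."""
    # equal bits -> product +1, differing bits -> product -1
    same = bin(~(a ^ b) & ((1 << n) - 1)).count("1")  # popcount of XNOR
    return 2 * same - n

x = [+1, -1, +1, +1]
w = [+1, +1, -1, +1]
ref = sum(xi * wi for xi, wi in zip(x, w))
assert binary_dot(pack_bits(x), pack_bits(w), 4) == ref
```

On a small FPGA this is attractive because the multiply-accumulate array collapses into LUT-friendly XNOR gates and popcount trees.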

In terms of FPGA coverage, are there low-power FPGAs that @cbalint13 and @kevinyuan would be interested in providing preliminary support to other than the ice40? I think it might be interesting to see if we can instantiate a VTA design on an FPGA ~10x smaller than originally designed for. We could come up with interesting optimizations, or re-organizations.


#8

@thierry ,

Some random thoughts on fpga targets:

  • Small FPGAs would be interesting for ultra-low-power applications, like TinBiNN showcased by Lattice, or MARLANN which does it on the ICE40. I am confident that at least state of the art can be achieved in terms of smallness and low power consumption. These smaller devices also have the advantage of being synthesizable end-to-end with open-source tools, so they can become very popular, if they aren’t already, like the upduino board; even Lattice supports and showcases it as a third-party board. The low-power target field is still poorly covered by the industry; there is a lot of open room.

  • Also, the Lattice ECP5 (low-to-mid size) series is now supported by the open-source community on boards like TinyFPGA-EX, and if I am not mistaken it is showcased by companies like XNOR.ai here as an industry-first low-power target for AI applications.

  • Among high-end standalone FPGAs (with some PCIe), the Xilinx 7 family is also interesting, especially the affordable ones, e.g. on CrowdSupply. They are large enough to experiment with and should also be synthesizable with open-source tools soon. Such boards can even be built DIY with ease, with no very special requirements or price tag.

  • Truly high-end ones like Ultra+ have become inaccessible for many people; however, they can deliver real state-of-the-art performance (though I’m not so sure about that when compared to ASIC competitors).


#9

Thanks @cbalint13 for the suggestions. It would be great to have a contributor work on Lattice tool chain support. Recently, TVM reviewer @liangfu added support for Intel (formerly Altera) FPGA SoCs. We could perhaps pick a Lattice FPGA that has microcontroller support. Thoughts?


#10

I also realize that we’ve diverted from the topic of the original thread, so feel free to add a new one.


#11

@thierry, @kevinyuan,

I prepared an end-to-end demo script (on CPU) here that does the following:

  • takes yolov3-tiny (can be ‘yolov3’ or ‘yolov2’, but those are not tested)
  • imports it via a relay graph
  • quantizes the net using KL statistics (latest PR #3854)
  • tunes the resulting network (optional, uncomment L348), with resume support
  • evaluates the final inference time per single frame
  • runs the demo on this video in real time on the screen.

For now it is CPU only; it can be adapted to VTA (help needed).

Note that frame resizing, box drawing & other graphic overlays at display time are orders of magnitude more time consuming than the inference itself, but this is meant to be a demo/tutorial after all.
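For readers following along, the “KL statistics” step can be sketched in plain Python: among candidate clipping thresholds, pick the one whose coarsely quantized histogram stays closest, in KL divergence, to the observed activation histogram. This is a simplified toy with made-up helper names, not the actual TVM calibration pass from PR #3854:

```python
import math
import random

def histogram(samples, bins, hi):
    """Histogram of |x| over [0, hi); outliers are clipped into the last bin."""
    h = [0] * bins
    for s in samples:
        h[min(int(abs(s) / hi * bins), bins - 1)] += 1
    return h

def kl_divergence(p, q):
    """KL(P || Q) over two histograms, with a small eps to avoid log(0)."""
    eps = 1e-9
    ps, qs = sum(p), sum(q)
    return sum((pi / ps) * math.log((pi / ps + eps) / (qi / qs + eps))
               for pi, qi in zip(p, q) if pi > 0)

def best_threshold(samples, bins=64, levels=8):
    """Pick the clipping threshold minimizing KL between the float
    histogram and its coarsely-quantized (merged-bin) version."""
    hi = max(abs(s) for s in samples) or 1.0
    best, best_kl = hi, float("inf")
    for k in range(2, 11):                       # candidate thresholds
        t = hi * k / 10.0
        p = histogram(samples, bins, t)
        step = bins // levels                    # simulate coarse quantization:
        q = []                                   # merge bins, then expand back
        for i in range(0, bins, step):
            m = sum(p[i:i + step]) / step
            q.extend([m] * step)
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best, best_kl = t, kl
    return best

random.seed(0)
acts = [random.gauss(0, 1) for _ in range(5000)]
t = best_threshold(acts)
assert 0 < t <= max(abs(a) for a in acts)
```

The real calibration additionally runs the network over a calibration set to collect per-layer statistics; the idea of trading clipping error against resolution is the same.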


#12

Very neat; this will be a great starting point to target VTA. I’ll start to take a look at the operators so we can make sure that we have proper coverage on VTA.

What are you running the demo on?


#13

@thierry, @kevinyuan,

  • Updated the script to revision 4 (works better; also tested with ‘yolov3-tiny’, ‘yolov3’ & ‘yolov2’).
  • Also, for the local CPU a generic tuning file is exposed for each layer (no AVX2; that would be much faster).
  • Except for the video file, all downloads happen automatically in the script, which is useful if we want an end-to-end tutorial.
  • It is possible to use a camera instead of a video; I’ll add a cfg switch for this in the next revision 5 (be back).

It hits ~100ms inference time on CPU (an old IvyBridge); I’m curious how it would do with VTA on various targets (de10, pynq, ultra96).
ATM I don’t have any of the mentioned boards, but I would look forward to adding support for artix/kintex7 or a smaller ecp5 (cpu-less) with a softcore (e.g. risc-v, if that is the only way).


#14

Thank you @cbalint13; I’d like to try it on the pynq with VTA. I will update you when I get something running.