[RFC] Discuss TVM v0.7 Roadmap

tqchen · December 18, 2019, 3:17am

Thanks to everyone’s effort, we have successfully released v0.6. In the Apache way of development, we want broader involvement and input from community developers on what to push for in the v0.7 release cycle. Previously, a roadmap thread was created on GitHub; in this cycle we want to try something a bit different by having a discussion thread that summarizes the directions we would like to see in v0.7. The final roadmap will be a summary based on the discussion.

In terms of timeline, let us set expectations for what we want to see in the next three months. Input on longer-term projects is also welcome. Here are some example directions that we saw during the TVM conference:

More transparency in terms of roadmap and directions – this thread
More auto scheduling
Enhanced support for high-level graph rewriting for accelerators
Better documentation for developers
Unified runtime for heterogeneous execution
End to end uTVM
Unified IR improvements
More dynamic model support

Please share your thoughts. In particular:

What you would like to see in the roadmap
What you would like to contribute to the community

tqchen · December 18, 2019, 1:25am

tqchen · December 18, 2019, 1:26am

Given that many of us are going on holiday break, we will keep this discussion thread open until mid-January.

liangfu · December 18, 2019, 2:26am

It’ll be very nice to have all these new features, especially “Unified runtime for heterogeneous execution” and “better documentation for developers”.

In addition, I think we are also getting close to having

External code generator
End to end inference with Chisel VTA (with support for multiple drivers)

masahi · December 18, 2019, 4:48am

Is there any interest in 4-bit quantization? At NeurIPS 2019, Intel published an interesting paper that demonstrates high-accuracy inference in 4 bits. I think their method fits well with TVM’s automatic quantization support (the workflow is similar to the KL divergence minimization based approach).

Besides the quantization method itself, we also need to consider how to handle packing and unpacking of 4-bit values in codegen and runtime.
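
To make the packing question concrete, here is a minimal pure-Python sketch of one possible layout: two signed 4-bit values per byte, low nibble first. The function names `pack_int4` and `unpack_int4` and the nibble order are assumptions for illustration only, not an existing TVM API or a layout TVM has committed to.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2 != 0:
        values.append(0)  # pad with a zero nibble so every byte is full
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit value")
        # Masking with 0xF keeps the two's-complement bit pattern of each value.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from bytes produced by pack_int4."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]
```

The element count has to travel alongside the packed buffer, because an odd-length tensor leaves a padding nibble that cannot be told apart from a real zero.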

haichen · December 18, 2019, 6:44am

A few things that we could add to the v0.7 roadmap:

Complete VM functionality
Improve VM performance
Add tutorial for VM
Reduce AutoTVM tuning time (could be related to auto scheduling)
Auto tensorization
half2 data type support

sunlex0717 · December 18, 2019, 1:13pm

It would be nice if we could have Relay visualization tools. They would help developers see the graph-level optimizations.

ramana-arm · December 18, 2019, 4:38pm

What is the perceived release cycle? Are we planning 0.7 in the March/April 2020 time frame?

A few things to consider.

Command line utilities to use TVM as an ahead-of-time compiler. We are already starting a discuss post on that.
Improving test and benchmark infrastructure.
Testing and benchmarking on remote targets.

adb · December 18, 2019, 10:10pm

Could configurable fusion fall under this category? There was some discussion/interest in a previous thread. Since fusion rules are hardcoded at the moment, it would be nice to have an easy way to set fusion rules for different devices. In a heterogeneous environment or an SoC with multiple kinds of accelerators, this could ease the pain of rewriting fusion passes for every device.

FrozenGene · December 24, 2019, 6:18am

It would be nice to consider spending some time on performance improvement (on one specific platform, compared with the framework or library provided by the hardware vendor) and submitting to MLPerf (https://mlperf.org/inference-results/?spm=ata.13261165.0.0.41ac3a78f1Uzk9) in the v0.7 release cycle.

lsy643 · December 26, 2019, 8:27am

It would be very helpful for TVM’s applications if relay.frontend.from_tensorflow could correctly convert models from the TensorFlow object_detection library.

junrushao · December 26, 2019, 8:30am

It would be super helpful to see a working version of the auto scheduler, as many have mentioned above.

jwfromm · December 28, 2019, 3:42am

Would adding a Netron (https://github.com/lutzroeder/netron) backend for Relay accomplish this? Or are you interested in visualizations besides the computation graph?

Laurawly · December 30, 2019, 5:02pm

Nvidia recently published an interesting post demonstrating a fine-tuning method for int4 that does not retrain the original model and has only ~1% accuracy loss for ResNet-50. Their results could be a competitive comparison for TVM 4-bit quantization.

If we assume that we already have high-accuracy int4 models, we should also focus on int4 tensor core optimization for model inference (mostly for convolution) to increase inference speed, which other DL acceleration libraries don’t support yet.

Laurawly · December 30, 2019, 10:50pm

Making loop partition better is important for dynamic shape scheduling. If we have some fully optimized small kernels (gemm, conv), ideally accelerated by tensor cores, then when we receive a larger, dynamically sized workload, loop partition should be able to decompose it to make use of the fully optimized small kernels without introducing many if conditions (ideally at most one, for the remainder computation). In this way, the default performance should be fairly acceptable for dynamic shape scheduling without further tuning.
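
As a rough illustration of the decomposition described above, the sketch below splits a runtime-sized dot product into full tiles plus a single guarded tail. `TILE`, `dot_tile`, and `dot_dynamic` are hypothetical names, and the fixed-size kernel is only a stand-in for a tuned gemm/conv kernel; this is not TVM's loop partition pass.

```python
TILE = 4  # hypothetical size of the pre-optimized kernel


def dot_tile(a, b, start):
    """Stand-in for a fully optimized fixed-size kernel: no bounds checks inside."""
    return sum(a[start + i] * b[start + i] for i in range(TILE))


def dot_dynamic(a, b):
    """Dot product over a runtime length, partitioned into full tiles plus one tail."""
    n = len(a)
    full_tiles, remainder = divmod(n, TILE)
    total = 0
    # Main region: every iteration covers a complete tile, so no per-element guard.
    for t in range(full_tiles):
        total += dot_tile(a, b, t * TILE)
    # Tail region: the single condition that handles the leftover elements.
    if remainder:
        start = full_tiles * TILE
        total += sum(a[start + i] * b[start + i] for i in range(remainder))
    return total
```

The point is that the only branch sits outside the hot loop, so the main region can reuse the fixed-size kernel unchanged.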

adb · December 30, 2019, 7:47pm

Tensorization of composition ops

Right now, for example, we cannot tensorize for an accelerator with inner loops that are a composition of gemv+bias+act. This makes CodeGen difficult for many accelerators.
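
For reference, the composite inner computation in question could be written as below. This is only a plain-Python statement of the semantics such an instruction would have, assuming ReLU as the activation; the name `gemv_bias_relu` is hypothetical and no TVM tensorize API is used here.

```python
def gemv_bias_relu(weights, x, bias):
    """Reference semantics of one composite instruction: relu(W @ x + b)."""
    out = []
    for row, b in zip(weights, bias):
        acc = sum(w * xi for w, xi in zip(row, x))  # gemv: one output row
        acc += b                                    # bias add
        out.append(max(acc, 0))                     # activation (ReLU assumed)
    return out
```

An accelerator that executes all three steps as one instruction needs the whole loop body matched as a unit, rather than the gemv alone.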

tkclimb · January 17, 2020, 4:45am

I think Nvidia’s post needs the original dataset for its last fine-tuning step, but Intel’s research doesn’t need it; it is a pure post-quantization method with only some stochastic data processing, while its final accuracy loss (< 3%) on the same model is slightly higher than Nvidia’s (< 1%).
(Intel’s method doesn’t use exactly 4 bits. It sometimes uses more, like 5, because of their per-channel bit allocation technique, so I know it’s not a completely fair comparison.)

Therefore, I guess post-quantization without the training dataset is by nature a better fit to add to TVM. What do you think?

tkclimb · January 17, 2020, 4:51am

Also, if TVM makes the int4 type computable somehow, it would need to be simulated in software, since it’s not a native CPU primitive type. Adding this implementation may require some tedious handling such as promotion, truncation, overflow detection, etc. In this case, I think implementing arbitrary-precision type handling is more reasonable than an int4-specific one. It would make it easier to support other precisions and more special hardware backends.
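
A minimal sketch of what bit-width-generic handling might look like, with the width as a parameter instead of int4 hardcoded. The helper names (`int_range`, `wrap`, `add_checked`) are hypothetical, and Python's unbounded integers play the role of the promoted type.

```python
def int_range(bits):
    """Smallest and largest values of a signed two's-complement integer."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def wrap(value, bits):
    """Truncate to `bits` bits and sign-extend (two's-complement wraparound)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def add_checked(a, b, bits):
    """Add two signed `bits`-wide values; return (wrapped result, overflow flag)."""
    lo, hi = int_range(bits)
    exact = a + b  # computed in a wider type, i.e. after promotion
    return wrap(exact, bits), not lo <= exact <= hi
```

For example, `add_checked(7, 1, 4)` wraps to -8 and reports overflow, and the same code covers a 5-bit channel by passing `bits=5`.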

masahi · January 17, 2020, 5:43am

Intel’s method also requires a calibration dataset.

tkclimb · January 17, 2020, 7:02am

I know the calibration needs a dataset, but it’s not the one that was used to train the model. I think this is like the quantization TVM currently supports (KL divergence mode).