[RFC] Discuss TVM v0.7 Roadmap

It’ll be very nice to have all these new features, especially “Unified runtime for heterogeneous execution” and “better documentation for developers”.

In addition, I think we are also getting close to having:

  • External code generator
  • End-to-end inference with Chisel VTA (with support for multiple drivers)

Is there any interest in 4-bit quantization? At NeurIPS 2019, Intel published an interesting paper demonstrating high-accuracy inference with 4 bits. I think their method fits well with TVM's automatic quantization support (the workflow is similar to the KL-divergence-minimization-based approach).

Besides the quantization method itself, we also need to consider how to handle packing and unpacking of 4-bit values in codegen and the runtime.
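To make the packing/unpacking question concrete, here is a minimal NumPy sketch (not TVM code) of one possible layout: two signed 4-bit values per byte, low nibble first. The layout and helper names are just assumptions for illustration.

```python
import numpy as np

def pack_int4(values):
    """Pack an even-length array of int4 values (-8..7) into bytes, low nibble first."""
    v = np.asarray(values, dtype=np.int8)
    assert v.size % 2 == 0
    nibbles = (v & 0xF).astype(np.uint8)             # two's-complement nibbles
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int4 values from packed bytes."""
    p = np.asarray(packed, dtype=np.uint8)
    lo = (p & 0xF).astype(np.int16)
    hi = (p >> 4).astype(np.int16)
    lo = np.where(lo > 7, lo - 16, lo)               # sign-extend the nibbles
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(p.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

x = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(x)), x)
```

The codegen side then has to emit the same shift/mask sequence (or a hardware unpack instruction) inside kernels, and the runtime has to agree on the layout when it allocates and copies int4 tensors.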

4 Likes

A few things that we could add to the v0.7 roadmap:

  • Complete VM functionality
  • Improve VM performance
  • Add tutorial for VM
  • Reduce AutoTVM tuning time (could be related to auto scheduling)
  • Auto tensorization
  • half2 data type support
2 Likes

It would be nice if we could have Relay visualization tools. They would help developers see the graph-level optimizations.
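As a possible starting point, here is a rough sketch of what such a tool could build on: `relay.analysis.post_order_visit` already lets us walk a Relay expression and dump, for example, a Graphviz DOT description of the operator graph. This is only an illustration, not a proposed design.

```python
import tvm
from tvm import relay

def relay_to_dot(expr):
    """Emit a (very rough) Graphviz DOT description of a Relay expression."""
    node_ids = {}                       # Relay node -> DOT node id
    lines = ["digraph relay {"]

    def node_id(node):
        if node not in node_ids:
            node_ids[node] = len(node_ids)
        return node_ids[node]

    def visit(node):
        if isinstance(node, relay.Call):
            label = getattr(node.op, "name", "call")
            lines.append('  n%d [label="%s"];' % (node_id(node), label))
            for arg in node.args:
                lines.append("  n%d -> n%d;" % (node_id(arg), node_id(node)))
        elif isinstance(node, relay.Var):
            lines.append('  n%d [label="%s", shape=box];' % (node_id(node), node.name_hint))

    relay.analysis.post_order_visit(expr, visit)
    lines.append("}")
    return "\n".join(lines)

# Tiny example graph.
x = relay.var("x", shape=(1, 3, 224, 224))
w = relay.var("w", shape=(16, 3, 3, 3))
print(relay_to_dot(relay.nn.relu(relay.nn.conv2d(x, w))))
```

Dumping the graph before and after each pass would then make the effect of the graph-level optimizations visible.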

3 Likes

What is the intended release cycle? Are we planning 0.7 in the March/April 2020 time frame?

A few things to consider.

  • Command-line utilities to use TVM as an ahead-of-time compiler. We are already starting a discuss post on that (a rough sketch of such a wrapper follows this list).
  • Improving test and benchmark infrastructure.
  • Testing and benchmarking on remote targets.
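To make the first bullet a bit more concrete, below is a rough sketch of the kind of thin command-line wrapper I have in mind, built on the existing Python APIs (`relay.frontend.from_onnx` and `relay.build`). The script name, flags, and defaults are purely hypothetical and only meant for discussion.

```python
#!/usr/bin/env python3
"""Hypothetical `tvm-compile` sketch: compile an ONNX model into a shared library."""
import argparse

import onnx
import tvm
from tvm import relay

def main():
    parser = argparse.ArgumentParser(description="Compile an ONNX model with TVM")
    parser.add_argument("model", help="path to an .onnx file")
    parser.add_argument("--target", default="llvm", help="TVM target string")
    parser.add_argument("--output", default="model.so", help="output shared library")
    args = parser.parse_args()

    mod, params = relay.frontend.from_onnx(onnx.load(args.model))
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=args.target, params=params)
    lib.export_library(args.output)     # ahead-of-time artifact to deploy
    print("wrote", args.output)

if __name__ == "__main__":
    main()
```

A real utility would of course also need frontend selection, cross-compilation flags, and tuning-log input, which is what the discuss post should cover.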

Could configurable fusion fall under this category? There was some discussion/interest in a previous thread. Since fusion rules are hardcoded at the moment, it would be nice to have an easy way to set fusion rules for different devices. In a heterogeneous environment or an SoC with multiple kinds of accelerators, this could ease the pain of rewriting fusion passes for every device.
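Purely as a strawman for that discussion: the kind of thing I imagine is a per-target fusion-rule table that a (hypothetical) configurable fusion pass could consume instead of today's hardcoded pattern logic. None of the names below exist in TVM; this only illustrates the shape of the configuration.

```python
# Hypothetical and illustrative only: which producer/consumer pairs a target
# allows the fusion pass to merge, plus a cap on fused-group size.
FUSION_RULES = {
    "generic_gpu": {
        "fuse": [("conv2d", "elemwise"), ("dense", "elemwise"), ("elemwise", "elemwise")],
        "max_depth": 256,
    },
    "my_npu": {  # an accelerator that can only fuse conv2d followed by relu
        "fuse": [("conv2d", "relu")],
        "max_depth": 2,
    },
}

def can_fuse(target, producer, consumer):
    """Check whether the (hypothetical) rule table allows fusing two ops."""
    rules = FUSION_RULES.get(target, FUSION_RULES["generic_gpu"])
    return (producer, consumer) in rules["fuse"]

assert can_fuse("my_npu", "conv2d", "relu")
assert not can_fuse("my_npu", "dense", "elemwise")
```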

1 Like

It would be nice to spend some time on performance improvements (on one specific platform, compared with the framework or library provided by the hardware vendor) and submit results to MLPerf (https://mlperf.org/inference-results/) in the v0.7 release cycle.

3 Likes

It would be very helpful for TVM’s adoption if relay.frontend.from_tensorflow could correctly convert models from the TensorFlow object_detection library.
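For reference, the intended import path looks roughly like the sketch below; the frozen-graph filename, input shape, and output tensor names follow the usual object_detection export convention but are assumptions that vary per model. These graphs (control flow, dynamic shapes, NMS ops) are exactly where the converter currently struggles.

```python
import tensorflow as tf
from tvm import relay

# Hypothetical frozen graph exported by the TF object_detection API.
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Input/output names per the object_detection export convention; adjust per model.
mod, params = relay.frontend.from_tensorflow(
    graph_def,
    shape={"image_tensor": (1, 600, 600, 3)},
    outputs=["detection_boxes", "detection_scores", "detection_classes"],
)
print(mod["main"])
```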

It would be super helpful to see a working version of the auto scheduler, as many have mentioned above.

1 Like

Would adding a Netron (https://github.com/lutzroeder/netron) backend for Relay accomplish this? Or are you interested in visualizations besides the computation graph?

2 Likes

NVIDIA recently published an interesting post demonstrating an int4 fine-tuning method that does not retrain the original model and loses only ~1% accuracy on ResNet-50. Their results could serve as a competitive baseline for TVM's 4-bit quantization.

If we assume that we have already obtained high-accuracy int4 models, we should also focus on int4 Tensor Core optimization for model inference (mostly for convolution) to increase inference speed, which other DL acceleration libraries don't support yet.

3 Likes

Making loop partition better is important for dynamic-shape scheduling. If we have some fully optimized small kernels (GEMM, conv), ideally accelerated by Tensor Cores, then when we receive a larger dynamic kernel size, loop partition should be able to decompose it so that it makes use of the fully optimized small kernels without introducing many if conditions (ideally at most one, for the remainder computation). That way, the default performance for dynamic shapes should be fairly acceptable without further tuning.
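To illustrate the decomposition outside of TIR, here is a plain NumPy sketch of what I mean: a dynamic problem size gets split into full tiles handled by a fixed, fully optimized kernel plus at most one remainder, which is what a better loop partition should produce automatically. The tile size and the kernel here are stand-ins.

```python
import numpy as np

TILE = 64  # fixed size that the hand-optimized / tensor-core kernel expects

def fast_fixed_kernel(a, b):
    """Stand-in for a fully optimized TILE x TILE GEMM kernel."""
    return a @ b

def dynamic_matmul(a, b):
    """Decompose a dynamic M dimension into full tiles plus at most one remainder."""
    m = a.shape[0]
    out = np.empty((m, b.shape[1]), dtype=a.dtype)
    full = (m // TILE) * TILE
    for i in range(0, full, TILE):      # no `if` inside this hot loop
        out[i:i + TILE] = fast_fixed_kernel(a[i:i + TILE], b)
    if full < m:                        # the single remainder branch
        out[full:] = a[full:] @ b
    return out

a = np.random.rand(150, 64).astype("float32")
b = np.random.rand(64, 32).astype("float32")
assert np.allclose(dynamic_matmul(a, b), a @ b, atol=1e-4)
```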

3 Likes
  • Tensorization of composite ops

Right now, for example, we cannot tensorize for an accelerator whose inner loops are a composition of GEMV + bias + activation. This makes code generation difficult for many accelerators.
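As a concrete instance (a sketch with made-up sizes), the compute below is the kind of composite inner loop we would like to hand to `te.decl_tensor_intrin` / `tensorize` as a single unit, whereas today tensorize effectively matches one compute stage at a time:

```python
import tvm
from tvm import te

# Hypothetical accelerator macro-op: out = relu(W @ x + b) on a 64x256 tile.
N, K = 64, 256
x = te.placeholder((K,), name="x", dtype="float32")
W = te.placeholder((N, K), name="W", dtype="float32")
b = te.placeholder((N,), name="b", dtype="float32")

k = te.reduce_axis((0, K), name="k")
gemv = te.compute((N,), lambda i: te.sum(W[i, k] * x[k], axis=k), name="gemv")
out = te.compute(
    (N,), lambda i: te.max(gemv[i] + b[i], tvm.tir.const(0, "float32")), name="relu"
)

s = te.create_schedule(out.op)
print(tvm.lower(s, [x, W, b, out], simple_mode=True))
```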

I think NVIDIA's post needs the original dataset for its final fine-tuning step, but Intel's research doesn't; it is a pure post-quantization method with only some stochastic data processing, while the final accuracy loss on the same model (< 3%) is slightly larger than NVIDIA's (< 1%).
(Intel's method doesn't use just 4 bits; it sometimes uses more, e.g., 5, because of their per-channel bit-allocation technique, so I know it's not a completely fair comparison.)

Therefore, I think post-training quantization that does not need the training dataset is, by its nature, the more appropriate approach to add to TVM. What do you think?

Also, if TVM somehow makes the int4 type computable, it would need to be simulated in software, since it is not a normal CPU primitive type. Adding this may require some tedious handling such as promotion, truncation, overflow detection, etc. In that case, I think it is more reasonable to implement arbitrary-precision type handling than something int4-specific; it would make it easier to support other precisions and more specialized hardware backends.
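A minimal sketch of what that software simulation involves for a hypothetical intN type: keep values in a wider container, promote for arithmetic, then narrow the result back to N bits with either wraparound or saturation. The helper names here are made up.

```python
def wrap_int(value, bits):
    """Truncate to an N-bit signed integer with two's-complement wraparound."""
    mask = (1 << bits) - 1
    v = value & mask
    return v - (1 << bits) if v >= (1 << (bits - 1)) else v

def sat_int(value, bits):
    """Truncate to an N-bit signed integer with saturation instead of wraparound."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def add_intn(a, b, bits=4, saturate=False):
    """Add two N-bit values: promote to a plain int, add, then narrow back."""
    narrow = sat_int if saturate else wrap_int
    return narrow(a + b, bits)

assert add_intn(7, 1) == -8                # int4 overflow wraps around
assert add_intn(7, 1, saturate=True) == 7  # or saturates, depending on the policy
```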

Intel’s method also requires a calibration dataset.

I know the calibration needs a dataset, but it is not the one that was used to train the model. I think this is similar to the quantization TVM currently supports (KL-divergence mode).

We can consider supporting the bfloat16 data type natively. More and more hardware will support this type starting this year. Because it is not a standard type in general-purpose programming languages (C++ / NumPy) or traditional compilers (GCC / LLVM), developers face a challenge writing kernels with it. It would be a huge advantage if TVM could generate efficient bfloat16 kernels, and TVM is well positioned to do this.
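For reference, the core conversion is simple enough to sketch in NumPy: bfloat16 is just the top 16 bits of an IEEE float32, so a kernel (or a legalization pass) mainly needs this round-to-nearest-even narrowing plus the zero-fill widening back to float32. This only illustrates the format; it is not a proposed TVM API.

```python
import numpy as np

def float32_to_bf16_bits(x):
    """Round float32 down to bfloat16 (stored as uint16) with round-to-nearest-even."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding = 0x7FFF + ((bits >> 16) & 1)      # break ties toward even
    return ((bits + rounding) >> 16).astype(np.uint16)

def bf16_bits_to_float32(b):
    """Widen bfloat16 (as uint16) back to float32 by zero-filling the low mantissa bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159, -2.5e7], dtype=np.float32)
y = bf16_bits_to_float32(float32_to_bf16_bits(x))
print(x, "->", y)   # y agrees with x to roughly 2-3 significant digits (8-bit mantissa)
```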

1 Like

Thanks for all the suggestions so far; we will summarize the roadmap near the end of the month. The current goal is to aim for a three-month timeframe (April).

Besides the set of features we want to see, we would love to know what our community members want to work on, whether new proposals or some of these existing items. It would help us coordinate and estimate the feasibility of these items.

I will mainly work on the automated quantization project over the next three months; see the RFC for details.

In summary, I hope that with this project we can provide an easy-to-use quantization tool that can be quickly adapted to different models and different hardware.

3 Likes