[RFC] Discuss TVM v0.7 Roadmap

It’ll be very nice to have all these new features, especially “Unified runtime for heterogeneous execution” and “better documentation for developers”.

In addition, I think we are also getting close to having:

  • External code generator
  • End-to-end inference with Chisel VTA (with support for multiple drivers)

Is there any interest in 4-bit quantization? At NeurIPS 2019, Intel published an interesting paper demonstrating high-accuracy inference with 4 bits. I think their method fits well with TVM's automatic quantization support (the workflow is similar to the KL-divergence-minimization-based approach).

Besides the quantization method itself, we also need to consider how to handle packing and unpacking of 4-bit values in codegen and the runtime.
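To make the packing/unpacking question concrete, here is a minimal NumPy sketch (not TVM code) of one possible layout: two signed 4-bit values per byte, low nibble first. The layout and helper names are just assumptions for illustration.

```python
import numpy as np

def pack_int4(values):
    """Pack an even-length array of int4 values (-8..7) into bytes, low nibble first."""
    v = np.asarray(values, dtype=np.int8)
    assert v.size % 2 == 0
    nibbles = (v & 0xF).astype(np.uint8)             # two's-complement nibbles
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int4 values from packed bytes."""
    p = np.asarray(packed, dtype=np.uint8)
    lo = (p & 0xF).astype(np.int16)
    hi = (p >> 4).astype(np.int16)
    lo = np.where(lo > 7, lo - 16, lo)               # sign-extend the nibbles
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(p.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

x = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(x)), x)
```

The codegen side then has to emit the same shift/mask sequence (or a hardware unpack instruction) inside kernels, and the runtime has to agree on the layout when it allocates and copies int4 tensors.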

4 Likes

A few things that we could add to the v0.7 roadmap:

  • Complete VM functionality
  • Improve VM performance
  • Add tutorial for VM
  • Reduce AutoTVM tuning time (could be related to auto scheduling)
  • Auto tensorization
  • half2 data type support
2 Likes

It would be nice if we could have Relay visualization tools. They would help developers see the graph-level optimizations.
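As a possible starting point, here is a rough sketch of what such a tool could build on: `relay.analysis.post_order_visit` already lets us walk a Relay expression and dump, for example, a Graphviz DOT description of the operator graph. This is only an illustration, not a proposed design.

```python
import tvm
from tvm import relay

def relay_to_dot(expr):
    """Emit a (very rough) Graphviz DOT description of a Relay expression."""
    node_ids = {}                       # Relay node -> DOT node id
    lines = ["digraph relay {"]

    def node_id(node):
        if node not in node_ids:
            node_ids[node] = len(node_ids)
        return node_ids[node]

    def visit(node):
        if isinstance(node, relay.Call):
            label = getattr(node.op, "name", "call")
            lines.append('  n%d [label="%s"];' % (node_id(node), label))
            for arg in node.args:
                lines.append("  n%d -> n%d;" % (node_id(arg), node_id(node)))
        elif isinstance(node, relay.Var):
            lines.append('  n%d [label="%s", shape=box];' % (node_id(node), node.name_hint))

    relay.analysis.post_order_visit(expr, visit)
    lines.append("}")
    return "\n".join(lines)

# Tiny example graph.
x = relay.var("x", shape=(1, 3, 224, 224))
w = relay.var("w", shape=(16, 3, 3, 3))
print(relay_to_dot(relay.nn.relu(relay.nn.conv2d(x, w))))
```

Dumping the graph before and after each pass would then make the effect of the graph-level optimizations visible.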

3 Likes

What is the intended release cycle? Are we planning 0.7 in the March/April 2020 time frame?

A few things to consider.

  • Command-line utilities to use TVM as an ahead-of-time compiler. We are already starting a discuss post on that (a rough sketch of such a wrapper follows this list).
  • Improving test and benchmark infrastructure.
  • Testing and benchmarking on remote targets.
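To make the first bullet a bit more concrete, below is a rough sketch of the kind of thin command-line wrapper I have in mind, built on the existing Python APIs (`relay.frontend.from_onnx` and `relay.build`). The script name, flags, and defaults are purely hypothetical and only meant for discussion.

```python
#!/usr/bin/env python3
"""Hypothetical `tvm-compile` sketch: compile an ONNX model into a shared library."""
import argparse

import onnx
import tvm
from tvm import relay

def main():
    parser = argparse.ArgumentParser(description="Compile an ONNX model with TVM")
    parser.add_argument("model", help="path to an .onnx file")
    parser.add_argument("--target", default="llvm", help="TVM target string")
    parser.add_argument("--output", default="model.so", help="output shared library")
    args = parser.parse_args()

    mod, params = relay.frontend.from_onnx(onnx.load(args.model))
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=args.target, params=params)
    lib.export_library(args.output)     # ahead-of-time artifact to deploy
    print("wrote", args.output)

if __name__ == "__main__":
    main()
```

A real utility would of course also need frontend selection, cross-compilation flags, and tuning-log input, which is what the discuss post should cover.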

Could configurable fusion fall under this category? There was some discussion/interest in a previous thread. Since fusion rules are hardcoded at the moment, it would be nice to have an easy way to set fusion rules for different devices. In a heterogeneous environment or an SoC with multiple kinds of accelerators, this could ease the pain of rewriting fusion passes for every device.
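Purely as a strawman for that discussion: the kind of thing I imagine is a per-target fusion-rule table that a (hypothetical) configurable fusion pass could consume instead of today's hardcoded pattern logic. None of the names below exist in TVM; this only illustrates the shape of the configuration.

```python
# Hypothetical and illustrative only: which producer/consumer pairs a target
# allows the fusion pass to merge, plus a cap on fused-group size.
FUSION_RULES = {
    "generic_gpu": {
        "fuse": [("conv2d", "elemwise"), ("dense", "elemwise"), ("elemwise", "elemwise")],
        "max_depth": 256,
    },
    "my_npu": {  # an accelerator that can only fuse conv2d followed by relu
        "fuse": [("conv2d", "relu")],
        "max_depth": 2,
    },
}

def can_fuse(target, producer, consumer):
    """Check whether the (hypothetical) rule table allows fusing two ops."""
    rules = FUSION_RULES.get(target, FUSION_RULES["generic_gpu"])
    return (producer, consumer) in rules["fuse"]

assert can_fuse("my_npu", "conv2d", "relu")
assert not can_fuse("my_npu", "dense", "elemwise")
```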

1 Like

It would be nice to spend some time on performance improvements (on one specific platform, compared with the framework or library provided by the hardware vendor) and submit results to MLPerf (https://mlperf.org/inference-results/) in the v0.7 release cycle.

3 Likes

It would be very helpful for TVM’s adoption if relay.frontend.from_tensorflow could correctly convert models from the TensorFlow object_detection library.
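For reference, the intended import path looks roughly like the sketch below; the frozen-graph filename, input shape, and output tensor names follow the usual object_detection export convention but are assumptions that vary per model. These graphs (control flow, dynamic shapes, NMS ops) are exactly where the converter currently struggles.

```python
import tensorflow as tf
from tvm import relay

# Hypothetical frozen graph exported by the TF object_detection API.
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Input/output names per the object_detection export convention; adjust per model.
mod, params = relay.frontend.from_tensorflow(
    graph_def,
    shape={"image_tensor": (1, 600, 600, 3)},
    outputs=["detection_boxes", "detection_scores", "detection_classes"],
)
print(mod["main"])
```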

It would be super helpful to see a working version of the auto scheduler, as many have mentioned above.

1 Like

Would adding a Netron (https://github.com/lutzroeder/netron) backend for Relay accomplish this? Or are you interested in visualizations besides the computation graph?

2 Likes

NVIDIA recently published an interesting post demonstrating an int4 fine-tuning method that does not retrain the original model and loses only ~1% accuracy on ResNet-50. Their results could serve as a competitive baseline for TVM's 4-bit quantization.

If we assume that we have already obtained high-accuracy int4 models, we should also focus on int4 Tensor Core optimization for model inference (mostly for convolution) to increase inference speed, which other DL acceleration libraries don't support yet.

3 Likes

Making loop partition better is important for dynamic-shape scheduling. If we have some fully optimized small kernels (GEMM, conv), ideally accelerated by Tensor Cores, then when we receive a larger dynamic kernel size, loop partition should be able to decompose it so that it makes use of the fully optimized small kernels without introducing many if conditions (ideally at most one, for the remainder computation). That way, the default performance for dynamic shapes should be fairly acceptable without further tuning.
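To illustrate the decomposition outside of TIR, here is a plain NumPy sketch of what I mean: a dynamic problem size gets split into full tiles handled by a fixed, fully optimized kernel plus at most one remainder, which is what a better loop partition should produce automatically. The tile size and the kernel here are stand-ins.

```python
import numpy as np

TILE = 64  # fixed size that the hand-optimized / tensor-core kernel expects

def fast_fixed_kernel(a, b):
    """Stand-in for a fully optimized TILE x TILE GEMM kernel."""
    return a @ b

def dynamic_matmul(a, b):
    """Decompose a dynamic M dimension into full tiles plus at most one remainder."""
    m = a.shape[0]
    out = np.empty((m, b.shape[1]), dtype=a.dtype)
    full = (m // TILE) * TILE
    for i in range(0, full, TILE):      # no `if` inside this hot loop
        out[i:i + TILE] = fast_fixed_kernel(a[i:i + TILE], b)
    if full < m:                        # the single remainder branch
        out[full:] = a[full:] @ b
    return out

a = np.random.rand(150, 64).astype("float32")
b = np.random.rand(64, 32).astype("float32")
assert np.allclose(dynamic_matmul(a, b), a @ b, atol=1e-4)
```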

3 Likes
  • Tensorization of composite ops

Right now, for example, we cannot tensorize for an accelerator whose inner loops are a composition of GEMV + bias + activation. This makes code generation difficult for many accelerators.
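As a concrete instance (a sketch with made-up sizes), the compute below is the kind of composite inner loop we would like to hand to `te.decl_tensor_intrin` / `tensorize` as a single unit, whereas today tensorize effectively matches one compute stage at a time:

```python
import tvm
from tvm import te

# Hypothetical accelerator macro-op: out = relu(W @ x + b) on a 64x256 tile.
N, K = 64, 256
x = te.placeholder((K,), name="x", dtype="float32")
W = te.placeholder((N, K), name="W", dtype="float32")
b = te.placeholder((N,), name="b", dtype="float32")

k = te.reduce_axis((0, K), name="k")
gemv = te.compute((N,), lambda i: te.sum(W[i, k] * x[k], axis=k), name="gemv")
out = te.compute(
    (N,), lambda i: te.max(gemv[i] + b[i], tvm.tir.const(0, "float32")), name="relu"
)

s = te.create_schedule(out.op)
print(tvm.lower(s, [x, W, b, out], simple_mode=True))
```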

I think NVIDIA's post needs the original dataset for its final fine-tuning step, but Intel's research doesn't; it is a pure post-quantization method with only some stochastic data processing, while the final accuracy loss on the same model (< 3%) is slightly larger than NVIDIA's (< 1%).
(Intel's method doesn't use just 4 bits; it sometimes uses more, e.g., 5, because of their per-channel bit-allocation technique, so I know it's not a completely fair comparison.)

Therefore, I think post-training quantization that does not need the training dataset is, by its nature, the more appropriate approach to add to TVM. What do you think?

Also, if TVM somehow makes the int4 type computable, it would need to be simulated in software, since it is not a normal CPU primitive type. Adding this may require some tedious handling such as promotion, truncation, overflow detection, etc. In that case, I think it is more reasonable to implement arbitrary-precision type handling than something int4-specific; it would make it easier to support other precisions and more specialized hardware backends.
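A minimal sketch of what that software simulation involves for a hypothetical intN type: keep values in a wider container, promote for arithmetic, then narrow the result back to N bits with either wraparound or saturation. The helper names here are made up.

```python
def wrap_int(value, bits):
    """Truncate to an N-bit signed integer with two's-complement wraparound."""
    mask = (1 << bits) - 1
    v = value & mask
    return v - (1 << bits) if v >= (1 << (bits - 1)) else v

def sat_int(value, bits):
    """Truncate to an N-bit signed integer with saturation instead of wraparound."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def add_intn(a, b, bits=4, saturate=False):
    """Add two N-bit values: promote to a plain int, add, then narrow back."""
    narrow = sat_int if saturate else wrap_int
    return narrow(a + b, bits)

assert add_intn(7, 1) == -8                # int4 overflow wraps around
assert add_intn(7, 1, saturate=True) == 7  # or saturates, depending on the policy
```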

Intel’s method also requires a calibration dataset.

I know the calibration needs a dataset, but it is not the one that was used to train the model. I think this is similar to the quantization TVM currently supports (KL-divergence mode).

We can consider supporting the bfloat16 data type natively. More and more hardware will support this type starting this year. Because it is not a standard type in general-purpose programming languages (C++ / NumPy) or traditional compilers (GCC / LLVM), developers face a challenge writing kernels with it. It would be a huge advantage if TVM could generate efficient bfloat16 kernels, and TVM is well positioned to do this.
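For reference, the core conversion is simple enough to sketch in NumPy: bfloat16 is just the top 16 bits of an IEEE float32, so a kernel (or a legalization pass) mainly needs this round-to-nearest-even narrowing plus the zero-fill widening back to float32. This only illustrates the format; it is not a proposed TVM API.

```python
import numpy as np

def float32_to_bf16_bits(x):
    """Round float32 down to bfloat16 (stored as uint16) with round-to-nearest-even."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding = 0x7FFF + ((bits >> 16) & 1)      # break ties toward even
    return ((bits + rounding) >> 16).astype(np.uint16)

def bf16_bits_to_float32(b):
    """Widen bfloat16 (as uint16) back to float32 by zero-filling the low mantissa bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159, -2.5e7], dtype=np.float32)
y = bf16_bits_to_float32(float32_to_bf16_bits(x))
print(x, "->", y)   # y agrees with x to roughly 2-3 significant digits (8-bit mantissa)
```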

1 Like

Thanks for all the suggestions so far; we will summarize the roadmap near the end of the month. The current goal is to aim for a three-month timeframe (April).

Besides the set of features we want to see, we would love to know what our community members want to work on, whether new proposals or some of these existing items. It would help us coordinate and estimate the feasibility of these items.

I will mainly work on the automated quantization project over the next three months; see the RFC for details.

In summary, I hope that with this project we can provide an easy-to-use quantization tool that can be quickly adapted to different models and different hardware.

3 Likes