[RFC] Discuss TVM v0.7 Roadmap

Thanks to everyone’s effort, we have successfully released v0.6. In keeping with the Apache way of development, we want broader involvement and input from community developers on what we want to push for in the v0.7 release cycle. Previously, a roadmap thread was created on GitHub. In this cycle we want to try something a bit different by having a discussion thread that summarizes the directions we would like to see in the v0.7 release cycle. The final roadmap will be a summary based on the discussion.

In terms of timeline, let us set the expectation for what we want to see in the next three months. Input on longer-term projects is also welcome. Here are some example directions that we saw during the TVM conference:

  • More transparency in terms of roadmap and directions – this thread
  • More auto scheduling
  • Enhanced support for high-level graph rewriting for accelerators
  • Better documentation for developers
  • Unified runtime for heterogeneous execution
  • End to end uTVM
  • Unified IR improvements
  • More dynamic model support

Please share your thoughts. In particular:

  • What you would like to see in the roadmap
  • What you would like to contribute to the community

Given that many of us are going on holiday break, we will keep this discussion thread open until mid-January.

It’ll be very nice to have all these new features, especially “Unified runtime for heterogeneous execution” and “better documentation for developers”.

In addition, I think we are also getting close to having

  • External code generator
  • End to end inference with Chisel VTA (with support for multiple drivers)

Is there any interest in 4-bit quantization? At NeurIPS 2019, Intel published an interesting paper that demonstrates high-accuracy inference in 4 bits. I think their method fits well with TVM’s automatic quantization support (the workflow is similar to the KL divergence minimization based approach).

Besides the quantization method itself, we also need to consider how to handle packing and unpacking of 4 bit values in codegen and runtime.
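To make the packing question concrete, here is a minimal sketch of one possible layout: two signed 4-bit values per byte, low nibble first. The function names `pack_int4` and `unpack_int4` and the nibble order are assumptions for illustration, not an existing TVM API or a layout any backend is known to require.

```python
import numpy as np

def pack_int4(values):
    """Pack signed 4-bit ints in [-8, 7] into bytes, two per byte (low nibble first)."""
    v = np.asarray(values, dtype=np.int8)
    assert v.min() >= -8 and v.max() <= 7, "values must fit in 4 signed bits"
    if v.size % 2:
        # Pad with a zero so every byte holds exactly two values.
        v = np.append(v, np.int8(0))
    nib = (v & 0x0F).astype(np.uint8)      # two's complement low nibble
    return nib[0::2] | (nib[1::2] << 4)

def unpack_int4(packed, count):
    """Inverse of pack_int4; `count` drops the optional padding value."""
    p = np.asarray(packed, dtype=np.uint8)
    nib = np.empty(p.size * 2, dtype=np.uint8)
    nib[0::2] = p & 0x0F
    nib[1::2] = p >> 4
    out = nib.astype(np.int8)
    out[out > 7] -= 16                     # sign-extend: nibbles 8..15 mean -8..-1
    return out[:count]

vals = [-8, -1, 0, 7, 3]
packed = pack_int4(vals)
restored = unpack_int4(packed, len(vals))
```

A real codegen would do the same shifts and masks inside the generated kernel, and the runtime would need to agree on the nibble order and on how odd-length buffers are padded.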

A few things that we could add to v0.7 roadmap

  • Complete VM functionality
  • Improve VM performance
  • Add tutorial for VM
  • Reduce AutoTVM tuning time (could be related to auto scheduling)
  • Auto tensorization
  • half2 data type support

It would be nice if we could have Relay visualization tools. They would help developers see the graph-level optimizations.

What is the expected release cycle? Are we planning 0.7 in the March/April 2020 time frame?

A few things to consider.

  • Command line utilities to use TVM as an ahead-of-time compiler. We are already starting a discuss post on that.
  • Improving test and benchmark infrastructure.
  • Testing and benchmarking on remote targets.

Could configurable fusion fall under this category? There was some discussion and interest in a previous thread. Since fusion rules are hardcoded at the moment, it would be nice to have an easy way to set fusion rules for different devices. In a heterogeneous environment or an SoC with multiple kinds of accelerators, this could ease the pain of rewriting fusion passes for every device.

It would be nice to spend some time in the v0.7 release cycle on performance improvement (on one specific platform, compared with the frameworks or libraries provided by hardware vendors) and to submit results to MLPerf (https://mlperf.org/inference-results/?spm=ata.13261165.0.0.41ac3a78f1Uzk9).

It would be very helpful for TVM’s applications if relay.frontend.from_tensorflow could correctly convert models from the TensorFlow object_detection library.

It would be super helpful to see a working version of the auto scheduler, as many have mentioned above.

Would adding a Netron (https://github.com/lutzroeder/netron) backend for Relay accomplish this? Or are you interested in visualizations besides the computation graph?

Nvidia recently published an interesting post demonstrating an int4 fine-tuning method that does not retrain the original model and has only ~1% accuracy loss on ResNet-50. Their results could be a competitive comparison for TVM 4-bit quantization.

If we assume that we already have high-accuracy int4 models, we should also focus on int4 tensor core optimization for model inference (mostly for convolution) to increase inference speed, which other DL acceleration libraries don’t support yet.

Making loop partition better is important for dynamic shape scheduling. Suppose we have some fully optimized small kernels (gemm, conv), ideally accelerated by tensor cores. When we receive a larger dynamic kernel size, loop partition should be able to decompose it to make use of the fully optimized small kernels without introducing many if conditions (ideally at most one, for the remainder computation). In this way, the default performance should be fairly acceptable for dynamic shape scheduling without further tuning.
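As a rough illustration of the decomposition described here, the sketch below splits a dynamically sized gemm into calls to a fixed-size kernel plus a single remainder branch. `TILE`, `gemm_tile`, `gemm_generic`, and `gemm_dynamic` are hypothetical names, and the NumPy matmul only stands in for a tuned kernel. This is not how TVM's loop partition pass is implemented.

```python
import numpy as np

TILE = 16  # hypothetical row count handled by the pre-optimized kernel

def gemm_tile(a_tile, b):
    """Stand-in for a fully optimized fixed-size kernel (exactly TILE rows)."""
    assert a_tile.shape[0] == TILE
    return a_tile @ b

def gemm_generic(a_rows, b):
    """Slower fallback used only for the remainder rows (fewer than TILE)."""
    return a_rows @ b

def gemm_dynamic(a, b):
    """Handle any row count m with a check-free main loop and one tail branch."""
    m = a.shape[0]
    out = np.empty((m, b.shape[1]), dtype=np.result_type(a, b))
    full = m - m % TILE                    # rows covered by whole tiles
    for i in range(0, full, TILE):         # main loop: no per-iteration bounds checks
        out[i:i + TILE] = gemm_tile(a[i:i + TILE], b)
    if full < m:                           # the single remainder condition
        out[full:] = gemm_generic(a[full:], b)
    return out

a = np.arange(37 * 4, dtype=np.float64).reshape(37, 4)
b = np.arange(4 * 3, dtype=np.float64).reshape(4, 3)
result = gemm_dynamic(a, b)
```

With 37 rows, the main loop runs twice on full tiles and the tail branch handles the last 5 rows.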

  • Tensorization of composition ops

Right now, for example, we cannot tensorize for an accelerator with inner loops that are a composition of gemv+bias+act. This makes CodeGen difficult for many accelerators.
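For reference, the composed pattern looks like the sketch below. ReLU is assumed as the activation, and `gemv_bias_relu` is a hypothetical name for the semantics of a single fused accelerator instruction. The loop version shows the three stages that would have to be matched together as one unit.

```python
import numpy as np

def gemv_bias_relu(w, x, b):
    """Reference semantics of one fused gemv+bias+act instruction (hypothetical)."""
    return np.maximum(w @ x + b, 0.0)

def dense_layer_loops(w, x, b):
    """The same computation as an explicit loop nest."""
    rows, cols = w.shape
    out = np.empty(rows, dtype=np.float64)
    for i in range(rows):
        acc = 0.0
        for k in range(cols):
            acc += w[i, k] * x[k]          # gemv: reduction over k
        acc += b[i]                        # bias
        out[i] = max(acc, 0.0)             # activation (ReLU assumed)
    return out

w = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, -1.0]])
x = np.array([2.0, 3.0])
b = np.array([1.0, 0.0, 0.5])
fused = gemv_bias_relu(w, x, b)
looped = dense_layer_loops(w, x, b)
```

Both functions compute the same values. The difficulty raised above is that the bias and activation steps sit outside the reduction loop, so the intrinsic has to cover more than a single op.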

I think Nvidia’s post seems to need the original dataset for the last fine-tuning step, but Intel’s research doesn’t need it. It is a pure post-training quantization method with only some stochastic data processing, while the final accuracy loss (< 3%) on the same model is slightly larger than Nvidia’s (< 1%).
(Intel’s method doesn’t use only 4 bits. It sometimes uses more, such as 5, because of their per-channel bit allocation technique, so I know it’s not a completely fair comparison.)

Therefore, I guess post-training quantization without the training dataset is a more natural fit for TVM. What do you think?

Also, if TVM makes an int4 type computable somehow, it would need to be simulated in software since it’s not a native CPU primitive type. Adding this implementation may require some tedious handling such as promotion, truncation, and overflow detection. In this case, I think implementing arbitrary-precision type handling is more reasonable than an int4-specific one. It would make it easier to support other precisions and more specialized hardware backends.
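A minimal sketch of what such bit-width-generic handling could look like, assuming signed two's complement semantics. The helper names `int_range`, `trunc`, `overflows`, and `add` are made up for this example and are not TVM functions.

```python
def int_range(bits):
    """Smallest and largest value of a signed two's complement type."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def trunc(value, bits):
    """Keep the low `bits` bits and reinterpret them as a signed value."""
    v = value & ((1 << bits) - 1)
    return v - (1 << bits) if v >> (bits - 1) else v

def overflows(value, bits):
    """True if `value` is not representable in `bits` signed bits."""
    lo, hi = int_range(bits)
    return not lo <= value <= hi

def add(a, b, bits):
    """Promote to a wider type (a Python int), add, then truncate.

    Returns the wrapped result and an overflow flag.
    """
    wide = a + b
    return trunc(wide, bits), overflows(wide, bits)

int4_wrap = add(7, 1, 4)     # exceeds the int4 maximum of 7
int4_ok = add(3, 2, 4)
int5_wrap = add(15, 1, 5)    # the same helpers work for a 5-bit type
```

Because the width is a parameter, the same code covers int4, the 5-bit case mentioned above, or any other precision a backend might need.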

Intel’s method also requires a calibration dataset.

I know the calibration needs a dataset, but it’s not the one that was used to train the model. I think this approach is like the quantization TVM currently supports (KL divergence mode).