AutoTVM has two sides: how could we improve our experience?


#1

As has been demonstrated, AutoTVM delivers large performance improvements and is one of the killer features of TVM. However, it has one disadvantage: tuning takes a very long time, especially on GPUs and remote embedded devices.

If we already have a finished model, this is acceptable. However, if we want to evaluate performance quickly, it falls short. Suppose we want to prune or compress a model and see which variant is better: we cannot get a quick run-time measurement the way we can with other inference frameworks (such as tflite), because we need to tune first, which may take hours.

I am starting this thread to discuss how we could improve the experience given that AutoTVM’s tuning time is so long. Sometimes it even prevents us from running experiments like the one described above.


#2

This is an important discussion to have. I think one big missing piece is approximate template matching. Right now, if a workload doesn’t have an exact match in tophub, the default fallback configuration is used, which of course leads to slow inference. However, workloads of similar shape and type are highly likely to share the same optimal schedule. It would make sense to introduce some way to recommend the most likely good schedule for a new workload instead of having to tune it from scratch. That would make the predefined tophub schedules apply to many new models.
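
To make the idea concrete, here is a minimal sketch of approximate matching in plain Python. It does not use the real tophub or AutoTVM APIs: the record layout, the `recommend_config` helper, the log-ratio distance, and the `max_distance` threshold are all assumptions chosen for illustration. It picks the tuned workload of the same operator type whose shape is closest, and returns nothing when the closest one is still too far away, so the caller can use the fallback.

```python
import math

def shape_distance(a, b):
    """Sum of absolute log2 ratios between matching dimensions.

    Shapes of different rank are treated as infinitely far apart.
    """
    if len(a) != len(b):
        return math.inf
    return sum(abs(math.log2(x / y)) for x, y in zip(a, b))

def recommend_config(op, shape, tuned_records, max_distance=2.0):
    """Return the config of the nearest tuned workload, or None.

    tuned_records maps (op_name, shape) -> config. This layout is
    hypothetical, not the format tophub actually uses.
    """
    best_cfg, best_dist = None, math.inf
    for (rec_op, rec_shape), cfg in tuned_records.items():
        if rec_op != op:
            continue  # only compare workloads of the same operator type
        dist = shape_distance(shape, rec_shape)
        if dist < best_dist:
            best_cfg, best_dist = cfg, dist
    # Too far from anything we have tuned: let the caller fall back.
    return best_cfg if best_dist <= max_distance else None

# Made-up records; the knob names are placeholders.
tuned = {
    ("conv2d", (1, 64, 56, 56)): {"tile_f": 8, "tile_x": 4},
    ("conv2d", (1, 256, 14, 14)): {"tile_f": 16, "tile_x": 2},
    ("dense", (1, 1024)): {"tile_k": 32},
}

print(recommend_config("conv2d", (1, 64, 48, 48), tuned))
print(recommend_config("conv2d", (1, 3, 224, 224), tuned))
```

A log-ratio distance is used here so that doubling a dimension counts the same at any scale. A real implementation would probably also need to check that the borrowed config is valid for the new shape (for example, that tile sizes divide the dimensions).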


#3

This is indeed a good point. First of all, existing approaches are not optimal because the libraries themselves are optimized for certain workloads (e.g. resnet), and using them directly for other workloads gives suboptimal results. For example, the recent OctConv paper from fb showed that they get a 2x speedup by using tvm, whereas the original library slows those workloads down.

The ultimate solution to this is twofold. First, we could improve the autotvm infra so that it can interpolate and transfer, just like what @jwfromm mentioned. Alternatively, we could work on a performance predictor that predicts the best perf we could get before we actually find the best kernel. A hybrid of these approaches would eventually give us a better pipeline. @eqy also did some exploration in this direction.

I think it is a great discussion to have, and we want to build infra that allows exploration of both directions.


#4

Indeed, @eqy explored some ideas for whole-network tuning that spend time on the operators yielding the best overall speedup, so that inference time is minimized under an overall time budget constraint.

I wonder if that would be easy to upstream to the community so we could focus on strategies to improve it over time. Essentially this would equate to doing an exploratory tuning job in an hour to get an idea of the ballpark performance of a given model on a given hardware platform, before spending a day tuning it to get near-optimal performance. As more optimizations and graph transformations are applied, this can become crucial to quickly assess if those graph transformations are worth it.
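
As a rough illustration of tuning under a budget, the sketch below spends a fixed number of tuning rounds greedily, always on the operator whose previous round gave the largest latency reduction. This is not @eqy's actual method, and none of it is AutoTVM API: `tune_under_budget`, the `tune_step` callback, and the latencies are hypothetical stand-ins.

```python
def tune_under_budget(latency_ms, tune_step, budget_rounds):
    """Greedy budgeted tuning across the operators of one network.

    latency_ms:    dict mapping operator name -> current latency in ms
    tune_step:     callable(op, current_ms) -> latency after one more
                   round of tuning on that operator (hypothetical hook)
    budget_rounds: total number of tuning rounds we can afford
    """
    latency = dict(latency_ms)
    # Optimistic start: assume each operator could gain its whole latency,
    # so the slowest operators get tried first.
    last_gain = dict(latency)
    for _ in range(budget_rounds):
        op = max(last_gain, key=last_gain.get)
        # Never accept a result worse than the best one seen so far.
        new_latency = min(latency[op], tune_step(op, latency[op]))
        last_gain[op] = latency[op] - new_latency
        latency[op] = new_latency
    return latency

# Toy stand-in for a tuner: every round halves the operator's latency.
def halve(op, current_ms):
    return current_ms * 0.5

start = {"conv1": 8.0, "conv2": 3.0, "dense": 1.0}
result = tune_under_budget(start, halve, budget_rounds=3)
print(result, "total ms:", sum(result.values()))
```

With a real tuner the gains shrink over time, so the greedy rule naturally moves on from an operator once it stops improving. An operator that stalls for one round is deprioritized for good in this simple version, which a more careful scheduler would want to revisit.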