Relay VM Just-in-Time Compilation

@kevinthesun @haichen let’s continue the discussion on https://github.com/dmlc/tvm/pull/4129 here if you wish.

I want to clarify that this is not a half-baked solution: this technique is known as a polymorphic inline cache, and it has been around for a long time. It also covers most dynamic use cases.
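To make the idea concrete, here is a minimal sketch of a polymorphic inline cache keyed by concrete input shape. It is not the code from the PR and not a TVM API: `ShapeInlineCache`, `compile_for_shape`, and `max_entries` are hypothetical names, and the "compiler" is a stand-in that just returns a Python closure. The point is only that each distinct shape is compiled once and reused afterwards.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]
Kernel = Callable[[List[float]], List[float]]


class ShapeInlineCache:
    """Hypothetical polymorphic inline cache keyed by concrete input shape."""

    def __init__(self, compile_fn: Callable[[Shape], Kernel], max_entries: int = 4):
        self._compile_fn = compile_fn
        self._max_entries = max_entries
        self._entries: Dict[Shape, Kernel] = {}
        self.compile_count = 0  # number of times the compiler was invoked

    def lookup(self, shape: Shape) -> Kernel:
        kernel = self._entries.get(shape)
        if kernel is None:
            # Cache miss: specialize for this exact shape.
            kernel = self._compile_fn(shape)
            self.compile_count += 1
            # Only remember the kernel while the cache has room. Once it is
            # full, unseen shapes are recompiled on every call (a simplification
            # of the usual generic fallback path).
            if len(self._entries) < self._max_entries:
                self._entries[shape] = kernel
        return kernel


def compile_for_shape(shape: Shape) -> Kernel:
    """Stand-in for a real compiler: returns a kernel fixed to one shape."""
    size = 1
    for dim in shape:
        size *= dim

    def kernel(data: List[float]) -> List[float]:
        assert len(data) == size, "kernel was specialized for a different shape"
        return [2.0 * x for x in data]

    return kernel


cache = ShapeInlineCache(compile_for_shape)
first = cache.lookup((32, 3, 224, 224))   # miss: compiles for batch size 32
second = cache.lookup((32, 3, 224, 224))  # hit: reuses the compiled kernel
```

For a vision model where only the batch dimension varies, a cache like this would see a handful of distinct shapes, so almost every call after warm-up would be a hit.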

If we count all the models TVM supports (by looking at the tests, the benchmarks, the TVM paper, or the Relay paper), most of them are vision models (DenseNet, ResNet, DCGAN, VGG, MLP), and they all have a batch parameter. Some of the rest are LSTM and TreeLSTM, or slight modifications of them (RNN, GRU, BiLSTM, which is a concatenation of two LSTMs), and there are significantly fewer NLP models than vision models. Finally, among the NLP models, there are a few that sort of exist (Transformer, BERT). Due to various technical reasons surrounding the VM and Any, they are mostly in development and have not been made public yet.

I fail to see why an approach that supports training of all current vision models, but not some models that don't exist in TVM yet, is a limited approach.

I also disagree that it will be replaced soon. Suppose graph dispatching comes out tomorrow: this still has an advantage over graph dispatching, since it will give the best performance for more than half of the models (the vision ones). Of the rest, half won't be affected, and only for a small minority (BERT) will this be worse than graph dispatching.

Another concern: I have looked at graph dispatching, and judging by the RFC, it has to make nontrivial changes to Relay, TOPI, and possibly AutoTVM. Do you have an estimate of how long it will take?

What do you mean when you say that transformer models “sort of exist”? I have been successfully running BERT models on TVM for a few months now.

I also think that some good points were brought up regarding the runtime - if we do this type of JIT compilation, then we have to embed the compiler in the runtime. Besides increasing the size of the runtime, will this work for embedded devices?

Thanks for the discussion. It would be great if we could create an RFC and moderate the discussion a bit in a more neutral way: list the solutions and alternatives and understand the trade-offs. In particular, try to summarize the key technical points (JIT cache, dispatch at compile time, etc.) and the pros and cons (e.g. the need for compilation at runtime).

Then we go from there.