For TVM to work as a generic compiler for frontend frameworks, it needs to support all of the operators that those frontend frameworks support. Relay already supports many operators for numerous frontends, and this operator list tends to be sufficient for common model architectures and use cases. However, it can be hard to support new model architectures, or model developers who use unique, specialty operators provided by the framework. This can put us in a constant state of playing catch-up.
As a more concrete example, I was recently investigating an NLP model that uses TensorFlow's lookup table for embedding lookup, and TensorFlow's `contrib.seq2seq.GatherTree` operator as part of beam search. Since neither of these operators is supported in Relay, I started looking into implementing them. However, I found it difficult to justify the effort of implementing an operator in Relay+TOPI for a one-off, potentially esoteric use case. Further, ops such as the lookup table are already very fast in TensorFlow, so there isn't a real need to compile them with TVM.
I think that unsupported operators should not prevent us from running a graph. When an operator is not supported, we can fall back to running that operator in the original framework's runtime. This hybrid approach will certainly not be as performant as using the TVM runtime for the entire graph, but it will unblock users who want to run graphs with new model architectures and brand-new operators.
As I mentioned above, NLP models are a great example of this. Since people implement their embedding lookups in many different ways, we cannot be certain that all of those ops will be supported. However, the core model logic (such as a Transformer or RNN) is generally supported by TVM. A hybrid approach will allow us to run the embedding lookup in the native framework and use TVM to optimize the core model, which also tends to be the more computationally expensive part.
I propose creating a new operator in TVM that will run a generic graph or subgraph in the native frontend framework.
Let’s look at an example for TensorFlow:
When we see an operator that is not in the convert map, we can create a Relay node for a new op called `TensorFlowRunner`. Given that TensorFlow can execute subgraphs by simply passing input / output nodes into `session.run`, this operator needs to take in: the input tensor names, the output tensor name, and the serialized graph definition that was being used in the TF frontend (this can be a string attribute). All other parameters and attributes can be inferred from the graph definition.
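The conversion-time fallback could look something like the sketch below. This is not real TVM frontend code: the node/convert-map representation and the `TensorFlowRunner` attribute names are hypothetical, chosen only to illustrate how an unsupported op gets wrapped with everything `session.run` will later need.

```python
# Hedged sketch (hypothetical structures, not TVM's actual frontend API):
# when a TF node has no entry in the convert map, wrap it in a
# TensorFlowRunner node carrying the info session.run needs as attributes.

def convert_node(node, convert_map, graph_def_str):
    """Convert one TF node, falling back to TensorFlowRunner when no
    converter exists for its op type."""
    if node["op"] in convert_map:
        # Normal path: a Relay converter exists for this op.
        return convert_map[node["op"]](node)
    # Fallback path: record input tensor names, the output tensor name,
    # and the serialized graph definition as string attributes.
    return {
        "op": "TensorFlowRunner",
        "attrs": {
            "input_names": node["inputs"],        # TF tensors to feed
            "output_name": node["name"] + ":0",   # TF tensor to fetch
            "graph_def": graph_def_str,           # serialized GraphDef
        },
    }
```

For example, with a convert map that only knows `Add`, a `GatherTree` node would take the fallback path and come out as a `TensorFlowRunner` node.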
This operator will be implemented as a TOPI contrib library. The first time the operator is executed, it will JIT-create the session from the graph definition and cache it. It will then call `session.run` with the input tensor names and output tensor name, returning the output tensor. All subsequent calls to this operator will use the cached session. In fact, every `TensorFlowRunner` call within the same graph execution can share one session, since `session.run` can be called with different arguments.
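The caching behavior described above can be sketched in isolation. This is a minimal illustration, not the contrib implementation: the session factory is injected as a parameter (in the real version it would build a `tf.Session` from the serialized GraphDef), so only the JIT-create-and-cache logic is shown.

```python
# Hedged sketch of the JIT session cache. `session_factory` stands in
# for "create a TF session from a serialized graph definition", which
# keeps the caching logic self-contained and framework-independent.

_session_cache = {}

def get_or_create_session(graph_def, session_factory):
    """Return the cached session for this graph definition, creating
    it on first use."""
    key = graph_def  # in practice, e.g. a hash of the serialized bytes
    if key not in _session_cache:
        _session_cache[key] = session_factory(graph_def)
    return _session_cache[key]

def tensorflow_runner(graph_def, input_feeds, output_name, session_factory):
    """Run one fallback call through the (cached) native session.
    Different calls may pass different feeds/fetches to the same session."""
    sess = get_or_create_session(graph_def, session_factory)
    return sess.run(output_name, feed_dict=input_feeds)
```

Repeated calls with the same graph definition hit the cache, so the session-creation cost is paid once per graph rather than once per operator invocation.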
This feature will be opt-in, as TVM will need to be linked against the frontend runtime. We can also add a parameter like `fallback_when_op_not_supported` to the frontend conversion function, so that users enable the fallback explicitly.
I had thought of other implementations, such as using a custom TVM op inside TensorFlow, or manually splicing the graph and running the subgraphs in their respective frameworks. The first solution is challenging because it requires the user to have the model source code; I believe the correct solution should work even when we only have the exported graph. The second solution is challenging because it requires manually splicing the graph, converting spliced nodes into explicit inputs and outputs, and handling nodes that "pass through" between subgraphs when they are not inputs or outputs.
I’m looking forward to hearing what you think!