[Discuss] Running graphs when ops aren't supported

Problem

In order for TVM to work as a generic compiler for frontend frameworks, it needs to support all of the operators that those frontend frameworks support. Relay already supports many operators for numerous frontends, and this operator list tends to be sufficient for many use-cases and common model architectures. However, it can be hard to support new model architectures or model developers who use unique, specialty operators provided by the framework. This can put us in a constant state of playing catch-up.

As a more concrete example, I was recently investigating an NLP model that uses TensorFlow’s lookup table for embedding lookup, and TensorFlow’s contrib.seq2seq.GatherTree operator as part of beam search. Given that neither of these operators is supported in Relay, I started to look into implementing them. However, I found it difficult to justify the effort of implementing an operator in Relay + TOPI based on a one-off, potentially esoteric use case. Further, ops such as the lookup table should already be very fast in TensorFlow, so there really isn’t a need to use TVM to compile them.

Proposed Solution

I think that unsupported operators should not prevent us from running a graph. When an operator is not supported, we can “fall back” and run that operator in the original framework’s runtime. This hybrid approach will certainly not be as performant as using the TVM runtime for the entire graph, but it will unblock users who want to run graphs with new model architectures and brand-new operators.

As I mentioned above, NLP models are a great example of this. As many people decide to implement their embedding lookups differently, we cannot be certain that all of those ops will be supported. However, the core model logic (such as Transformer or RNN) is generally supported by TVM. A hybrid approach will allow us to run the embedding lookup in the native framework, and use TVM to optimize the core model, which also tends to be more computationally expensive.

Proposed Implementation

I propose creating a new operator in TVM that will run a generic graph or subgraph in the native frontend framework.

Let’s look at an example for TensorFlow:

When we see an operator that is not in the convert map, we can create a Relay node for a new op called TensorFlowRunner. Given that TensorFlow can execute subgraphs by simply passing input / output nodes into session.run, this operator needs to take in: input tensor names, output tensor name, and the serialized graph definition that was being used in the TF frontend (this can be a string attribute). All other parameters and attributes can be inferred from the graph definition.
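A rough sketch of what that fallback path in the frontend converter could look like. Here `convert_map` is a plain dict and the `TensorFlowRunner` result a plain descriptor; these are illustrative stand-ins, not existing TVM APIs:

```python
# Hypothetical sketch of a frontend fallback path; convert_map and the
# TensorFlowRunner descriptor are illustrative names, not TVM APIs.

def convert_node(node, convert_map, serialized_graph_def):
    """Convert one TF node, falling back to a runner descriptor."""
    if node["op"] in convert_map:
        # Normal path: translate to a Relay expression.
        return convert_map[node["op"]](node)
    # Fallback path: record everything session.run will need later.
    return {
        "op": "TensorFlowRunner",
        "input_tensor_names": node["inputs"],    # e.g. ["embedding:0"]
        "output_tensor_name": node["name"] + ":0",
        "graph_def": serialized_graph_def,       # string attribute
    }
```

All other attributes of the unsupported node stay inside the serialized graph definition, so the descriptor only needs the boundary tensor names.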

This operator will be implemented as a TOPI contrib library. The first time this operator is executed, it will JIT create the session from the graph definition and cache it. It will then call session.run given the input tensor names and output tensor name, returning the output tensor. All subsequent calls to this operator will use the cached session. In fact, any call to the TensorFlowRunner operator in the context of a graph execution can use the same session, since session.run can be called with different arguments.
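The JIT-create-and-cache behavior might look like the following sketch. `create_session` is injected as a stand-in for building a TF session from the GraphDef, so the caching logic stays framework-agnostic; none of these names are existing TVM APIs:

```python
# Sketch of the session-caching behavior described above. create_session
# stands in for constructing a TF session from a serialized GraphDef.

_session_cache = {}

def run_fallback(graph_def, input_feeds, output_name, create_session):
    """Lazily create (and cache) one session per graph definition,
    then execute the requested output tensor."""
    key = graph_def  # the serialized GraphDef identifies the session
    if key not in _session_cache:
        _session_cache[key] = create_session(graph_def)  # JIT creation
    session = _session_cache[key]
    # In real TensorFlow this would be session.run(output, feed_dict=inputs).
    return session.run(output_name, input_feeds)
```

Because the cache is keyed by the graph definition, every TensorFlowRunner call in the same graph execution shares one session, matching the behavior described above.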

This feature will be opt-in, as TVM will need to be linked to the frontend runtime. We can also add a parameter like fallback_when_op_not_supported to the from_tensorflow method.

I had thought of other implementations, such as using a custom TVM op in TensorFlow, or manually splicing the graph and running subgraphs in their respective frameworks. The first approach is challenging in that it requires the user to have the model source code; I believe the correct solution should work even when we only have the exported graph. The second approach is challenging because it requires manually splicing the graph, converting spliced nodes into explicit inputs and outputs, and handling nodes that “pass through” between subgraphs when they are not inputs or outputs.

I’m looking forward to hearing what you think!

cc @tqchen @jroesch @jwfromm


Hi,

This is definitely a common problem. The hybrid solution that you are describing sounds like the “SelectOps” approach in TFLite, in which the vanilla TF implementation is used whenever TFLite encounters an unsupported operator.

In the past, the way I have solved this is by doing graph surgery, in which one part is executed and optimized in TVM and small portions with unsupported operators are executed in TF. For example, a model could be split into 3 subgraphs as follows: TVM -> TF -> TVM. Of course, this is a manual and tedious process, so having a more automated way to do it in TVM would be great!

That’s definitely a welcome feature @jonso! This would also be useful for operators that don’t benefit from highly optimized support in TVM TOPI. I would imagine wanting to fall back on subgraphs that would run faster in the native framework than in TVM, until TOPI support improves for the specific operator-shape-target tuple.

Awesome, thanks for the feedback!

Personally, I would prefer to go with my proposed implementation (native runtime operator) vs manually extracting subgraphs. Manually extracting subgraphs can get very complicated, as we have to understand a subgraph’s input, output, and passthrough nodes, as well as how to stitch multiple subgraphs together. I can work on putting together a prototype of the native runtime operator.

In the future, we can get even sneakier - we can have an IR pass to merge multiple native runtime operators into a single op that runs a larger subgraph.
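As a toy illustration of such a merging pass, assuming (as a simplification) a linear op sequence where fallback ops are tagged with a `tf:` prefix:

```python
# Illustrative sketch only: fuse consecutive native-runtime ops in a
# linear op sequence into one larger fallback subgraph. Real Relay
# graphs are DAGs, so an actual pass would be more involved.

def merge_runner_ops(ops):
    merged = []
    for op in ops:
        if op.startswith("tf:") and merged and isinstance(merged[-1], list):
            merged[-1].append(op)      # extend the current TF subgraph
        elif op.startswith("tf:"):
            merged.append([op])        # start a new TF subgraph
        else:
            merged.append(op)          # normal TVM-compiled op
    return merged
```

For example, `["a", "tf:b", "tf:c", "d"]` would become `["a", ["tf:b", "tf:c"], "d"]`, so the two fallback ops pay for only one native-runtime round trip.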

What are your thoughts implementation-wise?

My main concern is deploying the model into a production environment that doesn’t have TensorFlow. In that case we still have to implement the op in Relay + TOPI, like we do now.

@FrozenGene that can be solved by shipping the TF runtime alongside the TVM and TOPI binaries. It is up to the user to do this (not TVM’s responsibility).

This is definitely a very common problem that TVM users have to deal with frequently :slight_smile: I think for this you would need some kind of graph partitioning within Relay, right? (Not for manually slicing graphs, but for codegen and runtime.) Another worry is shape inference: we will need a way to set up this black-box operator’s shape in Relay to enable shape inference. But all these efforts are worth it for making TVM more friendly :slight_smile:

@janimesh my goal here is to not have graph partitioning - we can just run single ops in the native runtime. For shape inference, I was thinking we could run the TF graph up until that op with simulated input, get the output shape, and pass it as an attribute to the Relay op.
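That simulated-input approach could look something like the sketch below, where `run_subgraph` stands in for executing the TF graph up to the op via session.run; none of these names are existing APIs:

```python
import numpy as np

# Sketch of shape inference by simulated execution: feed zero tensors of
# the known input shapes, run the unsupported op once at conversion time,
# and record the resulting output shape as a Relay attribute.

def infer_output_shape(input_shapes, input_dtypes, run_subgraph):
    feeds = {
        name: np.zeros(shape, dtype=input_dtypes[name])
        for name, shape in input_shapes.items()
    }
    out = run_subgraph(feeds)   # one-off execution with dummy data
    return tuple(out.shape)     # attach this as the op's shape attribute
```

One caveat with this approach is data-dependent output shapes (e.g. ops whose output size depends on input values, not just input shapes), where a zero-filled dummy input may not be representative.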

If people are interested, it’s also worth discussing what a graph-partitioning-based method would look like :slight_smile:

Suppose we deploy to an IoT device. We would first have to cross-compile TensorFlow, and if the device has limited memory, it may not be able to accommodate the large TensorFlow runtime. These problems don’t exist in a cloud environment, but one of TVM’s main goals is to enable deployment to exactly these IoT devices. We even had a PR to create a minimal runtime (around 12KB on ARM). So I don’t think we can say this is not TVM’s responsibility. I agree that falling back to the original framework is one step toward supporting more operators, but we should consider IoT devices that cannot host the original framework. We can’t put this aside and leave it to users.

Graph partition may be a better way. Besides supporting unsupported operators, we could offload some ops to other frameworks like TensorRT / Intel clDNN for acceleration, or even support executing ops on an NPU. So I think graph partition could do more.

@FrozenGene how do you propose doing automatic graph partitioning? Would this also require the user to manually stitch the graphs together?

@jonso Graph partition is another story, and it also cannot solve the IoT device problem I mentioned before. But if we use graph partition to support unsupported ops, one way is to recognize TVM-supported ops in our frontend and mark the other ops as a special op kind like tf_op. Then, in Relay, we do graph partition: tf_op becomes a subgraph executed by TensorFlow (when TVM sees this op, it skips it and doesn’t compile it), while the other ops are compiled and executed by TVM. This doesn’t require users to do anything manually, but it requires some logic in the Relay partition and runtime parts.

I definitely agree with the idea, I’m just concerned about implementation. For example, say we have a simple graph, A -> B -> C. A and C are supported by Relay, B is not. We can easily compile A, then mark B to be run as a subgraph in the native runtime. However, to compile C, we have to explicitly modify the incoming graph to change B into a placeholder.

This gets more complicated when you have nodes that aren’t directly connected, but are still needed. For example, say we have the graph A -> B -> C, but C actually takes both A and B as input. Here, A is called a “passthrough” input. We will have to parse the graph to understand that we have to create a new placeholder, A, to pass to C.
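To make the passthrough issue concrete, here is a small illustrative helper (hypothetical, not TVM code) that, given producer edges and a set of nodes being spliced out, computes which external values the spliced-out subgraph must take as explicit placeholder inputs:

```python
# Illustrative only: edges maps each node to the list of nodes it
# consumes. A value is a required input of a subgraph if some node
# inside the subgraph consumes it but it is produced outside.

def subgraph_inputs(edges, subgraph_nodes):
    """External values a spliced-out subgraph must take as placeholders."""
    ins = set()
    for node in subgraph_nodes:
        for src in edges.get(node, []):
            if src not in subgraph_nodes:
                ins.add(src)
    return ins
```

For the A -> B -> C example above (with C also consuming A), splicing out the TVM-side subgraph {C} yields inputs {A, B}: B because it is the TF-produced predecessor, and A because it "passes through" the TF region.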

Extend this to a graph like A -> B -> C -> D -> E, where A, C, and E are executed on TVM, but B and D are executed on TF. E takes A, B, and D as input. IMO this graph parsing and reconstruction is quite complex.

@tqchen do you have any thoughts here?

@jonso You are correct. Graph partition is a key part here. The MXNet community has a good proposal we could refer to: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+backend+libraries

Thanks for linking that, it’s a really interesting doc. Their implementation actually seems more similar to my original implementation + an IR pass to fuse consecutive nodes on TF into one subgraph.

On an IoT device, how will the TF subgraph be executed? Who will be orchestrating the subgraph execution?

Great discussions. I think we need to approach the problem in two perspectives:

  • How are these additional runtimes presented, stored, and linked with the existing TVM runtime.
  • How do we build a compiler that compiles to these runtimes.

Here is some example code to demonstrate the proposed flow (in the Python API; the API is up for discussion):


mod = frontend.from_tensorflow(xxx)

# split out tf related functions into sub functions
mod = relay.transform.SplitTensorFlowFallbacks()(mod)
# collect all related customized compilation functions into a new module
mod_tf = relay.transform.ExtractCustomCompilation("tensorflow")(mod)
mod_normal = relay.transform.ExtractNormalFunctions()(mod)

# hooks into customized build callbacks, and build a runtime tf module
runtime_tf_mod = relay.build(mod_tf)
runtime_mod_normal = relay.build(mod_normal)

# NOTE: runtime_mod_normal could call into functions in runtime_tf_mod
runtime_mod_normal.import_module(runtime_tf_mod)

# This should work out of the box because we have defined the Save/Load Binary interface of the tf runtime module
runtime_mod_normal.export_library("xyz.so")

How to build plugin runtime modules

What is being discussed here fits well with our runtime Module design. Note that TVM’s runtime::Module defines two important aspects:

  • How do we serialize the module (which could then be redirected to the runtime-specific serialization).
  • How do we expose the runtime as functions (by returning them through the PackedFunc interface) in a way TVM recognizes.

For third-party runtimes like TensorFlow / NPU code, in practice we could build a runtime/contrib/tf_runtime which uses TF’s API to execute the graph and exposes everything as PackedFuncs.

Then the other runtimes (graph rt, vm) should be able to use these functions just by importing the corresponding module. Because each module defines its own serialization function, we can use the native serialization of the corresponding runtime, and things should work out of the box.

How to do compilation

In terms of compilation, we will need a customized flow that takes a relay::Function (after partition) and calls into a customized compiler, which directly produces the specific runtime module; that module can then be imported into the graph runtime and serialized together with it (as a .so file or other format).

This is something that complements “Add TensorFlow custom op and run tvm in TensorFlow”, which @tobegit3hub proposed. In the meanwhile, as we improve op coverage, the need for this kind of fallback should decrease.


Thanks a lot for the design - it does seem like a Module plugin is the way to go over a topi plugin. How would this work with multiple subgraphs? For example, if there are several unconnected operators that should be run in TF? How would we orchestrate calling TVM -> TF -> TVM -> TF…?

The one benefit of a TOPI external operator is that all of this would be taken care of for us by the graph runtime.

I think the only difference between a Module and an external op is the serialization perspective: each runtime might want to design its own serialization and, when needed, could have internal state, while an extern op relies on conforming to the op-level interface and cannot have internal state (it would need to rely on TLS).

Because runtime::Module does support returning multiple PackedFuncs, each sub-graph will become a PackedFunc that the graph runtime can call. So for the graph runtime, the calls will only go from TVM RT -> PackedFunc-from-TF-module, not the other way around.
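As a pure-Python stand-in (none of these classes exist in TVM), the calling direction might look like:

```python
# Stand-in for the Module/PackedFunc relationship described above: one
# "TF module" exposing a callable per fallback subgraph, and a
# graph-runtime-style loop that calls TVM and TF functions uniformly.

class FakeTFModule:
    """Stands in for a runtime::Module wrapping a TF session."""
    def __init__(self, subgraphs):
        self._subgraphs = subgraphs           # name -> python callable
    def get_function(self, name):             # like Module::GetFunction
        return self._subgraphs[name]

def run_graph(schedule, tf_module, tvm_funcs, value):
    # The runtime only ever calls TVM RT -> PackedFunc-from-TF-module;
    # the TF module never calls back into the graph runtime.
    for kind, name in schedule:
        fn = tvm_funcs[name] if kind == "tvm" else tf_module.get_function(name)
        value = fn(value)
    return value
```

The key point this sketches is that orchestration of TVM -> TF -> TVM -> TF sequences stays entirely inside the existing graph runtime loop, with TF subgraphs indistinguishable from ordinary fused functions.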

I see, that makes sense. I think I am still missing one piece to put this all together: will we still need to insert a dummy Relay op into the TVM module so we know when to call the TF module? I think I am misunderstanding the final compilation part.

Also, given that I’m not very familiar with runtime::Module, and this seems like a sizable feature, would it be worthwhile to have a call to discuss the design?

Yes, it is a great idea to have a discuss thread in the community.

The way that Relay compilation works right now is: a Relay function gets partitioned into a function that calls into primitive ops, plus the primitive ops as functions. Then the primitive ops get compiled into TVM PackedFuncs (which become the DSO module), and the function that calls into the primitive ops becomes the graph runtime module.

What we want instead would be to lift some of the sub-graphs into sub-functions, and pass those functions to a custom compile flow that gives us a TF module. The mechanism for calling into these functions should remain exactly the same as the mechanism for calling into the functions in the DSO module of the normal TVM compilation flow.