[Discuss] Running graphs when ops aren't supported

@FrozenGene how do you propose doing automatic graph partitioning? Would this also require the user to manually stitch the graphs together?

@jonso Graph partitioning is another story, and it also cannot solve the IoT-device problem I mentioned before. But if we use graph partitioning to support unsupported ops, one way is to recognize the TVM-supported ops in our frontend and mark all other ops as one special op kind, such as tf_op. Then, in Relay, we do the graph partitioning. tf_op becomes a subgraph executed by TensorFlow (when TVM sees this op it skips it and doesn't compile it), while the other ops are compiled and executed by TVM. This doesn't require users to do anything manually, but it does require some logic in the Relay partition pass and the runtime.
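
As a toy sketch of that frontend idea (the op names, node format, and helper below are all made up for illustration, not an actual frontend API), the split could look like:

SUPPORTED_OPS = {"Conv2D", "Relu", "BiasAdd"}

def classify_nodes(tf_nodes):
    """Split nodes into those TVM will compile and those left to TensorFlow."""
    tvm_nodes, tf_fallback = [], []
    for node in tf_nodes:
        (tvm_nodes if node["op"] in SUPPORTED_OPS else tf_fallback).append(node["name"])
    return tvm_nodes, tf_fallback

nodes = [
    {"name": "conv", "op": "Conv2D"},
    {"name": "custom", "op": "MyCustomOp"},  # no converter -> stays on the TF side
    {"name": "relu", "op": "Relu"},
]
print(classify_nodes(nodes))  # (['conv', 'relu'], ['custom'])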

I definitely agree with the idea; I'm just concerned about the implementation. For example, say we have a simple graph, A -> B -> C, where A and C are supported by Relay but B is not. We can easily compile A and then mark B to be run as a subgraph in the native runtime. However, to compile C, we have to explicitly modify the incoming graph to change B into a placeholder.

This gets more complicated when you have nodes that aren't directly connected but are still needed. For example, say we have the graph A -> B -> C, but C actually takes both A and B as input. Here, A is called a "passthrough" input. We will have to parse the graph to understand that we have to create a new placeholder for A to pass to C.

Extend this to a graph like A -> B -> C -> D -> E, where A, C, and E are executed on TVM, but B and D are executed on TF. E takes A, B, and D as input. IMO this graph parsing and reconstruction is quite complex.
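
To make the kind of graph surgery involved concrete, here is a rough sketch of the placeholder rewrite for the passthrough example above, using the TF 1.x graph API (the node names "A", "B", "C", the shapes, and the dtype are made up):

import tensorflow.compat.v1 as tf

def extract_subgraph_for_C(original_graph_def):
    # To handle C separately, feed it placeholders standing in for its inputs:
    # A (the "passthrough" value) and B (the unsupported op's output).
    g = tf.Graph()
    with g.as_default():
        a_ph = tf.placeholder(tf.float32, shape=[1, 16], name="A_placeholder")
        b_ph = tf.placeholder(tf.float32, shape=[1, 16], name="B_placeholder")
        # Re-import only the part of the graph that feeds C, remapping its
        # original inputs to the new placeholders.
        c_out = tf.import_graph_def(
            original_graph_def,
            input_map={"A:0": a_ph, "B:0": b_ph},
            return_elements=["C:0"],
            name="")[0]
    return g, c_out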

@tqchen do you have any thoughts here?

@jonso You are correct. Graph partitioning is the key part here. The MXNet community has a good proposal we could refer to: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+backend+libraries

Thanks for linking that, it’s a really interesting doc. Their implementation actually seems more similar to my original implementation + an IR pass to fuse consecutive nodes on TF into one subgraph.

On an IoT device, how will the TF subgraph be executed? Who will be orchestrating the subgraph execution?

Great discussions. I think we need to approach the problem from two perspectives:

  • How are these additional runtimes presented, stored, and linked with the existing TVM runtimes?
  • How do we build the compiler to compile to these runtimes?

Here is some example code to demonstrate the proposed flow (in the Python API; the exact API is up for discussion):


mod = frontend.from_tensorflow(xxx)

# split out tf related functions into sub functions
mod = relay.transform.SplitTensorFlowFallbacks()(mod)
# collect all related customized compilation functions into a new module
mod_tf = relay.transform.ExtractCustomCompilation("tensorflow")(mod)
mod_normal = relay.transform.ExtractNormalFunctions()(mod)

# hook into the customized build callbacks, and build a tf runtime module
runtime_tf_mod = relay.build(mod_tf)
runtime_mod_normal = relay.build(mod_normal)

# NOTE: runtime_mod_normal could call into functions in runtime_tf_mod
runtime_mod_normal.import_module(runtime_tf_mod)

# This should work out of the box because we have defined the Save/Load Binary interface of the tf runtime module
runtime_mod_normal.export_library("xyz.so")

How to build plugin runtime modules

What is being discussed here fits well with our runtime Module design. Note that TVM's runtime::Module defines two important aspects:

  • How do we serialize the module (which can be redirected to the runtime-specific serialization)?
  • How do we expose the runtime as functions (by returning them through the PackedFunc interface) in a way TVM recognizes?

For third-party runtimes like TensorFlow, an NPU, or other code, in practice we could build a runtime/contrib/tf_runtime that uses TF's API to execute the graph and exposes everything as PackedFuncs.

Then the other runtimes (graph runtime, VM) should be able to use these functions just by importing the corresponding module. Because each module defines its own serialization function, we can use the native serialization provided by the corresponding runtime, and things should work out of the box.
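
As a rough sketch of what the importing side could look like from Python, using the current runtime API (the function name "tf_subgraph_0" and the tensor shape are hypothetical; the actual naming is up to the custom compile flow):

import numpy as np
import tvm

# Load the combined artifact produced by export_library("xyz.so") above.
lib = tvm.runtime.load_module("xyz.so")

# Functions provided by the imported tf runtime module can be looked up from
# the importing module; query_imports=True searches the imported modules too.
f = lib.get_function("tf_subgraph_0", query_imports=True)

x = tvm.nd.array(np.random.uniform(size=(1, 16)).astype("float32"))
out = f(x)  # runs the sub-graph through the tf runtime module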

How to do compilation

In terms of compilation, we will need a customized flow that takes a relay::Function (after partitioning) and calls into a customized compiler, which directly produces the runtime-specific module. That module can then be imported into the graph runtime module and serialized together with it (as an .so file or other formats).

This is something that complements Add TensorFlow custom op and run tvm in TensorFlow, which @tobegit3hub has been working on. In the meantime, as we improve op coverage, fewer ops should need to fall back to the external runtime.


Thanks a lot for the design - it does seem like a Module plugin is the way to go over a topi plugin. How would this work with multiple subgraphs? For example, what if there are several unconnected operators that should be run in TF? How would we orchestrate calling TVM -> TF -> TVM -> TF…?

The one benefit of a topi external operator is that all of this would be taken care of for us by the graph runtime.

I think the only difference between a Module and an external op is the serialization perspective: each runtime might want to design its own serialization and, when needed, can keep internal state, while an extern op relies on conforming to the op-level interface and cannot hold internal state (it would need to rely on TLS).

Because runtime::Module supports returning multiple PackedFuncs, each sub-graph will become a PackedFunc that the graph runtime can call. So for the graph runtime, calls will only go from the TVM runtime into the PackedFuncs of the TF module, not the other way around.

I see, that makes sense. I think I am still missing one piece to put this all together: will we still need to insert a dummy Relay op into the TVM module so we know when to call the TF module? I think I am misunderstanding the final compilation part.

Also, given that I’m not very familiar with runtime::Module, and this seems like a sizable feature, would it be worthwhile to have a call to discuss the design?


Yes, it is a great idea to have a discussion thread in the community.

The way Relay compilation works right now is that a Relay function gets partitioned into a function that calls into primitive ops, plus the primitive ops themselves as functions. Then the primitive ops get compiled into TVM PackedFuncs (which become the DSO module), and the function that calls into the primitive ops becomes the graph runtime module.

What we want instead is to lift some of the sub-graphs into sub-functions and pass those functions to the custom compile flow that gives us a TF module. The mechanism for calling into these functions should remain exactly the same as the mechanism for calling into the functions in the DSO module of the normal TVM compilation flow.
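
A small, hedged illustration of that lifting step using the Relay Python API (the choice of ops is arbitrary, and how the lifted function would be tagged for the custom compiler is exactly the part still under design):

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 16))

# The part of the graph that the custom (e.g. TF) compile flow would handle,
# lifted into its own sub-function.
sub_x = relay.var("sub_x", shape=(1, 16))
sub_fn = relay.Function([sub_x], relay.tanh(sub_x))

# The main function calls the lifted sub-function the same way it calls
# fused primitive functions in the normal compilation flow.
y = relay.nn.relu(x)
main_fn = relay.Function([x], relay.Call(sub_fn, [y]))

mod = tvm.IRModule.from_expr(main_fn)
print(mod)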

@jonso, @tqchen should have answered your question. For IoT devices, I want to emphasize the key question: if we cannot have these runtimes on the device, what do we do? Cross-compiling these runtimes is not a good idea, and the memory limit makes the situation even worse. My concern is that we open a Pandora's box and, after tasting how sweet it is, lose the strong will to keep supporting operators, which will hurt TVM in the IoT area. We should continue to support Relay + topi operators and keep the lightweight deployment experience. So when we design this RFC, we should have one section describing how we handle IoT devices that cannot host these other runtimes, and it should be the first priority. I hope you understand my concern.

I agree. Native compilation support for the end-to-end pipeline should always be a first-class goal of the project.

The additional libraries are flexible ways to interface with the existing frameworks, but they won't help us in the cases where we might need the TVM stack the most, such as IoT, new hardware, and accelerators. Even for server-class applications, compilation gives quite a few benefits, such as minimized runtime deployment (minimum containerization), protection, etc.


@FrozenGene I completely agree, native compilation is always the goal. Personally, I’m hoping that supporting native runtimes will help us prototype TVM for new model architectures faster, and help prioritize dev work in terms of operator support.

@tqchen thank you very much for the detailed explanation! I did a deep dive into the docs and the code and realized that I had a fundamental misunderstanding of how the final graph is run. It seems that the graph JSON determines the order in which functions are run. Is that correct?
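
For reference, here is a heavily trimmed sketch of the graph JSON I am referring to (field values are made up): each tvm_op node names, via func_name, the PackedFunc the graph runtime should call, and the nodes are listed in execution order.

graph_json_sketch = {
    "nodes": [
        {"op": "null", "name": "data", "inputs": []},
        {"op": "tvm_op", "name": "fused_conv2d",
         "attrs": {"func_name": "fused_conv2d", "num_inputs": "1",
                   "num_outputs": "1", "flatten_data": "0"},
         "inputs": [[0, 0, 0]]},
    ],
    "arg_nodes": [0],      # which nodes are graph inputs
    "heads": [[1, 0, 0]],  # which node outputs are graph outputs
}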

I’ll start working on a prototype and post here when I have something reasonable. We can discuss in more detail from there!

Btw, I am planning on starting from supporting graph runtime. Is there an ETA for Relay VM to be mainstreamed? If it’s soon, maybe I should focus on supporting that instead of the graph runtime.

@comaniac and I have been working on third-party integration and graph partitioning for a while, and we have a POC implementation. I have just posted an RFC for what we have been doing.


@jonso Relay VM is able to execute all the models that the graph runtime supports now, with OK performance (<10% overhead due to the lack of static memory planning, which is WIP by @jroesch and @haichen). I'd recommend trying the graph runtime first since it's stable enough and won't change much in the future. I expect there will be a few major iterations of the VM implementation next.

I think the proposal about additional runtime support has nothing to do with the choice between the Relay VM and the graph runtime, because both the VM and the graph runtime will invoke the additional runtime as opaque PackedFuncs.

Oops, thanks for correcting me. I definitely missed the context here. If we choose PackedFunc as the interface between the various runtimes, the graph runtime and the VM should be able to share most of the infrastructure we set up.

One of the things to keep in mind is that this likely only works for “elementary ops”, but it is unclear what to do with ops like control-flow that have blocks / subgraphs, in particular when these blocks / subgraphs contain bits that TVM would like to optimize.

Also frameworks like PyTorch, being supportive of interoperability, have rather excellent support for extracting subgraphs of “known” ops, so dismissing that route because TF does not (which is my reading of the last paragraph, I don’t have first-hand knowledge of it) is probably a decision optimizing for handling TF graphs rather than the general case.

Personally, I'm probably not looking at TVM as a generic compiler for frontend frameworks but see most value in the fusion / optimization framework, and I would love to seamlessly get the benefits TVM can offer in the framework I'm already using (which happens to be PyTorch in my case). The notion of converting models from one framework to another might be something that is attractive/familiar to users of TF, but from hanging out in the PyTorch forums I get the feeling people are glad when they don't have to convert models for deployment.

Best regards

Thomas


@jonso Is there any development/PR related to this RFC?

As @FrozenGene mentioned, graph partitioning + external codegen can be used as a way to address unsupported operators. Maybe the recent BYOC feature + the ongoing work on [RFC] Op based annotation for external codegen is a solution to this challenge?

@tico this is definitely something that we should be able to support now, I just haven’t had time to work on it :frowning: I’m hoping to be able to get back to it in a few weeks. I’ll keep you updated on my progress, or if you want to work on it feel free!