Bring Your Own Codegen to TVM

Bring your own codegen to TVM + Graph Partitioning

The goal is to come up with the right Relay subgraph data structure/abstraction so that we can more conveniently allow third-party library and hardware vendors to bring their own codegen tools to TVM.

This RFC involves design and implementation in at least the following aspects.

  • Graph coloring
    • Providing HW vendors an infra to customize where they want to execute an op.
  • Graph partitioning
    • A Relay pass that partitions a program into segments that could be executed on various hardware platforms.
  • Code generation
    • Generate code for each segment of a partitioned Relay program.
  • Artifact serialization
    • Provide functionality to support save/load of the compiled artifacts.
  • Runtime
    • Integrate other runtimes/execution engines or invoke the external library code/subgraphs through both the graph runtime and the VM (the current POC implementation uses the VM).

Model Coverage

  • CNN: MLP, VGG, ResNet, SqueezeNet, Inception V3, etc.
  • CV: SSD with ResNet 50, MobileNet, VGG-16, etc.
  • NLP models are not supported well yet in Relay so we will revisit them in the future.
  • And more…

Coloring - Group annotated nodes into a minimum number of subgraphs.

  • Problem Formulation

    • Input

      • Given a Relay graph with extern op annotations (added by users or by some internal mechanisms).
      • The I/O of each node (op) may or may not have annotations to indicate if this node is suggested to be offloaded.
    • Output

      • A graph with a minimal set of annotations on edges indicating the boundaries of subgraphs.
  • Implementation 1: Op-level annotation

    • For each op, we have a corresponding check function registered, and the checker will be invoked at compilation time to indicate whether we should annotate the op for the 3rd-party accelerator to offload. For example, the following shows a checker for conv2d:
      @reg.register_extern_op("nn.conv2d")
      def conv2d(attrs, args, comp):
          return get_extern_op(comp, "conv2d")(attrs, args)
      
      • Note that comp is a string representing the 3rd-party compiler name; get_extern_op uses hasattr and getattr to obtain the checkers specified by the 3rd party.
    • HW partners/3rd-party libraries only need to implement simple checker functions for each op to specify whether they can support it under certain conditions. The following example shows a case where the accelerator only supports conv2d with floating-point types.
      def conv2d(attrs, args):
          dtype = args[0].output_type_.dtype
          return dtype in ('float32', 'float64')
      
      • Note that HW partners do not need to register this function but just need to implement it under Relay backend/contrib/compiler_name so that the function can be discovered and imported dynamically.
    • A Relay IR pass in Python will invoke the above function, insert annotations into the graph, and run Algorithm 1 for coloring.
  • Implementation 2: Subgraph-level annotation

    • We also provide an option for HW partners to annotate the graph directly. In this case, they have to implement a Relay IR pass that uses our APIs to insert boundary annotations (i.e., subgraph_start and subgraph_end), as sketched below.
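
For illustration, here is a minimal sketch of what such an annotation pass could look like. It assumes a hypothetical relay.annotation.subgraph_start/subgraph_end API and builds on Relay's ExprMutator; the exact annotation API is still under discussion:

    from tvm import relay
    from tvm.relay.expr_functor import ExprMutator

    class Conv2dAnnotator(ExprMutator):
        """Hypothetical pass: wrap every conv2d in boundary annotations."""

        def visit_call(self, call):
            new_args = [self.visit(arg) for arg in call.args]
            if call.op.name == "nn.conv2d":
                # Mark the inputs as the start of an offloaded subgraph and
                # the output as its end (the annotation names are assumptions).
                new_args = [relay.annotation.subgraph_start(a) for a in new_args]
                new_call = relay.Call(call.op, new_args, call.attrs)
                return relay.annotation.subgraph_end(new_call)
            return relay.Call(call.op, new_args, call.attrs)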

Partitioning - Check/Validate the graph and process graph I/Os.

  • Problem Formulation
    • Input
      • Given a Relay program with boundary annotations (i.e., subgraph_start and subgraph_end).
      • The boundary annotations can be added by the coloring stage. In this case, the boundary annotations are always valid.
      • Users can directly add boundary annotations to their Relay programs. In this case, we need to validate the annotations before partitioning.
    • Output
      • The updated Relay program with subgraphs replaced by sub-functions. All annotations should be removed, and calls should be inserted to invoke the sub-functions (see the sketch below).
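
As a rough usage sketch of this stage, reusing the relay.transform.PartitionGraph pass name that comes up later in this thread (the exact interface is still open):

    from tvm import relay

    # `mod` is a Relay module carrying subgraph_start/subgraph_end
    # annotations, inserted either by the coloring pass or by the user.
    mod = relay.transform.PartitionGraph()(mod)

    # Afterwards, each annotated region is lifted into its own
    # sub-function, the annotations are removed, and the main function
    # calls the sub-functions at the former subgraph boundaries.
    print(mod)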

Codegen - To tell the Relay backend to use external codegen instead of TVM.

  • Invoke different codegen tools from TVM directly. This requires HW partners to register their codegen tool to TVM as a runtime module.
  • During compilation, we can traverse the graph and check the attributes of different subgraphs. For example, an external codegen tool has to be invoked once we find that the attribute of a subgraph is annotated with an external compiler. In the example above, we can generate a runtime module for the 1x1 conv, but we have to invoke external compilers to generate code for the two subgraphs.
    • How to register?
      • HW vendors need to register their compiler as a runtime module and at least be able to deal with the following tasks
        • Ingest a Relay function/module and compile it.
        • Ingest TVM input data structures, e.g. NDArray. TVM feeds data in the NDArray format to the subgraph and expects the external accelerator to execute it and return the output as an NDArray as well. Therefore, HW vendors will need to consider the conversion of TVM data to whatever format is compatible with their compiler.
        • Implement the virtual functions of a runtime::ModuleNode, i.e. SaveToFile, SaveToBinary, GetSource, GetFunction, etc. GetFunction is particularly important because that's how we obtain the function pointer of a subgraph and invoke it at runtime. An example of the registration of the CUDA runtime module is here: https://github.com/dmlc/tvm/blob/master/src/runtime/cuda/cuda_module.cc
    • What APIs we need to expose?
      • The major APIs would be similar to other codegen tools currently baked into TVM, i.e. LLVM, CUDA, etc. A usage sketch follows below.
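
To make the runtime-module contract concrete, here is a sketch of how a compiled subgraph could be invoked from Python; the symbol name "subgraph_0" and the tensor shapes are assumptions for illustration:

    import numpy as np
    import tvm

    # `lib` is a runtime.Module produced by the external codegen;
    # indexing it calls GetFunction under the hood.
    f = lib["subgraph_0"]

    # TVM hands the subgraph NDArrays and expects NDArrays back.
    x = tvm.nd.array(np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
    out = tvm.nd.empty((1, 64, 222, 222), "float32")
    f(x, out)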

Serialization - Save the subgraphs and load them back

  • TVM serializes the built artifact into json, params, and library. What do the subgraphs bring us? Each HW vendor has their own artifacts. For example, they may encode the structure of the subgraph into the library, and they may need the params and even modify them.
  • Serialize the partitioned subgraphs into a form to save on disk.
  • Do we need to let HW partners know what ops are in the subgraph? We should treat a subgraph as a black box: just feed it input data and expect to get the correct output from the external library.
  • How many libraries? We may generate multiple libraries, one for each backend.
    • How to load multiple libraries and let the subgraph invoke the correct library?
    • Can we combine them into a fat library if the external codegen tool is registered to TVM as a runtime module? (See the sketch below.)
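
One plausible answer to the fat-library question, if the external artifact is itself a runtime::Module, is TVM's module import mechanism. A sketch, where extern_mod is assumed to come from the external codegen and lib from TVM:

    import tvm

    # Attach the external module; export_library then serializes both
    # (the external part via its SaveToBinary) into a single artifact.
    lib.import_module(extern_mod)
    lib.export_library("deploy.so")

    # Loading restores the imported module alongside the host code.
    loaded = tvm.runtime.load_module("deploy.so")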

Runtime - Invoke different subgraphs using different libraries

  • Graph runtime and VM runtime.
  • Offload a subgraph to the 3rd party library
  • How to invoke the library and let it take control of the subgraph?
  • Two cases
    • HW vendors have their own runtime.
      • How to coordinate two runtimes?
    • HW vendors don’t have their own runtime.
      • Only use TVM runtime. We still need the library generated by the external compiler to be able to ingest TVM runtime data and finish the execution of a subgraph.

We have an initial implementation here: https://github.com/zhiics/tvm/tree/partitioning, where we provide support for MKLDNN using the DNNL execution engine and a simple experimental version that allows GCC to ingest NDArray and compile a simple graph. Thanks @jroesch for providing many suggestions. Part of the credit should also go to @comaniac for working together.

Any comments and thoughts are welcome :)

@tqchen @wweic @haichen @thierry @ajtulloch @jonso @janimesh @ciphr


Thanks for the proposal. I think the current proposal over-complicates the subgraph runtime and serialization parts.

I would recommend we just focus on consolidating everything around runtime::Module, which hides questions like how to serialize a subgraph and how to invoke libraries (because these can be defined by the specific subclass of the module and do not have to conform to the same standard).

There can be multiple solutions in terms of compilation, but the key would be the specification for annotating a function and invoking a custom compilation function.

Let us consider creating an RFC to specify these two core issues, both of which are going to be stable.

Then we open another one to discuss possible implementations on the compiler side, which in my opinion might take a few iterations, could use different solutions, and is subject to change.

The description of the proposal might be too abstract to picture, but the implementation itself is straightforward. For example, our POC branch shows MKL DNN support with the proposed methodology. As you suggested, all MKL DNN related compilation and runtime details are in the Module and hidden from the other parts of Relay. Maybe we can refine the RFC to focus more on the implementation plan instead of the high-level ideas.

@tqchen You are absolutely right. I believe our current implementation aligns with your expectation. I probably wrote too much about the serialization and runtime parts. We are currently invoking through the VM. Most of the things in these two parts are not strictly necessary.

Here is one possible idea. Let us try to turn the RFC into a tutorial – how to add customized runtime compilation to TVM.

Then we talk about the following things

Runtime

  • Things to implement in runtime/contrib/xyz
  • What is the storage format: how to add serialization to runtime::Module
  • How does the other runtime interact with the TVM runtime (e.g. via calling PackedFunc)

Compilation

  • How to implement a compiler that generates the specific module.
    • Perhaps we want to highlight that things are different from normal TVM codegen, as the inputs would usually be Relay
  • How to hook the customized compilation into relay.build
    • What is the convention that specifies the customized target (e.g. an attribute on the function)? A sketch follows below.
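
For the target convention, a minimal sketch of tagging a function with an attribute; both the with_attr API and the "Compiler" attribute name are assumptions here, since this convention is exactly what the RFC needs to pin down:

    from tvm import relay

    x = relay.var("x", shape=(1, 3, 224, 224))
    w = relay.var("w", shape=(16, 3, 3, 3))
    sub_fn = relay.Function([x, w], relay.nn.conv2d(x, w))

    # Tag the sub-function with its desired external target so that
    # relay.build can dispatch it to the matching codegen.
    sub_fn = sub_fn.with_attr("Compiler", "mkldnn")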

To summarize my thoughts a bit: currently, we have a good idea about how to implement the feature – through a runtime module and a compiler callback as shown in the POC, thanks to @comaniac.

However, there are quite a few design decisions that can be in flux. Because it is an important feature, let us try to see if we can be as picky as possible when pinning down these decisions.

In other words, what are the key design decisions? Here are some example questions:

  • The way to do serialization: we agreed on using Module
  • The way to do the interface: not too clear, e.g. should we take a function:
    • relay.Module -> runtime.Module: how does the function name in Relay translate to the symbol name (string) in the runtime.Module?
      • How do other modules know which PackedFunc to call?
  • What is the convention of separation
    • e.g. a split pass that converts a function into multiple sub-functions; those that need special compilation get a special attribute (e.g. target="mkldnn")
      • What is the name of the attribute?
  • How exactly does relay.build invoke these and link things together?

Note that these design decisions affect how we specify the data structure conventions and interfaces, which is going to affect us in the long term. It would be great for each of us to answer these questions, so that we have a clear guideline.
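
To make the interface question concrete, here is one possible shape for the compilation callback; the registration name "relay.ext.mkldnn" is an assumption for illustration:

    import tvm

    @tvm.register_func("relay.ext.mkldnn")
    def mkldnn_compiler(func):
        # `func` is the Relay function tagged for this backend. The callback
        # should return a runtime.Module whose exported symbol names match
        # what the partitioner recorded, so that other modules know which
        # PackedFunc to call.
        raise NotImplementedError("lowering to DNNL goes here")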


@tqchen I agree we can turn it into a tutorial. That’s exactly what I want to do as well.

  • Interfaces
    We can currently take either a module or a function. For example, we pack all individual subgraphs belonging to the same codegen tool into a Relay Module and send it to that codegen. The current mapping is actually not from each Relay Function to a symbol in the runtime Module, but from each op in a subgraph to a symbol (e.g. subgraph1_add, subgraph2_add). We can discuss this further.

  • Convention of separation.
    The current name is “compiler”. We could use “target” or “extern_target” as well. This one could also be integrated into the partitioning interface, i.e. relay.transform.PartitionGraph(extern_target="mkldnn")(mod)

  • How can relay.build invoke it?
    Currently, we invoke it through the VM. A separate pass, relay.transform.PartitionGraph, is used to make the program contain subgraphs. We haven't added it to the build pipeline yet. I plan to move it into the build pipeline as well by adding something to the build or build_config interface (how about adding extern_target=None to build?). We need to think a little more about how we can link things together. I am not sure if we can link some extern modules together and export them as one blob.

Yeah, I totally understand that this design will impact our code base. Let’s make sure we can achieve consensus on the interfaces before we send the PR.

Given that @jonso originally proposed the idea, it would be great if you could also work together:

  • To get everyone’s input and summarize your key choices
  • “Challenge” the RFC by asking hard questions that might affect the design
    • How would we do X

@zhiics @comaniac I’m happy to work with you on the design. A couple of thoughts I had:

  • I think that op-level annotation and registration through a decorator is a clean way to specify ops which are compiled and run by an external module

  • Should we split up compilation / runtime? For example, each runtime should have its own runtime::Module that can do serialization independently. It will probably be best to put compilation in a different place.

  • For graph partitioning and compiling in relay.build, I can think of a couple of solutions:

    • Add a new field to the target string. For example llvm -mcpu=core-avx2 -extern=mkldnn. In relay.build, we can extract the extern target, partition the graph, compile, and generate the runtime.
    • Add a parameter to the build_config as you suggested. We can then extract that value and partition, compile, and generate the runtime.

    Personally, I feel that the target string option is the cleanest. What do you think?

  • I don’t think that naming functions in a module by their op name is sufficient. For example, say I am plugging in the TensorFlow runtime so I can run unsupported ops. Before partitioning, I can have an IR pass to group nodes that are entirely enclosed in TensorFlow ops into a single node. Running this whole subgraph in the TensorFlow runtime will be more performant than running each node individually. Maybe the naming convention can be [external runtime name]_op1_op2_op3...

Let me know if you want to schedule a call to discuss the details or if I can help with the implementation in any way :) This will be a really useful feature, and I’m glad to see that other people think so too.

@jonso thanks for the comments.

  • We put both compilation and runtime into runtime::Module for two reasons. First, it is consistent with the current third-party support such as CBLAS and CUBLAS. Second, it is simpler for contributors since they only need to maintain one place.

  • I’m fine with both for relay.build.

  • We do not name functions by their op name but by an internally assigned ID, which serves the purpose you mentioned. @zhiics is working on the tutorial for this RFC as Tianqi suggested, and I believe it will be clearer after going through it, but let me provide brief use cases here to give some flavor:

Our proposal includes two ways of graph partitioning: op-level annotation and graph-level annotation. For op-level annotation, developers only need to specify which ops are supported by the backend, and our partitioning pass will generate a function call for each supported op. Currently every single supported op becomes a function, but we plan to develop an algorithm that groups supported ops into one function to reduce the overheads you mentioned. On the other hand, while the benefit of grouping ops is obvious in the graph runtime, it is moderated in the interpreter and VM. That’s also why we defer the algorithm development to the second phase.

For graph-level annotation, we allow developers to write a Relay pass that inserts subgraph_start and subgraph_end annotations into the graph. As you can imagine, this gives developers the freedom to implement any partitioning algorithm they design for their backend. This can also be a workaround for the first phase of this RFC, where we don’t yet have a well-developed partitioning algorithm.

For the TVM-unsupported ops, my suggestion is that since our POC branch has a mechanism to check whether an op is supported by a specific backend, we can treat all TVM-unsupported ops as customized ops when converting the model from TF, and let this mechanism offload unrecognized ops back to TF. As a result, your use case falls into this RFC and you can directly work on our POC with all our features reused.


@jonso Thanks for making these points and I am very glad to work together. Most of the questions have been answered by @comaniac. One thing is that putting extern in the target string might not be sufficient because 1) we would need to change the way the target is parsed, and 2) what if multiple targets are involved? This may not be quite possible for now. Directly adding an extern_target option to build or build_config might be simpler.

This is the tutorial we have for now:

I plan to iterate on it to clean up the branch and integrate the points we agreed on here once the memory PR is merged.


This is a really well-written tutorial :) very easy to understand.

  • I personally feel that the functions at python/relay/backend/op/contrib/gcc/extern_op.py slightly overcomplicate the registration logic. It requires us to explicitly define all possible ops in python/tvm/relay/op/contrib/extern_op.py as well as their backend support in the subfolders. This can get a little confusing.

    It seems like it would be easier for python/relay/backend/op/contrib/gcc/extern_op.py to register the operators directly. They can use a decorator like @reg.register_extern_op("multiply", "my_external_compiler_name")

  • I also totally understand the point of having codegen be side-by-side with runtime module for developer simplicity, but it feels weird to have external runtimes be under the relay subfolder.

@jonso We actually have it

I am also aware of your second point. The reason I put it there is that putting it under the TVM runtime is not good, as it needs to take a Relay expr/module as input. We can discuss where a better place to put it would be.

Good discussions. Some high level thoughts.

I like the idea of separating runtime and compiler, as it forces us to think more carefully about the runtime requirements (e.g. the runtime cannot touch compiler data structures).

We should avoid complicating the build config and think about multiple-target cases. One way is to make use of special function attributes that tag the desired target of each function in the module.

@zhiics I meant enforcing this registration inside the subfolder (contrib/gcc/extern_op.py). In this way, we don’t have to worry about defining all possible external operators in the outer folder, instead letting individual libraries handle it themselves. I think that will save us some headache in the long term.

@tqchen can you provide a little more detail on the multiple target case? If the overall goal is to automatically call PartitionGraph() based on external targets a user specifies, where would be the best place for the user to specify it? I suppose we can pass a list of targets to relay.build. These targets can be in order of preference.

Also, by module are you talking about the runtime module? If so, I definitely agree that we will need to add extra attributes so the runtime knows which module to call into.

Our concerns with putting the supported operators in each codegen are readability and maintenance. If we skip the central extern op module definition, we will need every extern op function to use a decorator, similar to the current TOPI implementation:

@reg.register_extern_op("nn.conv2d", "gcc")

Considering that most external codegen writers are not familiar with the TVM codebase, this could be confusing and hard for others to maintain. That’s why we chose the most straightforward approach. In fact, the current TOPI implementation suffers from this problem as well, and we do plan to somehow get rid of it. In addition, even if a writer wants to support an op that has never been supported before and forgets to put it in the outer module, it is easy to find out at an early stage. More importantly, it provides a clear list of what kinds of ops can be supported externally.

Got it, I didn’t realize that there was already some discussion on getting rid of this pattern in TOPI. In that case, the solution is fine with me :)

re multiple targets.

The main benefit of embedding this information into the Module itself is to make the IR module self-contained. If we serialize and then load back a module with some functions already pre-specified to be built with certain backends (e.g. an NPU), we don’t have to specify this additional context information in the build. It also prevents the options of relay.build from growing as we add more features.

By attributes I also meant attributes on the functions of IR Module.

@tqchen Just to double check: for multiple targets, you mean we still need to pass extern_target=xx to the build interface so that we can later attach it to the subgraphs/functions, right? Or do we add an attr field to the Module and let users annotate it themselves?

I’m also a little unclear here - the user shouldn’t set the external target of each op themselves; we should handle it automatically if the op is supported.

After solving this, maybe we can send out an initial PR for more specific code comments?