[RFC] Op based annotation for external codegen

Background

We (@zhiics and @comaniac) have merged the major infrastructure and tutorials of bring-your-own-codegen (BYOC). Currently, we expect users to write a custom pass to annotate a Relay program and then send it for partitioning. This offers great flexibility. However, one feature we removed from the infra is op-based annotation, which lets developers/vendors simply specify whether an operator is supported by their own codegen. The annotation work is then handled by TVM, so the effort required from developers/vendors is minimized.

For example, we can have various methods to merge operators into a subgraph that can be offloaded to an external codegen. A straightforward approach is merging the ops greedily so that we offload as large a subgraph as possible to an external accelerator/backend. This eases the effort required from vendors, but of course, we still allow them to bring their own pass to annotate a program if they are not satisfied with the greedy approach.

Considerations

The reason we removed it from previous BYOC PRs was that we were still considering 1) how to integrate it with op strategy (PR #4644), and 2) a proper interface for developers/vendors.

After consideration, we concluded that graph annotation for external codegen is actually orthogonal to op strategy, as graph annotation is done before vendor-independent Relay passes (e.g., constant folding) have been executed. Op strategy, on the other hand, mainly focuses on selecting compute and schedule functions for ops that have been determined to be compiled by TVM, so it happens after all passes are complete.

Also, op-based annotation is orthogonal to the recently merged composite pass (PR #4771), which accepts user-written Relay graph patterns and uses pattern matching to form subgraphs. Our proposed op-based approach saves users from writing possibly countless patterns.

Implementation in TVM

For each op, we register a corresponding check function, and the checker is invoked at compilation time to indicate whether we should annotate the op for the 3rd-party accelerator to offload. For example, the following code shows an implementation of the helper for checking whether an op should be offloaded to a given compiler:

def register_external_compiler_helper(op_name):
    @reg.register_external_compiler(op_name)
    def _register_wrapper(attrs, args, comp):
        return get_external_compiler(comp, op_name)(attrs, args)
    return _register_wrapper

# Register all Relay ops
register_external_compiler_helper("nn.conv2d")
register_external_compiler_helper("nn.relu")
register_external_compiler_helper("add")
...
  • Note that comp is the module name of the 3rd-party compiler (e.g., the name dnnl indicates the module python/tvm/relay/backend/contrib/dnnl/dnnl.py); get_external_compiler uses hasattr and getattr to obtain the vendor-specified checkers (implemented in the following section). A possible sketch of this lookup is shown below.
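
For reference, here is a minimal sketch of how get_external_compiler could perform this lookup. The module path and the op-name-to-checker-name mapping are assumptions of this sketch, not a final design:

import importlib

def get_external_compiler(comp, op_name):
    # Import the vendor module, e.g. tvm.relay.backend.contrib.dnnl.dnnl.
    vendor_mod = importlib.import_module(
        "tvm.relay.backend.contrib.{0}.{0}".format(comp))
    # Map the Relay op name to a checker name, e.g. "nn.conv2d" -> "conv2d".
    checker_name = op_name.split(".")[-1]
    if hasattr(vendor_mod, checker_name):
        return getattr(vendor_mod, checker_name)
    # Fall back to rejecting the op if the vendor did not provide a checker.
    return lambda attrs, args: False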

Required Implementation by Developers/Vendors

  • HW partners/3rd-party library developers only need to attach an op to a provided helper, or they can implement a simple checker for an op to specify whether it is supported under certain conditions. For example, they can do any of the following:
    _register_external_op_helper("conv2d")
    _register_external_op_helper("conv2d", True)
    _register_external_op_helper("conv2d", False)
    
    def conv2d(attrs, args):
       if ...:
           return True
       return False
    
    • Here, _register_external_op_helper("conv2d") is shorthand for
    def conv2d(attrs, args):
        return True
    
    • Note that HW partners do not need to register this function; they just need to implement it under python/tvm/relay/backend/contrib/compiler_name/comp.py so that it can be discovered and imported dynamically.
  • A Relay IR pass, AnnotateCompiler, implemented in Python, will invoke the above function, insert annotations into the graph, and run the greedy algorithm (TBA) to merge ops. A rough sketch of the annotation step follows this list.
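
To make the annotation step concrete, here is a rough sketch, assuming the existing compiler_begin/compiler_end annotation ops are reused; this is not the final implementation and it omits the greedy merging:

from tvm import relay
from tvm.relay.expr_functor import ExprMutator
from tvm.relay.op.annotation import compiler_begin, compiler_end

class ExternalAnnotator(ExprMutator):
    """Wrap calls to supported ops with compiler_begin/compiler_end."""

    def __init__(self, compiler, is_supported):
        super().__init__()
        self.compiler = compiler          # e.g. "dnnl"
        self.is_supported = is_supported  # callable(op_name, attrs, args) -> bool

    def visit_call(self, call):
        new_args = [self.visit(arg) for arg in call.args]
        op_name = getattr(call.op, "name", None)
        if op_name and self.is_supported(op_name, call.attrs, call.args):
            # Annotate inputs and output of the supported op.
            ann_args = [compiler_begin(arg, self.compiler) for arg in new_args]
            new_call = relay.Call(call.op, ann_args, call.attrs, call.type_args)
            return compiler_end(new_call, self.compiler)
        return relay.Call(call.op, new_args, call.attrs, call.type_args)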

How to Use

mod = tvm.IRModule()
seq = Sequential([transform.AnnotateCompiler("dnnl"), transform.PartitionGraph()])
mod = seq(mod)
graph, lib, params = relay.build_module(mod, "llvm")

Any comments are highly appreciated :)

cc @tqchen, @masahi, @mbaret, @jonso, @haichen

Thanks for this proposal! This broadly looks like the correct direction to me. I've got a few initial comments.

  1. I'm not sure if it's generally true that we want to perform the partitioning before any Relay pass is executed. For example, the QNN Relay dialect produces qnn.conv2d which later gets lowered to a series of ops that includes a normal conv2d. We may have an external codegen that can't do qnn.conv2d, but does support conv2d and so would prefer to annotate the graph at a lower level. Another example of this is that some external codegens will want to see the actual values of the weights and so need to happen after BindConstantParams has been called. Both of these suggest to me that the partitioner cannot exist entirely outside of relay.build_module.

  2. Will we allow composite functions to be treated as 'ops' for the annotation? This is necessary for the case where a series of operators are supported but the operators on their own are not.

  3. I'm interested to know the use case for custom annotation at this point. If we got rid of it, we could just directly call the check function in the graph partitioner, which would be a lot simpler.

  4. Do you think we could perform all the heterogeneous partitioning in this pass (e.g., GPU/CPU offload)?

Overall, the design looks good to me.

I think it is not safe to assume that the annotator will be called before any other Relay passes. The flow can look something like this:

  • Pre-partitioning passes - This might also include Relay passes that are not part of the TVM codebase and are very specific to hardware, but still need to be done on the whole graph. Basically, we need infra to specify a user-defined sequence of Relay passes that can be run before we call the partitioner.
  • Annotation/partitioning
  • Post-partitioning - For each subgraph, run different sets of Relay passes. This can be hidden inside the External Codegen's frontend.

@mbaret @janimesh Thanks for your quick response.

Yeah, let's clarify it a bit. We can classify the passes into at least two categories in this situation: vendor-dependent (like fusion, layout transformation, etc.) and vendor-independent (like constant folding, etc.) passes. Annotation/partitioning should happen before vendor-dependent passes, but it could certainly happen after the vendor-independent passes.

I wonder whether the vendor-dependent passes need to run first. It might be simpler if they were run after partitioning on the partitioned subgraphs so that they don't mutate parts of the graph that end up going through TVM codegen.

I also wonder whether it might be appropriate to split the lowering into two phases. The first phase will just lower the graph in a way that conserves information and this is where you can insert your annotation passes. The second phase will be after partitioning where backend specific lowering is done depending on where the subgraphs were offloaded. All of the passes that destroy information can be moved after partitioning (like the 'combine_parallel' passes).

  • Will we allow composite functions to be treated as 'ops' for the annotation? This is necessary for the case where a series of operators are supported but the operators on their own are not.

The composite functions will not be treated as ops for op-based annotation, because all ops have to be registered in advance. We may implement op-based and function-based annotations in the same pass, and let the function-based part deal with composite functions.

  • I'm interested to know the use case for custom annotation at this point. If we got rid of it, we could just directly call the check function in the graph partitioner, which would be a lot simpler.

We will keep custom annotation as an option for developers to make sure the annotation mechanism is capable of covering all possibilities.

  • Do you think we could perform all the heterogeneous partitioning in this pass (e.g., GPU/CPU offload)?

I think that's doable. @zhiics do you have more to comment?

I also wonder whether it might be appropriate to split the lowering into two phases. The first phase will just lower the graph in a way that conserves information and this is where you can insert your annotation passes. The second phase will be after partitioning where backend specific lowering is done depending on where the subgraphs were offloaded. All of the passes that destroy information can be moved after partitioning (like the 'combine_parallel' passes).

Although this proposal seems reasonable to me, I'll pass this to zhiics as he worked on the pass manager.

One interesting thing we can see here is that we start to introduce op-related attributes for a specific pass (AnnotateCompiler).

It would be great to discuss the potential API design and namespacing.

For example, given that the AnnotateCompiler pass is specific to "dnnl", it would be useful to keep some of the attributes local to dnnl.py and only make use of set_attr and get_attr in op as generic APIs.

So the current proposal seems not too different from the original implementation that was removed in the annotation PR, is that right?

I also want to hear more about how this will play with composite functions. Op-based or function-based, composite functions need to be treated the same way as any other op during the annotation pass (that was the original motivation). So if the function-based pass is supposed to deal with composites, it needs to deal with regular ops just like the op-based one would, so we might as well implement only the function-based pass.

+1 on having the pass also deal with composite functions - having a unified way to offload both composite functions and single ops to external runtimes would be hugely helpful!

@masahi @jonso Yes, we can also handle composite functions in the AnnotateCompiler pass, as we can check whether the CallNode's op is a composite function or not.

@mbaret It is possible, but it needs more thought, as CPU/GPU heterogeneous execution is currently taken care of by TVM codegen and runtime, which is part of the TVM build pipeline.

@mbaret We can separate the passes into two stages. It needs some refactoring.

@tqchen Could you please clarify a bit more about the set_attr and get_attr stuff? I think I don't fully understand. Thank you.

I think what we really need is a more principled approach to defining composite ops and registering them as regular Relay ops. After having these composite ops, we can then register attributes to them, such as whether to expand them to elementary ops and whether to use an external codegen or an external library. This can solve many problems. For example, the softmax issue can also be solved using this approach. And of course, we can add customized composite ops/rules for different backends.

The API _register_external_op_helper is quite specific to AnnotateCompiler. I wonder if we can reduce the amount of abstraction and instead use https://docs.tvm.ai/api/python/relay/op.html?highlight=op#tvm.relay.op.register directly with a clear attr name.

@haichen Yes, that's probably the cleanest approach I can think of as well. I just talked to @comaniac about the same idea.

@tqchen Thanks. Let's take a look first.

One thing I'm interested in is how similar this will end up looking to the device annotation mechanism that's already present. The external codegens are going to mostly be tightly coupled to a particular device. In that case this looks to be an extension of the device annotation. For example, we might offload to GPU using ACL/TVM/TensorRT or to CPU using DNNL/ACL/TVM or to an accelerator using some custom compiler.

This is a fair consideration to me, but we need more thought and refactoring, as @zhiics mentioned. Specifically, TVM now supports 3 kinds of build pipelines:

  1. TVM codegen (codegen/build stage): This includes LLVM, CUDA, etc. The CPU/GPU heterogeneous execution is also handled in this build process.

  2. Third-party libraries (lowering stage): This includes CBLAS, CUBLAS, CuDNN, etc. Developers manually map Relay ops to corresponding library functions. The mapping is implemented in contrib and triggered during lowering.

  3. BYOC (Relay stage).

As can be seen, those 3 pipelines happen at different stages, so we cannot easily determine the optimal device offloading policy. We would first need a plan to put all mechanisms in the same stage before we can work on this issue.

If it looks to be a significant refactoring effort, then maybe we can consider it out of scope for this RFC. But I think we should ensure we arrive at a design that is flexible enough to keep the option of unifying the 3 pipelines open in the future.

As another quick point relating to API, this is what I currently have to call to get the partitioning to happen:

f = relay.build_module.bind_params_by_name(mod["main"], params)
mod = tvm.IRModule()
mod["main"] = f
pattern_table = get_pattern_table(external_compiler)
mod = relay.transform.MergeComposite(pattern_table)(mod)
mod = relay.transform.AnnotateCompiler(external_compiler)(mod)
mod = relay.transform.PartitionGraph()(mod)

I think this unnecessarily exposes a lot of the implementation to the API when all the user really cares about is that their external codegen is used as part of the partitioning. Maybe we can move all of these passes inside the build function and provide a relatively clean API along the lines of relay.build_module(mod, params, external_codegens=["acl", "dnnl"])?

Thanks everyone for such valuable discussion. Accordingly, we came up with several API design proposals for op-based annotation. We will have separate discussions for other BYOC-related issues that are not directly about op-based annotation.

Annotator Implementation

The annotator will be implemented by developers/vendors under python/tvm/relay/op/contrib/<compiler_name>/external_compiler.py. For simplicity, we use dnnl as the example compiler name in the rest of this post.

A.1: A New Register Helper.

This is the original proposal. We ask developers to implement a checker for each op to indicate if we should annotate that op or not:

def conv2d(attrs, args):
    return True

If the checker simply needs to return True or False, we provide a helper to reduce the effort:

_register_external_op_helper("conv2d")

These two implementations are fully equivalent, and developers can use whichever they prefer.

A.2 Use tvm.relay.op.register Directly

As suggested by @tqchen, we can reuse the Relay op registry instead of introducing a new register:

from ... import op as reg

reg.register("nn.conv2d", "FTVMExternalCompiler", lambda r, g, c: c == "dnnl")

@reg.register("nn.batch_norm", "FTVMExternalCompiler")
def batch_norm(attrs, args, compiler):
    if compiler != "dnnl":
        return False
    return # check with attrs and args

The most important benefit of this approach is that we do not introduce any new APIs. On the other hand, developers have to write one function per op. Of course, we can still add the following practice to the tutorial:

def _reg_op(op_name, supported=True):
    @reg.register(op_name, "FTVMExternalCompiler")
    def _func_wrapper(attrs, args, compiler):
        return supported if compiler == "dnnl" else False
    return _func_wrapper

_reg_op("nn.conv2d")

End-User Interface

For end-users who know nothing about annotation and external codegen, we have the following options to put it all together:

E.1: A separate API

The first approach asks users to explicitly call a special API we provide to perform graph partitioning:

mod = tvm.IRModule()
mod["main"] = # A Relay function
mod = relay.build_extern(mod, external_compiler="dnnl", patterns=[pattern1, pattern2, ...])
relay.build_module(mod, params)

where build_extern is an API calling MergeComposite (if patterns is not empty), AnnotateCompiler, and PartitionGraph sequentially; a minimal sketch is shown below.
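
For illustration, this is a minimal sketch of what build_extern could do under E.1; AnnotateCompiler is the pass proposed in this RFC and does not exist yet, and the API and argument names are still up for vote below:

from tvm import relay

def build_extern(mod, external_compiler, patterns=None):
    passes = []
    if patterns:
        # Group user-provided patterns into composite functions first.
        passes.append(relay.transform.MergeComposite(patterns))
    # Annotate supported ops for the given compiler, then partition.
    passes.append(relay.transform.AnnotateCompiler(external_compiler))
    passes.append(relay.transform.PartitionGraph())
    return relay.transform.Sequential(passes)(mod)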

The advantages of E.1 are high flexibility and extensibility. The drawback is the extra explicit function call. Note that since some passes such as constant folding and QNN legalization should be called before our passes, we will have a follow-up RFC discussing how to separate them from build_module if E.1 gets in.

If you vote for this approach, please also vote for the names of each API and argument. We provide some candidates for each of them.

  • build_extern:
    • partition_module
    • extern_preprocess
  • external_compiler:
    • ext_compiler
    • compiler
  • patterns:
    • composite_patterns
    • fuse_patterns

E.2 Integrate to build_module

Another approach suggested by @mbaret is integrating those passes into the current TVM build process. Specifically, users only need to write one line of code:

relay.build_module(mod, params, external_compiler, patterns)

The advantage of E.2 might be a simpler programming model. The drawback is that it requires more changes to the TVM build module and compile engine. Again, if you vote for this approach, please also vote for the names of each API and argument.

Please vote for both the A and E approaches, as well as the API and argument names. You are also welcome to share your thoughts. Thanks :)

def _reg_op(op_name, supported=True):
    @reg.register(op_name, "FTVMExternalCompiler")
    def _func_wrapper(attrs, args, compiler):
        return supported if compiler == "dnnl" else False
    return _func_wrapper

_reg_op("nn.conv2d")

is quite clean. I am also okay with reg.register("nn.conv2d", "FTVMExternalCompiler", lambda r, g, c: c == "dnnl"), but it needs a bit more boilerplate.

For placeholder1 (the API name), I prefer build_extern a bit more, but it is actually not a build; it is more about some preprocessing of the graph.

For placeholder2 and placeholder3 (the argument names), I would vote for compiler and fuse_pattern, respectively. We may want to avoid directly using pattern, as Relay already has Pattern and PatternNode.

Regarding A, I think it will be quite valuable to have composite functions and operators registered for annotation via the same (or a very similar) API, even if under the hood we use a different mechanism to support them. This, to me, makes an A.1-style approach preferable, as there's more flexibility to have it behave differently for functions vs operators.

For E.1, I'd just add the drawback that it implies all partitioning takes place on an unlowered Relay graph. I don't think this is generally true (at least it hasn't been for me), particularly when considering QNN operators.

Even if we adopt A.2, we can still support composite function registration by treating a composite function as a custom op. Since the A.1 implementation also makes use of relay.op.register, these two approaches basically share the same mechanism.

For the problem you mentioned in E.1 (correct me if I misunderstood anything), you can still run any pass before the external build pipeline, so E.1 is just a handy API for end-users to invoke the external compiler flow.