[RFC] naming and API signature for external compilation

zhiics · January 8, 2020, 6:29pm

We have already merged the codegen and runtime PRs for “Bring Your Own Codegen to TVM”. Another piece of this work is allowing users to conveniently annotate the given Relay program and then partition it into regions (sub-functions) that will be handled by various compilers (i.e. the default ones in the TVM stack and external ones), https://github.com/apache/incubator-tvm/pull/4570.

In order to do this, we mainly have two passes, one for annotation and the other for partitioning. Annotation allows users to annotate Relay expressions with boundaries (i.e. compiler_begin and compiler_end), indicating that the annotated region (e.g. a single CallNode containing an operator) should be handled by the provided compiler. After annotation, we invoke the partitioning pass to clean up the annotations, extract the annotated regions, and pack them into sub-functions that will use external codegen.
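To make the two-pass flow concrete, here is a minimal pure-Python sketch, assuming a toy program represented as a flat list of op names rather than a Relay graph. The names `annotate`, `partition`, and the `supported` predicate are hypothetical stand-ins for the annotation and partitioning passes, not TVM APIs.

```python
from typing import Callable, List, Tuple, Union

# Toy IR (assumption): a program is a flat list of op names, not a Relay graph.
Op = str
# An annotated element is either an op or a ("compiler_begin"/"compiler_end", name) marker.
Annotated = Union[Op, Tuple[str, str]]


def annotate(ops: List[Op], compiler: str,
             supported: Callable[[Op], bool]) -> List[Annotated]:
    """Wrap every op the external compiler supports in begin/end markers."""
    out: List[Annotated] = []
    for op in ops:
        if supported(op):
            out += [("compiler_begin", compiler), op, ("compiler_end", compiler)]
        else:
            out.append(op)
    return out


def partition(annotated: List[Annotated]) -> List[Tuple[str, List[Op]]]:
    """Strip the markers and pack each annotated region into a 'sub-function'.

    Returns (target, ops) pairs; "default" marks ops left to the TVM backend.
    """
    regions: List[Tuple[str, List[Op]]] = []
    current = None  # (compiler, ops) of the region currently being collected
    for item in annotated:
        if isinstance(item, tuple) and item[0] == "compiler_begin":
            current = (item[1], [])
        elif isinstance(item, tuple) and item[0] == "compiler_end":
            regions.append(current)
            current = None
        elif current is not None:
            current[1].append(item)
        else:
            regions.append(("default", [item]))
    return regions


ops = ["nn.conv2d", "nn.relu", "nn.softmax"]
marked = annotate(ops, "dnnl", lambda op: op in {"nn.conv2d", "nn.relu"})
print(partition(marked))
# [('dnnl', ['nn.conv2d']), ('dnnl', ['nn.relu']), ('default', ['nn.softmax'])]
```

In this sketch each supported op becomes its own region; merging back-to-back regions into a larger subgraph would be a separate step.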

Here, we’d like to get some thoughts/comments from the community about some of the naming and API designs.

For the annotation and partitioning API, there are a few design choices:

Option 1: Directly invoke the passes

mod = relay.transform.AnnotateExternalCompiler("xx")(mod)
mod = relay.transform.PartitionGraph()(mod)
graph, lib, params = relay.build(mod, target="llvm")

Of course, these two passes could be packed as a SequentialPass and applied together.
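In TVM, this packing would use the Sequential pass container (e.g. `relay.transform.Sequential` in releases from around this time). The pure-Python sketch below only models the idea, assuming a "pass" is any module-to-module function; `sequential`, `annotate_pass`, and `partition_pass` are hypothetical stand-ins that merely record their order.

```python
from functools import reduce
from typing import Callable, List

# Toy assumption: a pass is a function from module to module
# (real Relay passes are Pass objects applied as pass_obj(mod)).
ModulePass = Callable[[dict], dict]


def sequential(passes: List[ModulePass]) -> ModulePass:
    """Compose passes so they run left to right, like a sequential pass."""
    def run(mod: dict) -> dict:
        return reduce(lambda m, p: p(m), passes, mod)
    return run


# Hypothetical stand-ins that only log which pass ran.
def annotate_pass(mod: dict) -> dict:
    return {**mod, "log": mod["log"] + ["annotate"]}


def partition_pass(mod: dict) -> dict:
    return {**mod, "log": mod["log"] + ["partition"]}


pipeline = sequential([annotate_pass, partition_pass])
print(pipeline({"log": []}))  # {'log': ['annotate', 'partition']}
```

The ordering matters: partitioning only sees the regions that annotation has already marked.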

Option 2: Have a separate build pipeline for such a case

def build_external_compiler(mod, target="xx"):
    mod = relay.transform.AnnotateExternalCompiler(target)(mod)
    mod = relay.transform.PartitionGraph()(mod)
    return mod

mod = relay.build_external_compiler(mod, target="xx")
graph, lib, params = relay.build(mod, target="llvm")

Option 3: Let users set an attribute on a function and invoke annotation with no parameters

func = func.set_func_attr("compiler", "xx")
mod["func"] = func
mod = relay.transform.AnnotateExternalCompiler()(mod)
mod = relay.transform.PartitionGraph()(mod)
graph, lib, params = relay.build(mod, target="llvm")

Among these three options, option 1 seems the most obvious, option 2 hides the external pipeline from users, and option 3 seems more extensible than the others (e.g. if we want to support multiple external compilers). Please share your opinion, or bring up other options if you have them.
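A minimal sketch of why option 3 extends naturally to several external compilers, assuming a toy `Function` class and a hypothetical `annotate_by_attr` helper (neither is a TVM API, and "ccompiler" is only an illustrative compiler name). Because the target compiler is stored on each function rather than passed to the pass, a single parameterless pass can route different functions to different compilers.

```python
from typing import Dict, List, Optional


class Function:
    """Toy stand-in for a Relay function: a list of op names plus attributes."""

    def __init__(self, ops: List[str], attrs: Optional[Dict[str, str]] = None):
        self.ops = ops
        self.attrs = dict(attrs or {})

    def set_func_attr(self, key: str, value: str) -> "Function":
        # Returns a new function instead of mutating, as in option 3's usage.
        return Function(self.ops, {**self.attrs, key: value})


def annotate_by_attr(mod: Dict[str, Function]) -> Dict[str, str]:
    """Parameterless annotation: each function names its own compiler.

    Functions without a "compiler" attribute stay on the default TVM path.
    """
    return {name: f.attrs.get("compiler", "default") for name, f in mod.items()}


mod = {
    "f0": Function(["nn.conv2d", "nn.relu"]).set_func_attr("compiler", "dnnl"),
    "f1": Function(["nn.dense"]).set_func_attr("compiler", "ccompiler"),
    "main": Function(["nn.softmax"]),
}
print(annotate_by_attr(mod))
# {'f0': 'dnnl', 'f1': 'ccompiler', 'main': 'default'}
```

The trade-off is granularity: the attribute lives on a whole function, so it cannot by itself mark sub-regions inside one function.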

Naming
1. AnnotateExternalCompiler doesn't seem like a great name. What are some better ones? AnnotateCompilationBoundaries? AnnotateCodegenRegion? AnnotateCompiler?
2. PartitionGraph. We know @jared doesn’t like this name. We kept it because the term is also widely adopted by other frameworks.

Please share your thoughts.

@tqchen @jroesch @yzhliu @haichen @comaniac @masahi @ramana-arm @jonso

masahi · January 6, 2020, 3:33am

I want to see the effect of the AnnotateExternalCompiler pass too, so I won’t be using build_external_compiler. Besides, users can implement this function as a “userspace” utility if they find it useful.

zhiics · January 6, 2020, 7:07pm

cc @thierry as well.

zhiics · January 8, 2020, 5:26pm

I would probably go for option 1, because option 3 could be a little confusing: we may want to annotate the sub-functions in a function instead of the function itself. For complicated annotation, a customized annotation pass is probably still the preferable approach.

thierry · January 8, 2020, 11:01pm

Thanks for sharing your design thoughts on the naming and API signature. I like option 1 as well since it’s quite clean.

In terms of naming, I like AnnotateExternalCompiler. PartitionGraph is indeed a misnomer, but it is widely understood among SysML folks.

zhiics · January 9, 2020, 6:47am

@thierry @masahi Thanks for sharing your thoughts.

BTW, I think I forgot to talk about the register_annotate_compiler interface.

We introduced it so that different vendors can decide whether an op should be offloaded to their codegen. We provide users with this template so that they only need to focus on their operator list and decide, for a specific op, whether they want to generate code using their own compiler.

The dispatcher is like the following:

@reg.register_annotate_compiler("nn.conv2d")
def annotate_conv2d(attrs, args, compiler):
    """Check if the provided compiler should be used for conv2d.
    """
    return get_annotate_compiler(compiler, 'conv2d')(attrs, args)

get_annotate_compiler looks up, under the namespace of the provided compiler, the function that decides whether the op should be code-generated by that compiler. Users therefore only need to implement a function like the following for the ops they are interested in (other ops fall back to the TVM backend):

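def conv2d(attrs, args):
    """Check if the external codegen should be used.
    """
    # Illustrative condition only: replace with the patterns in args
    # and/or attrs that the external compiler actually supports.
    return attrs.data_layout == "NCHW" and attrs.groups == 1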

We found this useful because many vendors only need to work within their own namespace to implement the logic, while the dispatching, annotation, and partitioning are transparent to them. This simplifies the widely used whitelist-based approach, in which vendors would otherwise need to provide a list of ops that require the external compiler and feed it to the annotation and partitioning passes (this is the most common case we’ve seen when talking to vendors). That approach seems less clean than the template approach.

We currently have only two namespaces, one for dnnl and the other for some simple ops that can be compiled by a C compiler. A namespace for TRT can be added later when we work on that integration.

Does the API sound good to the community?

@comaniac please add anything that you think I’ve missed here. We highly appreciate any thoughts and/or comments.

comaniac · January 9, 2020, 7:40am

I prefer option 1 as well for its cleanliness and extensibility. Regarding the annotation methodology, we know that some people may raise the issue of op fusion support, since an external compiler (e.g. MKLDNN) may also need op fusion to achieve better performance. Here are 3 approaches to supporting op fusion:

[Manual Fusion] This is the simplest and already supported solution. Recall that our current implementation already supports customized annotators, meaning that a vendor can implement her annotator to cover multiple ops in a pair of compiler_begin and compiler_end for the external compiler.

[Semi-Automated Fusion] Alternatively, we could design a simple but general programming model for vendors to specify fusion patterns, and leverage graph pattern matching algorithms to fuse ops. This is practical but requires brainstorming on the programming model design, so we defer it to the next step.

[Fully-Automated Fusion] Ideally, our graph partition pass should magically figure out which ops should be fused for a specific external compiler. This is difficult, however, due to different fusion rules required by different external compilers. We may need another AutoTVM to search for all possible fusions.
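The semi-automated option is still open for design, but a minimal sketch helps show its shape, assuming the graph is a linear chain of op names and a vendor declares fusion patterns as op-name tuples. Real Relay graphs are DAGs, so an actual pattern matcher would need to follow dataflow edges; `match_patterns` is a hypothetical helper, not a proposed API.

```python
from typing import List, Sequence, Tuple


def match_patterns(ops: Sequence[str],
                   patterns: List[Tuple[str, ...]]) -> List[List[str]]:
    """Greedily group a linear op sequence by vendor-declared fusion patterns.

    At each position the longest matching pattern wins; unmatched ops
    become single-op groups (candidates for the TVM backend).
    """
    # Drop empty patterns (they would never advance the cursor), longest first.
    by_len = sorted((p for p in patterns if p), key=len, reverse=True)
    groups: List[List[str]] = []
    i = 0
    while i < len(ops):
        for pat in by_len:
            if tuple(ops[i:i + len(pat)]) == pat:
                groups.append(list(pat))
                i += len(pat)
                break
        else:
            groups.append([ops[i]])
            i += 1
    return groups


patterns = [("nn.conv2d", "nn.bias_add", "nn.relu"), ("nn.conv2d", "nn.relu")]
ops = ["nn.conv2d", "nn.bias_add", "nn.relu", "nn.max_pool2d", "nn.conv2d", "nn.relu"]
print(match_patterns(ops, patterns))
# [['nn.conv2d', 'nn.bias_add', 'nn.relu'], ['nn.max_pool2d'], ['nn.conv2d', 'nn.relu']]
```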

Of the above 3 approaches, we have supported the simplest one, and the register_annotate_compiler interface @zhiics mentioned is also a first step toward the other two. We will start investigating semi-automated fusion after the annotation PR has been merged. Consequently, this RFC and its corresponding PR focus on manual fusion (customized annotators) and op-level annotation.

Leo-arm · January 9, 2020, 8:51am

@comaniac This is an interesting point. I can envision an accelerator that combines a large number of operators into a stream of commands, which are then handed off to the accelerator in one operation. This means there is no simple one-to-one, or even few-to-one, mapping; the partitioning could potentially gobble up the entire graph. It might be good to start a discussion around this soon.

A separate issue is that after an initial annotation, a compiler may fail to compile the annotated part of the graph. This could be due to supporting a specific operation but not in combination with previous or subsequent operations. As pointed out, an AutoTVM-like solution where different combinations are tried could potentially deal with this. Are there any thoughts on how this might work?

comaniac · January 9, 2020, 9:08am

We don’t have a clear picture of it yet. One of the most important reasons is the lack of driving applications. I haven’t seen an accelerator that performs well with dramatically different fusion behaviors for different models. Most cases are rather straightforward (e.g., always fuse conv2d+bias_add+relu+pooling). We will need to first identify some practical cases to drive the methodology design, but that’s a bit out of the scope of this topic. Like you, we want to start such a discussion soon, and we also want to move one step forward ASAP. That’s why we are eager to collect naming comments in this RFC.

zhiics · January 9, 2020, 5:12pm

Quoting @Leo-arm: “A separate issue is that after an initial annotation, a compiler may fail to compile the annotated part of the graph. This could be due to supporting a specific operation but not in combination with previous or subsequent operations.”

@Leo-arm This is also one of the reasons why we want to provide the template. This way, vendors can look at the inputs (e.g. args) and decide whether they want to use their own codegen for this op.

Leo-arm · January 9, 2020, 5:34pm

The template approach works well for one-to-one mappings, i.e. where a Relay operator maps directly to an external operator. Ideally, we would have something similar for mapping multiple Relay ops to a single external operator. Our case is more complex still, because we can string a set of operators together, potentially the entire graph. There are several approaches here, but a more complex annotator similar to the template could work. The added complexity in our case is that we won’t know whether a sequence of operators will work until we compile all of them, at which point a failure means the build fails, with no means to make a change and retry.

@comaniac is correct; we should probably have a separate design discussion on this, as it strays somewhat from the topic.

On that note, I find the name “template” confusing, but a better name escapes me right now. “Production” might apply to many-to-one mappings, less so to one-to-one.

zhiics · January 9, 2020, 5:47pm

@Leo-arm Makes sense. Let’s reach consensus on the naming and the API first. We will have follow-up discussions/PRs to consolidate partitioning.

masahi · January 9, 2020, 9:55pm

@comaniac told me in the PR that register_annotate_compiler is one of two alternatives for triggering annotation. Shouldn’t we discuss the custom annotator approach too?

Correct me if I’m wrong, but the latter approach is the one I should use if I want to have “subgraphs with more than one op” (as @comaniac commented in the PR above), so if backend vendors want op fusion (which is most likely the case), they would be using the latter approach, no? It seems to me that the custom annotator approach will be used more often if this is the case.

zhiics · January 9, 2020, 10:37pm

@masahi Both methods can achieve it. register_annotate_compiler works at the op level, meaning we will have a compiler_begin and compiler_end for each operation of interest. We then need a pass (like fusion) to intelligently fuse the ones with back-to-back annotations. This requires less involvement from developers (though they may need to provide some general fusion rules).

The other is the latter approach, where users write a separate annotation pass to put these boundaries in the graph. This is a more advanced approach that requires users to have sufficient knowledge of how to annotate and what they can fuse if they want to deliver good performance, i.e. it is more like manual fusion.
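For the op-level path, the step that fuses back-to-back annotations can be sketched minimally, assuming a linear chain of ops where each op already carries its op-level decision as a (target, op) pair. On a real dataflow graph, merging two annotated ops is only safe if it does not create a cycle through a non-annotated op, which this toy ignores; `merge_adjacent` is a hypothetical helper, not part of the PR.

```python
from typing import List, Tuple


def merge_adjacent(per_op: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive (target, op) pairs that share an external target.

    per_op is the op-level annotation result in program order; "default"
    marks ops left to TVM, which are never merged here.
    """
    merged: List[Tuple[str, List[str]]] = []
    for target, op in per_op:
        if merged and target != "default" and merged[-1][0] == target:
            merged[-1][1].append(op)  # extend the currently open region
        else:
            merged.append((target, [op]))
    return merged


per_op = [("dnnl", "nn.conv2d"), ("dnnl", "nn.relu"),
          ("default", "nn.softmax"), ("dnnl", "nn.dense")]
print(merge_adjacent(per_op))
# [('dnnl', ['nn.conv2d', 'nn.relu']), ('default', ['nn.softmax']), ('dnnl', ['nn.dense'])]
```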

zhiics · January 9, 2020, 10:54pm

@masahi let me clarify a little more. Basically, we support:

mod = relay.transform.AnnotateExternalCompiler("compiler_name")(mod)
mod = relay.transform.PartitionGraph()(mod)

and

mod = CustomAnnotationPass()(mod)  # vendor-implemented pass (illustrative name)
mod = relay.transform.PartitionGraph()(mod)

The former works at the operator level. register_annotate_compiler supports it by providing a template, so users only need to decide whether each operator should use external codegen. We might need some rules to fuse the operators together into a larger subgraph.

The latter expects vendors to provide the annotation pass so that we only focus on the partitioning part.

Hopefully, this explains it more clearly.

I think the question is whether we want to provide them with the first option, and if so, whether the provided API is reasonable.

comaniac · January 9, 2020, 11:11pm

Thanks for the clarification. Here are two motivations for providing the first option in this RFC:

We already have some vendors who only have a whitelist of ops that they want to offload to their devices. They don’t have any fusion support and requirements. With the first option, they only need the minimum effort.
Writing a custom annotation pass could be difficult for new TVM developers. We expect that a vendor new to TVM will prefer to first get the integration working and then consider performance optimization. The first option provides a very simple starting point to keep them in the loop. After they get it working and find that they need fusion to improve performance, they can learn the custom annotation approach.

masahi · January 10, 2020, 1:04am

Thanks for the clarification, I think I got it right.

But I suggest thinking more about the naming around “customization”. From the backend’s perspective, both approaches are customization in different senses: one is an op-level true/false decision, while the other is full-blown whole-graph customization requiring a visitor class. But in the current API, only the latter is referred to as “custom annotation”. I think people who haven’t looked at the PR won’t even know what we are talking about.

zhiics · January 10, 2020, 1:42am

Yeah, you are right. “Customization” is only used in this discussion for illustration. We didn’t put it in the code base because users are expected to implement that pass themselves.

zhiics · January 12, 2020, 4:53pm

Okay, let’s summarize the discussion so far.

We will use option 1 directly for the build pipeline:

mod = relay.transform.AnnotateExternalCompiler("compiler_name")(mod)
mod = relay.transform.PartitionGraph()(mod)

graph, lib, params = relay.build(mod, "llvm")

We will keep the template for annotation, and the interface will still be like the one used in the PR.

@reg.register_annotate_compiler("nn.conv2d")
def annotate_conv2d(attrs, args, compiler):
    """Check if the provided compiler should be used for conv2d.
    """
    return get_annotate_compiler(compiler, 'conv2d')(attrs, args)

We will update the PR by tomorrow.

tqchen · January 12, 2020, 8:08pm

It would be great for us to discuss the register_annotate_compiler interface a bit more. From what I see, it constitutes a very important API design decision about how to add modular op-level customization.

Given that this post focused quite a lot on the choice of frontend interface, perhaps we can open another post for it (the backend interface register_annotate_compiler).

Some initial thoughts:

In terms of the API name, I find the term annotate_compiler a bit confusing. Is it about capability?
As with the frontend API, it would be great if we could list a few alternatives, along with their APIs.

In a nutshell, the discussion seems to be about how a backend can expose its capabilities to the graph optimizer. I can see a few choices in the design space:

Name of API: it is quite important, and I still find the term register_annotate_compiler quite confusing.
A bulk interface that provides a list of all ops.
A per-op query interface, similar to the currently proposed approach. However, there can still be a few variants, e.g. encode each op under a compiler-specific prefix (dnnl.nn.conv2d) and query that one.
The query-then-select approach is quite similar to the strategy interface by @haichen; it would be nice to discuss the relation between them.
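To compare the bulk and per-op query choices side by side, here is a toy sketch, assuming attrs are plain dicts and that compiler-specific keys take the `dnnl.nn.conv2d` form mentioned above. `bulk_supported`, `query_supported`, and both tables are hypothetical, not existing TVM APIs.

```python
from typing import Any, Callable, Dict, Set

# Choice A: bulk interface, where the vendor hands over a whitelist of op names.
DNNL_OPS: Set[str] = {"nn.conv2d", "nn.relu", "add"}


def bulk_supported(op_name: str, attrs: Dict[str, Any]) -> bool:
    # Only the op name is consulted; attrs cannot influence the decision.
    return op_name in DNNL_OPS


# Choice B: per-op query keyed by a compiler-specific prefix such as
# "dnnl.nn.conv2d", so each query can inspect the call's attrs.
QUERIES: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "dnnl.nn.conv2d": lambda attrs: attrs.get("data_layout") == "NCHW",
    "dnnl.nn.relu": lambda attrs: True,
}


def query_supported(compiler: str, op_name: str, attrs: Dict[str, Any]) -> bool:
    query = QUERIES.get(f"{compiler}.{op_name}")
    return query(attrs) if query is not None else False


attrs = {"data_layout": "NHWC"}
print(bulk_supported("nn.conv2d", attrs))           # True
print(query_supported("dnnl", "nn.conv2d", attrs))  # False
```

The per-op query can reject an op based on its attributes (here, a layout), which a plain whitelist cannot express; the cost is that every supported op needs its own entry.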