[RFC] Op based annotation for external codegen

Yes, they are not placeholders for tensors but literally “placeholders” for the API and argument names we will fill in based on this discussion.

I suggest you replace these “placeholders” with some temporary names that have real meanings. It will help people understand the proposal and choose which one is better.

Thanks for the suggestion. Updated.

@tqchen @haichen @jonso @janimesh would you help vote on the proposals (#17) and share your thoughts? We would like to reach a conclusion and send a PR tomorrow.

Everyone else is also welcome to vote and comment. Thanks.

It would be great to discuss alternative naming conventions. In particular, what will happen if we want to support hooking up multiple external compiler annotations (say dnnl and xyzlib)? The approach so far seems to be tightly coupled to a single compiler extension point.

Here is one strawman I can think of which gets around it a bit:

# register a dnnl specific target
reg.register("nn.conv2d", "target.dnnl", dnnl_conv2d_checker)
# register xyzlib specific logic
reg.register("nn.conv2d", "target.xyzlib", xyzlib_conv2d_checker)

mod = relay.transform.AnnotateTarget(target=["dnnl", "xyzlib"])(mod)

One of the important design principles of the composable pass infra is that all APIs should follow the “IRModule in, IRModule out” style and expose the compilation result as a Pass.

While it is tempting to introduce new API functions like build_extern (E1) or push the necessary transformations into the build_module (E2) function, I think we should keep this principle in mind and expose the compiler annotation as a separate (possibly composite) pass, as with all APIs exposed under relay::transform.

@tqchen Thanks for the comment. I personally also thought about supporting multiple compiler annotations. One of the major reasons I chose a single target was that it might be hard to decide which compiler we should use when ops are supported by more than one of them. For example, basic ops such as add and sub would be supported by all of them. We would need to make a decision when annotating in order to achieve good performance (probably by reducing the number of segments and the data movement between accelerators through the host).

But I agree it might be better to take a list at this point since it provides a superset of the functionality. We can propose to solve the interesting optimal-annotation problem later.

For the build_extern API, we are still implementing each pass (i.e., AnnotateCompiler) as IRModule -> IRModule. It is just an API used to run these IRModule -> IRModule passes together. build_extern should return a module that will be taken by relay.build. I personally don’t think build_extern is a good name because it does not actually compile anything but just performs some IRModule -> IRModule transformations. Do you have any concerns/comments on this API, including its naming?
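
For illustration, the intended flow would roughly be the following (build_extern’s exact signature is still open, so the call below is only a placeholder):

from tvm import relay

# Hypothetical usage: build_extern only rewrites the IRModule; the actual
# compilation still happens in relay.build afterwards.
mod = build_extern(mod, "dnnl")
graph_json, lib, params = relay.build(mod, target="llvm")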

Thanks for the clarifications. This discussion thread touches on quite a few key design principles, which I try to summarize as follows:

P0: Every (composite) IRModule transformation should be a Pass

As an analogy, in deep learning frameworks we tend to design every tensor transformation as an nn.Module or a composite Layer. The build_extern function is certainly an IRModule transformation; however, it is not presented as a composite Pass.

The main advantage of presenting everything as a Pass is uniformity and further composition of the passes themselves. For example, we could push BuildExtern (ignoring the naming issue here; we need a better name than “extern”) into another Sequential, which combines it with further optimizations.

The key insight is that having everything as a pass encourages uniformity and opens up chances for further improvements. For example, we could develop a meta pass that tries a list of possible transformations, or searches over them.

# Try each pass in order
opt_pass = transform.TrySequential(
   [BuildExtern("dnnl"), NormalPipeline(), BuildExtern("xyzlib")])
mod = opt_pass(mod)
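
As an illustration of that meta-pass idea, here is a minimal sketch built on the module_pass decorator (TrySequential does not exist yet; the is_better callback and the selection logic are assumptions for illustration):

import tvm

def try_sequential(candidates, is_better):
    # Hypothetical meta pass: run every candidate pipeline (each one is an
    # IRModule -> IRModule pass, e.g. a Sequential) and keep the result that
    # is_better prefers, e.g. the one with a lower estimated cost.
    @tvm.transform.module_pass(opt_level=0)
    def _meta(mod, ctx):
        best = None
        for pipeline in candidates:
            out = pipeline(mod)
            if best is None or is_better(out, best):
                best = out
        return best
    return _meta

A transform.TrySequential built along these lines could then be used exactly as in the snippet above.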

P1: Always decouple target capability from dispatch strategy

Different target backends are likely going to be implemented by different people.

  • Target capability means “target X can do Y”. For example, “dnnl can dispatch conv2d”.
  • Strategy means “what to do given capabilities”. For example, in the case of external compilation, try to dispatch to dnnl first.

The current registration API mixes the two. This means that for each type of backend, developers have to customize the same set of functions and cannot mix and match target compilation capabilities. If we want a strategy that mixes xyzlib and dnnl in the future, we would have to implement another xyz_plus_dnnl compiler somewhere.

The following strawman, on the other hand, allows each target’s capability registration to sit in its own backend folder. Note that I am not pushing for this specific convention, but would like to point out that this decoupling principle is important and should be taken seriously.

# register a dnnl specific target
reg.register("nn.conv2d", "target.dnnl", dnnl_conv2d_checker)
# register xyzlib specific logic
reg.register("nn.conv2d", "target.xyzlib", xyzlib_conv2d_checker)

In terms of the ["dnnl", "xyzlib"] example, I did not mean that we need the best strategy now (there could be multiple in the future); instead, we should design the target capability registration to be simple enough that we can easily add new strategies later. This principle also relates to @haichen’s AutoTVM strategy design.
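
To make the decoupling concrete, here is a hedged sketch of one possible strategy built on top of such a capability registry: walk a priority-ordered target list and pick the first target whose registered checker accepts the call. The function name is made up for illustration, and it assumes checkers are registered under the "target.<name>" op attribute as in the strawman above.

def pick_target(call, targets):
    # Greedy strategy sketch: return the first target whose checker accepts
    # this call node, or None to keep it on the normal compilation flow.
    for target in targets:  # priority order, e.g. ["dnnl", "xyzlib"]
        checker = call.op.get_attr("target." + target)
        if checker is not None and checker(call.attrs, call.args):
            return target
    return None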

Meta-P2: Customization as a Natural Consequence of Existing Design

This point could be a bit controversial. In particular, I think we should try our best to eliminate terms like “customization”, “extension”, or “extern”. Let me clarify what I mean.

In the case of bring-your-own-codegen, what we are really doing is designing a Target that has a specific compilation flow down to runtime.Module. A target-specific compilation flow is a concept that can be designed internally to the project. Bring-your-own-codegen then simply becomes a way to define a new Target and its compilation flow. We could also introduce a separate concept besides Target if we think we need something else.

The name extern usually results in a single point of extension (“here is a plugin API where you can put your rules”). A good infrastructure design usually allows multiple capabilities to be mixed and matched, and brings the normal compilation flow and the custom one under a unified scheme.

In the case of build_extern, this is a specific function that provides a new feature. However, expressing it through an existing, known concept (a Pass) will provide more uniformity and advantages for the project.

To summarize, these customization points should be a natural consequence of the existing infra. If the existing infra does not handle something well, we should think about whether we can improve it to eventually encapsulate the things we need.

Note that this is a meta point, and we do not have to act on it now, but it would be great if we can collectively think about how we can improve on “extern” and “custom” as much as possible as part of the development process.

Thanks for the summary. Accordingly, here is the updated annotation mechanism, and we will start working on a PR.

Updated E.2: Register a checker function under the “target.<compiler_name>” attribute.

# In relay/op/contrib/dnnl.py
reg.register("nn.conv2d", "target.dnnl", lambda attrs, args: True)
# In relay/op/contrib/xyzlib.py
reg.register("nn.conv2d", "target.xyzlib", conv2d_checker)

In addition, we agree with you about the meta pass. Updated A.1:

mod = tvm.IRModule()
mod["main"] = relay_func  # a Relay function

def BuildExtern(external_compiler, fuse_patterns):
    return transform.Sequential([MergeComposite(fuse_patterns),
                                 AnnotateExternTarget(external_compiler),
                                 PartitionGraph()])

opt_pass = relay.transform.TrySequential([BuildExtern("dnnl", dnnl_patterns),
                                          NormalPipeline(),
                                          BuildExtern("xyzlib", xyz_patterns)])
mod = opt_pass(mod)

We will then prepare the build-extern sequence after we reach consensus on its name, and plan the meta sequential. We will also open an issue on GitHub to track the progress.

Name Voting (pick at most 3):

  • PreOptimize
  • TargetPacketize
  • TargetPartition
  • TargetEncapsulate
  • SubgraphEncapsulate
  • others (please specify)

It would be really nice if we could brainstorm some alternatives to “BuildExtern” :slight_smile:

How about PreOptimize, TargetPacketize (e.g., LLVM has a VLIW Packetizer), or Target/Subgraph Encapsulate (similarly used in XLA)?

I agree with Tianqi that we’re really just defining a flexible way to register and compile code for new targets. This pass is a way for us to run the graph in heterogeneous mode across the new targets.

@comaniac, could you also provide an example of how a composite function would be registered?

As for naming, my vote is for something along the lines of Target/SubgraphPartition or Target/SubgraphEncapsulate.

I suppose the composite function you mean is the pattern for the composite merge pass to match? If so, then it is exactly the fuse_patterns in the example above.

Sorry, I meant registering it so that the AnnotateExternal pass properly annotates the generated composite function. I’ll wait to see the implementation in the PR :slight_smile:

Oh I see. This is related to my response #6: [RFC] Op based annotation for external codegen.

In short, you cannot register a composite function for now, because we haven’t provided custom op support. Instead, our annotation pass will visit not only call nodes but also function nodes. When the pass visits a function, it checks the composite attribute on that function and determines whether it should be annotated.
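
For concreteness, here is a rough sketch of what that function visit could look like (this is not the actual PR code; the class name and the supported-pattern set are illustrative):

from tvm.relay.expr_functor import ExprMutator

class CompositeAwareAnnotator(ExprMutator):
    def __init__(self, compiler, supported_patterns):
        super().__init__()
        self.compiler = compiler
        self.supported_patterns = supported_patterns  # e.g. {"dnnl.conv2d_bias_relu"}

    def visit_function(self, fn):
        # Besides call nodes, also inspect Function nodes: a function produced
        # by MergeComposite carries a "Composite" attribute naming its pattern.
        if fn.attrs and "Composite" in fn.attrs:
            pattern = str(fn.attrs["Composite"])
            if pattern in self.supported_patterns:
                pass  # mark this function for self.compiler (annotation elided)
        return super().visit_function(fn)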

@tqchen @haichen @mbaret @janimesh @masahi any ideas on the naming? Thanks.

Since we are converging on the registration API, we will send a PR for it first and use Sequential directly so that we can be unblocked. The BuildExtern alternative API change should be a separate PR because we also need to support composite functions.

Hi,

Finally, I was able to go through this discussion since this is a very relevant topic for me at the moment.

My understanding of this discussion is that op-based annotation will allow offloading via external codegen of 1) all ops of a given type (e.g., all Conv2D ops), or 2) all sets of ops that match a given pattern. However, I was wondering whether in the future the annotation granularity could go beyond the op type level, so that instead of offloading every Conv2D op in a model we offload only the specific Conv2D ops whose size makes offloading worthwhile. I am bringing this up because, for heterogeneous computing to be effective, the offloading overhead obviously has to be amortized by fast execution on the accelerator; otherwise it is better to execute the operator on the host. There could therefore be cases in which it is not profitable to offload a given convolution due to its configuration, or due to characteristics of the target hardware (e.g., limited memory for handling a big convolution). For now, having an easy-to-use annotation mechanism for specific op types is great, but more flexibility in the future could be beneficial to enable a more fine-grained accelerator offloading approach.

I think this method should work for what you want, although maybe we could make it a bit clearer.

This annotation method

@reg.register("nn.batch_norm", "FTVMExternalCompiler")
def batch_norm(attrs, args, compiler):
    if compiler != "dnnl":
        return False
    return True  # placeholder: check attrs and args here

allows you to return True or False based not just on the compiler, but also on the attrs and args. You could query these to see the tensor shapes and add a check saying the op is not supported if the input tensors are too large (or other things, like whether the kernel size is supported).
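
For example, here is a hedged sketch of such a size check (the 1 MB fp32 budget, the groups check, and the reliance on checked_type being populated by type inference are all illustrative assumptions):

@reg.register("nn.conv2d", "FTVMExternalCompiler")
def conv2d(attrs, args, compiler):
    if compiler != "dnnl":
        return False
    # Reject convolutions whose input is too large to be worth offloading.
    data_shape = [int(dim) for dim in args[0].checked_type.shape]
    num_elements = 1
    for dim in data_shape:
        num_elements *= dim
    if num_elements * 4 > 1 << 20:  # assumed 1 MB budget for fp32 inputs
        return False
    return int(attrs.groups) == 1  # e.g. also skip grouped convolutions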

It would be great to consider this in the context of [RFC][Relay] Program Matching for Relay Pt 1: A pattern language

With the new pattern language, we could hugely simplify MergeComposite. There’s potentially an argument that we wouldn’t need the pass at all if we treated everything as patterns in the annotation pass (with single-op patterns where necessary). I don’t think we should block progress waiting for it, but we’ll definitely have to consider where it can simplify the implementation when it arrives.
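
For example (using the pattern-language API proposed in that RFC; the exact names may still change), a fused conv2d + bias + relu composite and a single-op pattern could both be expressed directly as patterns:

from tvm.relay.dataflow_pattern import is_op, wildcard

# A fused pattern that the external compiler supports as one unit.
conv_bias_relu = is_op("nn.relu")(
    is_op("nn.bias_add")(
        is_op("nn.conv2d")(wildcard(), wildcard()),
        wildcard()))

# A single-op pattern for ops the external compiler supports on their own.
single_conv2d = is_op("nn.conv2d")(wildcard(), wildcard())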