Replacing Relay patterns with custom Relay ops?

Hello everyone,

I know most of the usage of Relay’s pattern language has been to create “MergeComposite” functions to offload to BYOC implementations.

  • Q1: I think I remember reading somewhere that "MergeComposite" regions are ignored by the Relay->TVM lowering process. Is this generally true, or only for regions annotated with an external compiler tag?

I want to use Relay's pattern language to do basically the same thing as the BYOC examples, but following the Relay->TVM lowering process. I looked at the Pattern Rewriting documentation, which seems to offer the "find pattern and replace with" infrastructure. One thing I realized is that the return value of the callback is a "standard" relay.op. This makes sense in the example, since the Relay program is the decomposed form of batch norm and the replacement is a known Relay operator.

In my case, I want to replace a “standard” Relay subgraph with a call to a Relay operator that I have implemented with TE (i.e. using te.compute). This would be similar to operator fusion, but I want to fuse operators that TVM won’t fuse. An illustrative example would be if I want to fuse a complete inception block.

  • Q2: Is this a misuse of the Relay-level pattern matching infrastructure? Should I be using something else?

I seem to have a problem getting it to work though, because the process of registering (?) a new Relay op is not clear to me after reading the Add Operator to Relay tutorial, especially because I thought there was a way to avoid defining the Relay operator at the C++ level.

  • Q3: Is there a way to register a new Relay operator using only the Python API?

I did a small trial-and-error script trying to reinvent the conv2d Relay node. It doesn't work, since I return a Tensor (which is produced by the compute inside my_conv2d) while the rewriter expects a Relay expression.

import tvm
from tvm import relay
from tvm.relay.dataflow_pattern import *
from tvm import te
from tvm import topi

@relay.op.register_compute("relay.op.my_conv2d")  # Q3 ==> why doesn't this work?
def my_conv2d(data, kernel):
    ishape = topi.util.get_const_tuple(data.shape)
    kshape = topi.util.get_const_tuple(kernel.shape)
    # I know this arithmetic is not complete
    oshape = (ishape[0], (ishape[1] - kshape[2]) + 1, (ishape[2] - kshape[3]) + 1, kshape[0])

    kh = te.reduce_axis((0, kshape[-1]), name='kh')
    kw = te.reduce_axis((0, kshape[-2]), name='kw')
    ic = te.reduce_axis((0, kshape[-3]), name='ic')
    res = te.compute(
        oshape,
        lambda CONV2D_N, CONV2D_H, CONV2D_W, CONV2D_C: te.sum(
            data[CONV2D_N, CONV2D_H + kh, CONV2D_W + kw, ic] *
            kernel[CONV2D_C, ic, kh, kw],
            axis=[kh, kw, ic]),
        name="res", tag="my_conv2d")

    return res

def min_relay_prog():
    x = relay.var('x', shape=(1, 224, 224, 3))
    w = relay.var('w', shape=(16, 3, 3, 3))

    conv2d = relay.op.nn.conv2d(x, w, data_layout="NHWC")
    return conv2d

class MyPattern(DFPatternCallback):
    # A callback class to rewrite the matched pattern to a my_conv2d
    def __init__(self):
        super(MyPattern, self).__init__()  # This line is missing in the tutorial
        self.pattern = self.look_for_pattern()

    def look_for_pattern(self):
        conv2d_p = is_op('nn.conv2d')(wildcard(), wildcard())
        return conv2d_p

    def callback(self, pre, post, node_map):
        # Error: "Expected RelayExpr, got Tensor" -- my_conv2d returns a TE tensor
        return my_conv2d(te.placeholder((1, 224, 224, 3)), te.placeholder((16, 3, 3, 3)))
        # return relay.op.my_conv2d(...)  # relay.op doesn't have a my_conv2d attribute


template = MyPattern()
prog = min_relay_prog()
print(template.look_for_pattern().match(prog))  # The pattern matches
from tvm.relay.dataflow_pattern import rewrite
out = rewrite(template, prog)
print(out)
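
For comparison, a callback that the rewriter does accept (because it returns a Relay expression) would look roughly like the sketch below. It only wraps the match in a function tagged with a "Composite" attribute though; the tag is just a label I made up and still doesn't point at my TE compute:

from tvm import relay
from tvm.relay.dataflow_pattern import DFPatternCallback, is_op, wildcard, rewrite

class WrapConv2d(DFPatternCallback):
    """Rewrites nn.conv2d into a call to a 'Composite'-tagged Relay function."""
    def __init__(self):
        super().__init__()
        self.data = wildcard()
        self.weight = wildcard()
        self.pattern = is_op("nn.conv2d")(self.data, self.weight)

    def callback(self, pre, post, node_map):
        data = node_map[self.data][0]
        weight = node_map[self.weight][0]
        # Rebuild the matched region with fresh parameters and wrap it in a function.
        d = relay.var("d", shape=(1, 224, 224, 3))  # shapes hard-coded for this sketch
        w = relay.var("w", shape=(16, 3, 3, 3))
        fn = relay.Function([d, w], relay.nn.conv2d(d, w, data_layout="NHWC"))
        fn = fn.with_attr("Composite", "my_conv2d")  # just a label, not a registered op
        return relay.Call(fn, [data, weight])

x = relay.var("x", shape=(1, 224, 224, 3))
w = relay.var("w", shape=(16, 3, 3, 3))
print(rewrite(WrapConv2d(), relay.nn.conv2d(x, w, data_layout="NHWC")))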
  • Q5: In the tutorial, the callback method contains the expression return relay.op.nn.batch_norm(x, gamma, beta, mean, var, epsilon=eps.data.asnumpy().item())[0]. Why do we need the [0], and how would we handle multi-output Relay operators?

Just in case you missed it: in __init__() we have to call the superclass constructor. Otherwise another error is raised about missing attributes (in this case require_datatype).

  • Q6: What is this require_datatype attribute and when would we need to set it to something specific?

Sorry for the long post.

For Q1 and Q2, it seems to me that what you really want is to improve fusion at the TIR level. It's not the same case as MergeComposite, because BYOC offloads Relay graphs to a customized codegen without lowering them. Even if you use the Relay pattern language to partition matched subgraphs into Relay sub-functions, you still need a mechanism to recognize which pattern you matched at the TIR level, so this flow seems redundant.

A more straightforward solution is, as you mentioned, to register a new Relay op and replace matched subgraphs with that op at the Relay level, similar to the idea of combining parallel ops (https://github.com/apache/incubator-tvm/pull/5791).

For Q3, I have no experience of registering a new Relay op yet.

(Q4 is missing?)

For Q5, the reason you need [0] is that the output of batch_norm is a 3-element tuple; the other two elements are the moving mean and moving variance. This also answers your second question: multiple outputs are grouped into a tuple node, so the output of Relay ops and functions is always a single node.
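
You can see this directly (just an illustration):

from tvm import relay

x = relay.var("x", shape=(1, 16, 32, 32))
gamma = relay.var("gamma", shape=(16,))
beta = relay.var("beta", shape=(16,))
mean = relay.var("mean", shape=(16,))
var = relay.var("var", shape=(16,))

bn = relay.nn.batch_norm(x, gamma, beta, mean, var)
print(type(bn))   # a TupleWrapper around the 3-output call
out = bn[0]       # becomes TupleGetItem(call, 0): the normalized data
print(out)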

Thanks for taking the time to reply. You are right, Q4 was missing.

Yes, this is correct. There are some operator chains which are not fused by the "standard" fusion algorithm in TVM, and I want to investigate how they would perform. I thought about implementing another fusion pass, but decided to try the pattern language functionality first.

If I remember correctly, there is an attribute of the merged composite function which states the pattern it was merged from. Is this what you mean by "mechanism to …"?

Will you be looking into this? I have another idea, but it doesn’t sound solid in my head.

  • Q7: Technically, the return value of the callback is a TupleGetItemNode object. The first entry is a CallNode, while the second is the index of the output we want to "get". Meaning that if I can create a CallNode from Python with my operator, then I can recreate the data structure I am supposed to return (see the sketch below). This doesn't necessarily require me to register it to Relay. The only thing I am missing is a way to link that CallNode to its corresponding compute and schedule rule. But that also doesn't require Relay registration, does it?
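
For instance, something along these lines builds the raw nodes from Python (a sketch; it leaves out the op attributes and, more importantly, the link to a compute/schedule):

from tvm import relay

x = relay.var("x", shape=(1, 16, 32, 32))
g = relay.var("g", shape=(16,))
b = relay.var("b", shape=(16,))
m = relay.var("m", shape=(16,))
v = relay.var("v", shape=(16,))

bn_op = relay.op.get("nn.batch_norm")      # look up an already-registered Op node
call = relay.Call(bn_op, [x, g, b, m, v])  # raw CallNode (attrs omitted here)
first = relay.TupleGetItem(call, 0)        # pick output 0 of the 3-output tuple
print(type(call), type(first))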

A fallback method I would try is to reuse MergeComposite and "trick" TVM by offloading the composite into another TVM lowering phase and returning what it would expect from an "external codegen".

  • Q8: Do you think this makes sense? (or at least more than the point above?)

Note that the purpose of the current TIR fusion is to reduce memory transactions rather than to aggregate the computation. Thus, TIR fusion doesn't have to (and cannot) replace a pattern with an op; it just puts the ops into one TIR function to indicate that the intermediate tensors won't be used by others and can safely be thrown away after the TIR function finishes.

To achieve your goal, you can get the matching information by looking at the composite function attribute as you pointed out, but this does not solve the problem in a concise way. It seems to me that if you have a new fused op that can replace a Relay subgraph, it's more like a graph optimization and should be done at the Relay level.
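
For example, reading that attribute back after MergeComposite could look roughly like this (a sketch; the helper name is made up):

from tvm import relay

def collect_composite_names(mod):
    """Collect the 'Composite' pattern names of merged functions in an IRModule."""
    names = []

    def visit(expr):
        # MergeComposite produces calls whose callee is a Relay function
        # carrying a "Composite" attribute with the pattern name.
        if isinstance(expr, relay.Call) and isinstance(expr.op, relay.Function):
            attrs = expr.op.attrs
            if attrs and "Composite" in attrs.keys():
                names.append(str(attrs["Composite"]))

    relay.analysis.post_order_visit(mod["main"], visit)
    return names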

cc @mbrookhart

So if I understand correctly, operator fusion creates a multistage operator. At this point it's still TE, and the stages will then be lowered to TIR. Is this still correct?

You are right. Having a multistage operator doesn't mean that the schedule has any compute_at statements, which are what actually create a fused schedule with smaller "code distances" between producers and consumers. I guess even if I had a way to solve it with MergeComposite, it would still require a way to signal that I want to insert compute_at statements.

I think I agree with you. This would require me to actually implement this fused op as a "native" Relay operator, register it, and implement its TOPI link to a schedule which has a different structure than the "fused op" staged variant produced by the current fusion pass.

I still wonder if there is a way to do this via MergeComposite, since technically the "from pattern" information is there. From this information I could just build an IRModule (or is it a PackedFunc?) directly from tvm.lower(my_fused_schedule). I am not sure this is easier, since I don't know exactly where in the relay.build() process I would need to intervene.

I guess it's possible with the unified IR (not 100% sure though).

Could you be more precise? I know there is an effort to unify the IRs, but I don't see how this would help.

Can I infer, since your answer was short, that I didn't have any huge mistakes in my previous post? (especially this part)

I just want to point out that the VTA example does something which kind of replicates what I would like to do.

Although this is technically the schedule for the conv2d operator, they know that they have a certain set of operators which have been fused into a multistage operator. Therefore they can traverse the stages and log them, and then use compute_at statements which actually "fuse" the schedules.

If the fusion pass were able to create a multistage operator from the chain of Relay nodes that I want to investigate, then I guess I could do it this way. The sad part is that the fusion pass (as it is implemented now) would never create this multistage operator from my pattern.

I agree with @comaniac that the simplest way to achieve this functionality would be to convert the inception block into its own operator; however, I feel we should avoid setting a precedent for that, to avoid bloat in Relay.

I remember it being discussed at one point that the introduction of the pattern language could allow us to rewrite the FuseOps pass to be more flexible, by expressing the fusion rules as patterns. The question then is whether you would be able to perform the fusion you want using just the scheduling primitives in TE on the relevant subgraph. If so, this seems like the best solution.

Yeah, I guess nothing stops me from at least trying it out in a private branch and just playing with it.

I kind of remember this also. To be honest, the FuseOps pass is (to me) difficult to understand by reading the source code.

@zhiics any comments on rewriting the FuseOps transform so that one could define a Relay-level pattern which would generate a multistage "fused op" when found in the workload?

I have already done this, yes. The type of schedule I want to generate can be designed with TE expressions.

Catching up, sorry. I think the fusions generally use the schedule of the base op and assume something like conv or gemm + element-wise ops. Perhaps @thierry has more ideas on this? I'm not super familiar with more complicated fusions at the TIR/schedule level.

I tried catching up with the conversation. As I understand it (correct me if I'm wrong), the challenge here is that we are targeting an architecture that requires certain ops to be fused in their TIR-level definition (e.g. a whole inception block). This is not unlike what we had to do for VTA, which had some given restrictions. For instance, with VTA you can't just offload a conv2d operator: it has to be followed by a set of e-wise ops, clip/cast, before the results are sent out. This led us to write the schedule declaration in a way that required "fusing the schedules" manually.

So I agree with the overall solution as proposed by @mbaret and @comaniac, which consists in introducing a new Relay op that has its own fused schedule implementation, as we did with VTA.


The way I understand the FuseOps pass is that it creates, from a chain of Relay operators, a multistage operator to be lowered to TIR:

//Before FuseOps
op1(args_op1){
  //TIR of op1
}
op2(args_op2){
  //TIR of op2
}
//Note: these would be handled by 2 different lowering processes (i.e. 2 operators, each with a single stage)

//After FuseOps
op1_op2([args_op1, args_op2]){
  //TIR of op1
  //TIR of op2
}
//Note: these would be handled by 1 lowering process (i.e. 1 operator with 2 stages)
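
You can see this effect by running the pass directly (a small sketch of mine, not from the docs):

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 16, 32, 32))
w = relay.var("w", shape=(16, 16, 3, 3))
y = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))

mod = tvm.IRModule.from_expr(relay.Function([x, w], y))
mod = relay.transform.InferType()(mod)
mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(mod)  # conv2d + relu end up in one primitive function ("stacked", not melted)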

In some literature this is called "operator stacking". I think the terminology is very transparent about what it does: it literally creates one function which is the stacked version of each operator, with no real melting of the schedules.

In the TVM tech report, the following is stated (and I assume the current FuseOps implementation does the same):

We provide generic rules to fuse these operators, as follows. Multiple injective operators can be fused into another injective operator. A reduction operator can be fused with input injective operators (e.g., fuse scale and sum). Operators such as conv2d are complex-out-fusable, and we can fuse element-wise operators to its output.

In the VTA example, as Thierry said, the conv2d "has to be followed by a set of e-wise ops, clip/cast, before the results are sent out".

At least the clip/cast are inserted by the quantization process of the network (or at least that is the case in the deploy_detection tutorial).

Since these ops are e-wise, and the FuseOps routine can fuse conv2d with following e-wise operators, they get a "stacked" version of the conv2d + following e-wise ops when lowering the conv2d Relay operator, and then create a schedule like:

conv2d_ewise([args_conv2d,args_e_wise1,...]){
//TIR of Intertile Conv2d
    //The indentation is to illustrate that the next TIRs are in subscopes (i.e. inside {}) of the intertile Conv2d
    //TIR of DMA loads
    //TIR of Intratile Conv2d
    //TIR of Intratile e_wise1
    //TIR of Intratile e_wise2
    //...
    //TIR of DMA stores
}

They do this with compute_at() statements in their conv2d schedule. But again, these "stages" are only available here because FuseOps supports this kind of fusion.
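
As a toy illustration of the effect of compute_at at the TE level (my own sketch, not the actual VTA code):

import tvm
from tvm import te

n = 64
A = te.placeholder((n, n), name="A")
B = te.compute((n, n), lambda i, j: A[i, j] * 2.0, name="B")   # producer stage
C = te.compute((n, n), lambda i, j: B[i, j] + 1.0, name="C")   # e-wise consumer

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=8)
# Melt the producer into the consumer's tile loop instead of materializing all of B
s[B].compute_at(s[C], xo)
print(tvm.lower(s, [A, C], simple_mode=True))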

If FuseOps were not able to fuse conv2d with following e-wise operators, then the VTA implementation would have had to either:

  • O1: “Inject” the e-wise ops into the conv2d schedules when lowering to TIR. This would be a problem if they required some variables available at Relay level which are not part of the conv2d argument list.
  • O2: Create Relay ops which represent the fused operators (like those produced by the current FuseOps) and do Relay-level graph rewrites, which is the suggestion most here are leaning towards.

My case is similar to the above problem: the chain of Relay operators that I want to first "stack" and then actually "melt" using compute_at is not supported by FuseOps.

My problems with defining Relay operators which are the "stacked" versions of already available Relay operators are:

  • P1: I don’t think the Relay documentation available for adding an operator is sufficient for me to actually do it
  • P2: These new Relay operators would(?) have to be tested in order to make sure that all Relay passes still function as expected. Note that this Relay-level rewrite would need to happen at a similar level as the FuseOps pass.
  • P3: These new Relay operators would most likely be HW-dependent and not necessarily portable. But I guess this could be handled by enabling HW-dependent Relay rewrites.

I think a more elegant solution would be to allow a definition of "stacked operators" at the Relay level: basically, every MergeComposite would create the "stacked" multistage operator. If the merged composite is given to an external codegen, it doesn't use this "multistage" information; but if it's given to the TIR lowering process, a specific target can use it.
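
Concretely, I am imagining something like the following, where the pattern table already carries the "from pattern" name that a target-specific lowering could key on (a sketch; the names are made up):

from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

def my_block_pattern():
    conv = is_op("nn.conv2d")(wildcard(), wildcard())
    return is_op("nn.relu")(conv)

# (name, pattern) entries: MergeComposite wraps every match into a Relay function
# whose "Composite" attribute holds the name.
pattern_table = [("my_target.conv2d_relu", my_block_pattern())]
merge = relay.transform.MergeComposite(pattern_table)
# mod = merge(mod)  # applied to an IRModule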

@tqchen would you mind giving your thoughts?

@cron did you get this to work by any chance? :slight_smile: I am curious because I want to achieve something similar on an NVIDIA GPU. In my case, I have a small batched matmul in the middle of two sequences of e-wise and reduction ops. It would be interesting to know the tuning performance when fusing all the ops into one kernel. Any hope here?

Hey,

No, sorry. I gave up on the idea since I couldn't define a Relay operator which would work given the many passes that TVM runs at this level.

Hi cron,

I am curious about the real trouble here: was it defining a customized Relay op, or getting around the TVM passes to match the target pattern?

Thanks!

Defining a Relay operator which did not break the Relay-level optimizations was the problem.

Hi, did you find a solution in the meantime? I have a similar problem: I have customized an op in TE and want to put it into a network imported from PyTorch. Is there an easy way to register a Relay op?