[BYOC][Layout conversion] How to use Convert layout pass with BYOC flow?

Hi, suppose I have a set of external functions created with the BYOC flow. Currently I disable AlterOpLayout pass, so everything is in the NCHW layout. I want to achieve the following:

  • All extern functions take and output NCHWc values
  • Ops that fall back to the CPU should also use and keep the NCHWc layout as much as possible, using the existing infra in Relay and topi.

How should I do this? Since AlterOpLayout seems to be something to be used with topi, I think this is not what I'm looking for. There is also the ConvertLayout pass, but from briefly looking at its interface it seems to work in an op-by-op fashion.

I want to be able to say

  • For all calls to functions marked with a particular compiler, treat them as if they were a single op that takes and outputs the NCHWc layout
  • Other ops should be layout-transformed according to AlterOpLayout

@anijain2305 @zhiics @mbaret @lhutton1

What happens when you enable AlterOpLayout? I believe functions marked with a particular compiler should now be skipped; if so, that would solve point 2.

In terms of converting the layout for a BYOC function, I don't think a pass exists currently that would do that. ConvertLayout inserts layout transforms as and when needed, whereas this would need layout transforms inserted at the boundaries of the function.


I tried enabling AlterOpLayout on resnet, but no layout transform ops were inserted, probably because all convolution ops are sent to extern functions.

Thanks this is also my impression. I wanted to be sure before I go ahead implementing a custom pass.

No problem, I look forward to seeing it :)

I hope the following could work:

  • Add src and dst layouts as new attributes to Function (a rough sketch follows this list)

  • Extend InferCorrectLayout, which currently assumes call->op is an Op, to handle Function too

  • Similarly, extend AlterOpLayout to handle Function. This requires passing a user-defined Function -> Function rewriter that decorates a Function with src and dst layouts, in place of the falter_layout callback used for op conversion.
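
For the first bullet, a minimal sketch of what the decoration could look like, assuming hypothetical attribute names src_layout/dst_layout (these are not existing Relay attributes; only the Compiler attribute and with_attr are existing API):

from tvm import relay

# Toy stand-in for a partitioned extern function.
x = relay.var("x", shape=(1, 32, 28, 28))
extern_func = relay.Function([x], relay.nn.relu(x))
extern_func = extern_func.with_attr("Compiler", "my_codegen")

# Hypothetical layout attributes a layout pass could read, so that a call to
# this function can be treated like an op with known input/output layouts.
extern_func = extern_func.with_attr("src_layout", "NCHW8c")
extern_func = extern_func.with_attr("dst_layout", "NCHW8c")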

cc @anijain2305

We intentionally disable the Relay optimization pipeline on the functions annotated with compiler because 1) external compilers/libraries may have their own optimization pipelines, and 2) it is hard to know which one should be applied to which backend. However, the current optimizations should still apply to the parts that fall back to TVM.

In fact, I think we can still apply general Relay optimizations inside the external codegen before we emit code. This way, we allow vendors to leverage the Relay passes that they think would be helpful. For example, we can customize an optimization pipeline in codegen.cc on the received function.

Yes that’s possible after #5615

For example, in our ACL codegen we currently use ConvertLayout on ACL functions using something like this:

runtime::Module ACLCompiler(const ObjectRef& ref) {
  if (ref->IsInstance<IRModuleNode>()) {
    IRModule mod = Downcast<IRModule>(ref);
    // Pre-process the partitioned module before generating code for it.
    IRModule preprocessed_module = transform::ConvertLayout(...)(mod);
    return Codegen(preprocessed_module);
  }
}

TVM_REGISTER_GLOBAL("relay.ext.acl").set_body_typed(ACLCompiler);

Hi Masahi,

I think Zhi described it well. 3rd party vendors can pick whatever transforms they want for their HW. So, you can run ConvertLayout on extern functions.

But, as you saw, it might not help for your use case - the NCHWc layout. ConvertLayout today converts the layout of conv2d and then adapts the rest of the ops to the new layout. If your extern function does not have conv2d, ConvertLayout won't work. I am not sure if your proposal of extending AlterOpLayout and ConvertLayout to functions will solve this problem.

Additionally, I don't think the problem can be solved just by inserting transforms at the beginning and end. If there are reduce ops in the graph, or pad, or anything that has an axis attribute, the graph will need a transformation.

To be clear, I’m not trying to run ConvertLayout or AlterOpLayout inside extern functions. On the contrary, I want TVM to treat my external functions as black boxes.

I’m talking about the interface between TVM and external functions. For example, if my main module is something like this:

def main(...)
    %1 = ex_fun1(%0) // some complicated extern func
    %2 = max_pooling(%1) // falls back to CPU
    %3 = ex_fun2(%2) // another extern_func
    %4 = global_pool(%3) // falls back to CPU
    %5 = dense(%4)

I want to turn it into something like this. TVM doesn't or shouldn't look inside extern functions.

def main(...)
    %1 = layout_transform(%0) // NCHW -> NCHW8c
    %2 = ex_fun1(%1) // Now takes and outputs NCHW8c layout
    %3 = max_pooling(%2) // falls back to TVM, stay NCHW8c
    %4 = ex_fun2(%3)
    %5 = global_pool(%4) // falls back to CPU
    %6 = layout_transform(%5) // NCHW8c -> NCHW
    %7 = dense(%6)

Of course, to do that I need to specify for each extern functions their desired src and dst layout to/from TVM.

I see your point. We could discuss how to make it work for your case. Before that, it would be great if you could share the motivation for this support. Is it because you want the fallback ops to benefit from the performance of NCHWc? Or would your codegen perform better by taking the NCHWc layout?

For the former use case, we did assume the external codegen could process all conv ops well because that’s the most obvious performance bottleneck and usually the motivation of leveraging accelerators. As a result, NCHWc does not benefit the fallback ops.

For the latter use case, our impression from BYOC codegen developers is that they wish the interface could be as simple as possible, because they usually have an end-to-end compilation flow that accepts a model (i.e., a graph). Since NCHWc is a special layout that is only used by TVM (at least it is not exposed to end users), I would suggest embedding the layout transform from NCHW to NCHWc inside the external codegen. After all, AutoTVM cannot help determine which c is the best for the external codegen.

Yes, this is what is happening inside ex_fun1 and ex_fun2 right now. Inside my extern functions, I operate entirely on the NCHWc layout.

The motivation is very simple. It is to remove the unnecessary layout transforms at the beginning and end of each of my extern functions. If TVM (the host side) also operates on the NCHWc layout, it should be possible to keep the NCHWc layout end to end.

The fallback ops are supposed to be cheap, so I don't really care whether the NCHWc layout would improve their perf or what AutoTVM does to fallback ops. You can assume that I know the right inner c dim, so it is not a responsibility of TVM to figure out this value.

If you think about how layout transform works in TVM (InferCorrectLayout, plus a way to turn an NCHW op into an NCHWc op, e.g. conv2d -> conv2d_NCHWc) and look at my module above again,

I think the only new requirement is for TVM to know what the src and dst layouts of the callee functions are. The layout transform pass can treat these extern func calls just like any other ops, as long as the src and dst layouts are specified. Then InferCorrectLayout should be able to infer the right layout constraints between ops (and extern functions), and nn.layout_transform would be inserted at the right places.

So it is really a simple generalization of existing passes. The new requirement is to handle the case when call->op is not Op, but Function. Does this make sense? cc @anijain2305

I hope by now the possible solution I outlined earlier also makes sense.

In that case, does it make sense to run ConvertLayout first before calling External Codegen Partitioner?

Your proposal looks like an example of global optimization after partitioning is done. The current infrastructure, IIUC, supports local optimizations for subgraphs, assuming that we don't need to pass any information from external subgraphs to the fallback portion. In your case, your external function realizes that it needs the NCHWc layout and wants to propagate that information to the fallback graph, hoist/sink the layout transforms from the external func to the fallback graph, and then possibly optimize them away.

If we are ok with this passing of information from the external funcs back to the fallback graph, your proposal makes sense. I have not worked heavily on the external codegen portion of TVM, so maybe @comaniac @zhiics can comment on it.

@masahi thanks for the clarification. Now I think I fully understand this issue.

While I agree with the 2nd and 3rd steps in your proposal, I am concerned about the 1st step. External functions may generate the output tensor in NCHWc layout, so it makes sense to somehow change the output shape of external functions. However, the function still takes the NCHW layout from TVM/Relay's point of view, because, like you mentioned, the transform from NCHW to NCHWc happens inside the external function. It's weird for me to see a function that takes NCHW input with an attribute saying the input layout is NCHW8c.

From the execution pipeline's perspective, like @anijain2305 mentioned, we have no idea about the desired layout before running the external codegen, and the external codegen is only allowed to visit, not mutate, the partitioned function. Combining those points, the flow would be a bit weird:

mod = PartitionGraph()(mod) # Have no idea about what layout to be used.
mod = AlterOpLayout()(mod) # Have no idea about what layout to be used.
relay.build(mod) # External codegen is invoked so we know the layout.

IMHO, the direction we can consider is how to provide the layout information required by the external codegen. For example, if we somehow know that all external functions with compiler=dnnl take and output the NCHW8c layout, then we may have the following pipeline:

mod = MergeComposite(dnnl_patterns)(mod)
mod = AnnotateTarget('dnnl')(mod)
mod = MergeCompilerRegion()(mod)
mod = PartitionGraph()(mod)
mod = AlterFuncLayout(data='NCHW8c', output='NCHW8c')(mod)
relay.build(mod)

Of course the details of AlterFuncLayout can be further discussed. Another option is simply letting the developers write a custom pass to deal with the desired layouts.
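
To make the custom-pass option concrete, here is a rough sketch (not an existing TVM pass). It assumes the partitioned extern functions are module-level functions carrying a Compiler attribute and that each takes and returns a single 4-D NCHW tensor; the name WrapExternCalls and the NCHW8c choice are made up for illustration, and back-to-back inverse transforms between two adjacent extern calls would still need a follow-up simplification to cancel.

import tvm
from tvm import relay
from tvm.relay.expr_functor import ExprMutator

@relay.transform.function_pass(opt_level=0)
class WrapExternCalls:
    """Insert layout_transform ops around calls to extern functions."""

    def transform_function(self, func, mod, ctx):
        # Leave the extern functions themselves untouched.
        if func.attrs and "Compiler" in func.attrs:
            return func

        class Wrapper(ExprMutator):
            def visit_call(self, call):
                new_call = super().visit_call(call)
                op = new_call.op
                # Calls to partitioned functions reference a GlobalVar whose
                # function carries the "Compiler" attribute.
                if isinstance(op, tvm.ir.GlobalVar):
                    callee = mod[op]
                    if callee.attrs and "Compiler" in callee.attrs:
                        # Hand the extern function NCHW8c inputs and bring its
                        # output back to NCHW for the rest of the graph.
                        args = [relay.layout_transform(a, "NCHW", "NCHW8c")
                                for a in new_call.args]
                        out = relay.Call(op, args, new_call.attrs)
                        return relay.layout_transform(out, "NCHW8c", "NCHW")
                return new_call

        return Wrapper().visit(func)

# Hypothetical usage, right after partitioning:
#   mod = PartitionGraph()(mod)
#   mod = WrapExternCalls()(mod)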

@anijain2305

I wouldn’t call it global optimization. I think it is much simpler than that.

I think AlterOpLayout is driven by the existence of NCHW convolution ops. In my codegen, all NCHW convolution ops are sent to my external compiler, so TVM doesn’t see any NCHW convolution after partitioning. So no layout transform would happen if I invoke AlterOpLayout on the partitioned graph.

But instead of NCHW convolutions, each of my extern functions, decorated with src and dst layouts, can drive the AlterOpLayout pass to propagate the NCHWc constraint to the rest of the graph. So from the layout transform pass's perspective, if it knows the src and dst layouts of the extern functions, it should be no more complicated than converting a graph with NCHW convolutions to NCHWc convolutions.

The conversion of NCHW -> NCHWc convolution happens inside topi via _alter_conv2d_layout. Since we don’t have an equivalent way to “convert” Function to NCHWc layout, yes, we need some way to tell this information to TVM.
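
For context, the op-side mechanism looks roughly like the sketch below. This is illustrative only: the real _alter_conv2d_layout hook lives in topi, is selected per target, and derives the inner block sizes from the tuned workload, so registering on nn.conv2d at a higher level like this would override the built-in hook.

from tvm import relay

@relay.op.register_alter_op_layout("nn.conv2d", level=11)
def _alter_conv2d(attrs, inputs, tinfos, out_type):
    # Copy the existing conv2d attributes and switch to blocked layouts
    # (the 8c/8i8o blocking here is just an example choice).
    new_attrs = {k: attrs[k] for k in attrs.keys()}
    new_attrs["data_layout"] = "NCHW8c"
    new_attrs["kernel_layout"] = "OIHW8i8o"
    new_attrs["out_layout"] = "NCHW8c"
    # AlterOpLayout then inserts the necessary layout_transform ops around
    # the rewritten call.
    return relay.nn.contrib_conv2d_nchwc(*inputs, **new_attrs)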

I haven't really thought about this. I already have a codegen pipeline assuming the graph from TVM comes in NCHW. I later realized that at runtime I'm doing all these NCHW -> NCHWc conversions and back, so I want to do something about it.

In my use case, I always work on a fixed layout, since this is part of the HW requirement. So there is no need to have an nn.layout_transform op explicitly in the extern function.

Moreover, if we run ConvertLayout first before BYOC flow, I need to take into account nn.layout_transform ops and conv2d_NCHWc ops in my pattern matching. I want to avoid this complication.

The point is, "the transform from NCHW to NCHWc happens inside the external function" is my status quo, which I want to change. So if my goal is achieved, the extern function does take the NCHWc layout from TVM. There is no ambiguity.

The workflow I imagine is something like this:

mod = PartitionGraph()(mod)  # Everything NCHW, works as usual
mod = AlterOpLayout(func_converter)(mod)
# func_converter takes a `Function` and decorates it with src and dst layouts
relay.build(mod)

func_converter is meant to be used for the case where call->op is a Function, in place of falter_layout inside AlterTransformMemorizer. So I don't think we need a separate AlterFuncLayout pass to take care of Functions.

As I said in the previous post, it should be no more complicated than the case of converting a graph with NCHW convolutions to NCHWc convolutions.

Yes, you can do that. One way to think about this is that the extern_func is opaque now, and can be considered as a new fused op. And then you are specifying the src/dest layout of this new fused op. From that perspective, it might be ok.

Overall, I have the same concerns that @comaniac mentioned. I am also not sure how TypeInference works with the extern functions and if that would change with your proposal.

Hmm, interesting. Since we don't have something equivalent to Conv2DRel for extern functions, I'm not sure how the input and output shapes of an extern function are handled during type inference. If we change the layout, indeed we need to do something about the shape information during type inference.

If type inference looks at the ops inside extern functions to infer shapes, it would be necessary to propagate the NCHWc layout inside the extern function. In that case, running AlterOpLayout inside the extern function, or before partitioning, could achieve my goal. I would have to rewrite my codegen to ingest an NCHWc graph from TVM, but if this is the only way, I'm ok with that.

I guess what I was thinking somewhat aligns with your idea of AlterOpLayout(func_converter)(mod). It's just that the programming model looks a bit confusing. That's why I was using another name, but I am open to it if everyone agrees with this proposal.

A more general consideration would be upgrading AlterOpLayout to an AlterLayout pass that processes both ops and functions. In this case we would also extract the logic of converting ops and put it together with the logic of converting functions. This is just my rough thinking, though.

Yes, I think we are on the same page. I think my original proposal would work up to the layout transform stage, but I now realize the possibility of a type inference failure (since the function takes the NCHWc layout but the convolution inside is still nn.conv2d from Relay's perspective), if the extern function is not treated as an opaque op (like @anijain2305 mentioned) during type inference.

@anijain2305 AlterOpLayout is associated with each target (x86, ARM). If I decide to run a layout transform pass before BYOC, I want to convert NCHW convolutions to NCHWc convolutions according to my needs (the choice of the inner c dim, etc.). Is ConvertLayout the right pass to use here? I have no experience with this.

Yes, ConvertLayout is more suitable here. Maybe this could help - https://docs.tvm.ai/dev/convert_layout.html
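
For reference, a typical ConvertLayout invocation looks roughly like this (adapted from that page; mod is the Relay module, and whether a particular NCHWc sub-layout is accepted depends on each op's convert-layout support in your TVM version):

import tvm
from tvm import relay

desired_layouts = {"nn.conv2d": ["NCHW", "default"]}  # per-op target layouts
seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout(desired_layouts),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)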
