[BYOC][Layout conversion] How to use the ConvertLayout pass with the BYOC flow?

@anijain2305 I realized that in my BYOC flow I ingest the QNN graph directly, before canonicalization, so what I really need is a conversion from NCHW qnn.convolution to the NCHWc one. Does QNN convolution support the NCHWc layout? From looking at qnn/op/convolution.cc, it seems it only supports NCHW and NHWC?

Ah, yes you are right. QNN ops do not support NCHWc layout (by design).

Not sure if that fits your case, but you can call Legalize first to convert to Relay and then call ConvertLayout. The problem there is that you would end up with quite large pattern matchers for MergeComposite (if you were using QNN ops earlier).
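Something along these lines, assuming `mod` is an IRModule that still contains qnn.* ops; the desired layout below is only an example, use whatever your codegen needs, provided the op’s convert-layout handler supports it:

```python
import tvm
from tvm import relay

# sketch: lower QNN ops to core Relay first, then convert layouts
seq = tvm.transform.Sequential([
    relay.transform.InferType(),
    relay.qnn.transform.Legalize(),         # target-specific QNN legalization
    relay.qnn.transform.CanonicalizeOps(),  # lower remaining qnn.* ops to core Relay ops
    relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
```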

Looking at qnn/op/convolution.cc a bit more, isn’t it the case that canonicalization of QNN conv ops with the NCHWc layout is not supported, but that it is still possible to have qnn.op.convolution with an NCHWc layout in a graph, as long as I don’t call canonicalize?

This fits my use case because I don’t need to call qnn.canonicalize. QnnConv2DRel and QnnConvInferCorrectLayout need to work properly with NCHWc layout, and they look ok.

I see. As a custom pass (or custom changes), that might be ok.

Hi, I’m trying to enable AlterOpLayout to transform the layout of a DNNL function. Have you solved the problem? I am quite confused about how to get the function information.

Hi, I am facing the same issue: I have a function annotated for an external codegen. The function requires input/output conversions (layout and data precision), and I would like to insert such conversion nodes around the function. Notes: (1) I cannot insert conversion nodes inside the function, as the external codegen cannot handle them; (2) I have to do it at the function level rather than at the op level.

I could write a pass that visits the FunctionNode and adds conversion ops. In that case: (1) Which ops should I use? Does cast + LayoutTransform make sense? (2) Suppose I add cast + LayoutTransform and then rerun type inference; should that work? If not, what is a better approach? Modifying the Function type would probably work, since the function is considered a black box, but that seems hacky and ugly to me.

Please share any insights you have.

Thanks a lot

Yes, cast + LayoutTransform should work, and type inference should succeed as long as you insert the casts in a consistent way.
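For example, a tiny sanity check (shapes and dtypes are only illustrative):

```python
import tvm
from tvm import relay

# wrap a value with layout_transform + cast and check that InferType succeeds
x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
y = relay.cast(relay.layout_transform(x, "NCHW", "NHWC"), "float16")
mod = tvm.IRModule.from_expr(relay.Function([x], y))
mod = relay.transform.InferType()(mod)
print(mod)  # output type should be Tensor[(1, 224, 224, 3), float16]
```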

You can pattern match against the output of layout_transform. This puts cast + LayoutTransform outside of the external functions, and TVM (the host) will handle them for you. The external functions then receive the transformed values as parameters at runtime.
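To illustrate, assuming the MergeComposite-based flow mentioned above (op and codegen names below are only examples): write the composite pattern so it starts at the compute op and leaves its inputs as wildcards; the wildcards then bind to the outputs of layout_transform/cast, which stay outside the partitioned function.

```python
from tvm.relay.dataflow_pattern import is_op, wildcard

def make_conv_bias_pattern():
    # The pattern starts at the compute op and its inputs are wildcards, so a
    # producing cast/layout_transform chain is NOT pulled into the composite
    # function and remains on the host side.
    data = wildcard()    # binds to the output of layout_transform/cast
    weight = wildcard()
    bias = wildcard()
    conv = is_op("nn.conv2d")(data, weight)
    return is_op("nn.bias_add")(conv, bias)

# example pattern table; "my_codegen" is a placeholder name
pattern_table = [("my_codegen.conv2d_bias", make_conv_bias_pattern())]
# mod = relay.transform.MergeComposite(pattern_table)(mod)
```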

Thanks for the answer. I did not really get the point about the pattern matching. Could you please explain what you mean?

I guess I wasn’t clear enough. The thing is that I must call the external function codegen to know the exact conversion (layout + type) required per input/output. So the flow is a little weird: (1) optimize, (2) subgraph, (3) codegen for the external functions, (4) insert the conversion ops before and after the external function calls, and (5) some more passes + codegen for the rest.

I was thinking about a mutator over call nodes: mutate the args so that each arg x is replaced by cast(layout_transform(x)), and mutate the call node itself into cast(convert(call node)).
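A minimal sketch of that mutator, assuming “convert” here means layout_transform and assuming a hypothetical `get_conversion(call, kind, index)` hook that returns the `(src_layout, dst_layout, dtype)` the external compiler requested for each input/output:

```python
from tvm import relay
from tvm.relay.expr_functor import ExprMutator

class InsertConversions(ExprMutator):
    """Wrap calls to partitioned (external) functions with layout_transform + cast."""

    def __init__(self, get_conversion):
        super().__init__()
        # hypothetical hook: get_conversion(call, kind, index) -> (src, dst, dtype)
        self.get_conversion = get_conversion

    def visit_call(self, call):
        new_call = super().visit_call(call)
        # only rewrite calls to partitioned functions (here: calls through a GlobalVar)
        if not isinstance(new_call.op, relay.GlobalVar):
            return new_call
        new_args = []
        for i, arg in enumerate(new_call.args):
            src, dst, dtype = self.get_conversion(call, "input", i)
            new_args.append(relay.cast(relay.layout_transform(arg, src, dst), dtype))
        converted = relay.Call(new_call.op, new_args, new_call.attrs, new_call.type_args)
        src, dst, dtype = self.get_conversion(call, "output", 0)
        return relay.cast(relay.layout_transform(converted, src, dst), dtype)
```

After the rewrite I would re-run InferType; per the answer above, it should type-check as long as the conversions are inserted consistently.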

Does that make sense, or do you think I should do it differently?

Thanks

Usually we run layout transformation or dtype conversion (fp16, quantization, etc.) before running the BYOC flow. I’m not sure what you are trying to achieve with (1)-(5); the order seems odd. In the standard BYOC flow, (3) is the last step and happens after (5). Step (4) doesn’t exist, because the conversion ops are inserted before the external functions are created.
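For reference, a sketch of that standard ordering (`mod`, `pattern_table`, and the "my_codegen" name are placeholders):

```python
import tvm
from tvm import relay

pattern_table = []  # your codegen's composite patterns go here

seq = tvm.transform.Sequential([
    # layout / dtype conversion first, on the whole graph
    relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
    # then partition for the external codegen
    relay.transform.MergeComposite(pattern_table),
    relay.transform.AnnotateTarget("my_codegen"),
    relay.transform.MergeCompilerRegions(),
    relay.transform.PartitionGraph(),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
# relay.build(mod, target=...)  # the external codegen runs here, as the last step
```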

I agree that the flow is odd.

Note that I don’t use relay.build(), because the HW solution is very different from other solutions (and so are the required compilation artifacts), so I had to write a separate compilation flow, which gives me that freedom. So I am not restricted to the relay.build() compilation flow.

I integrate with a third-party DLA compiler that requires conversions (type, layout, padding, etc.). The problem is that I don’t know which transformations are required prior to calling this compiler. So I must subgraph the Relay code, call the external compiler, and only then do I know which transformations are required.

Assume I cannot integrate the logic of the external DLA compiler into TVM, so I must call the codegen to get this knowledge.

To complicate things even more, the transformations will be offloaded to another HW (not the host CPU; this HW is also treated as an external codegen), and no host is involved between the transformation and the DLA function call. So the flow I have to handle is: (1) subgraph for the DLA, (2) codegen for the DLA, (3) insert transformations based on what the DLA compiler requests, (4) subgraph again for the other accelerators (including the transformation HW), and (5) call codegen for the remaining targets.

That seems like a very odd flow. Question… what exactly is handled by the “host” in your case?

So what is the problem with doing everything up to (1) via the BYOC flow, then implementing (2) and (3) as another ExprMutator pass that also explicitly does the partitioning you expect in (4), and then just taking that as the input to (5)?
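Concretely, something shaped like this (every helper marked "hypothetical" is a placeholder for your own code, not a TVM API):

```python
from tvm import relay

def custom_compile(mod, dla_pattern_table):
    # (1) standard BYOC partitioning for the DLA
    mod = relay.transform.MergeComposite(dla_pattern_table)(mod)
    mod = relay.transform.AnnotateTarget("dla")(mod)
    mod = relay.transform.PartitionGraph()(mod)

    # (2) run the external DLA codegen to learn the required conversions
    conversion_spec = run_dla_codegen(mod)  # hypothetical

    # (3) + (4) one ExprMutator-style pass: insert cast/layout_transform around the
    # DLA calls and partition those conversions for the transformation HW
    mod["main"] = InsertConversionsAndPartition(conversion_spec).visit(mod["main"])  # hypothetical
    mod = relay.transform.InferType()(mod)

    # (5) codegen for the remaining targets in your own compilation flow
    return compile_remaining_targets(mod)  # hypothetical
```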

Thanks, yes, that’s what I am doing. My host basically sees the entire graph as a single op.