[Relay] Register op pattern based on target

Is it possible to register an op pattern based on the target in Relay?

For example, I have the TVM-style implementation of matmul and the cblas (external library) implementation of matmul. If my target says to use the TVM version, I want the pattern to be OUT_ELEMWISE_FUSABLE. However, if my target says to use cblas, I want my op pattern to be OPAQUE.
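Roughly, this is what the pattern registration looks like today (a sketch using the tvm.relay.op registration API, with nn.dense as a stand-in for my matmul op):

```python
from tvm.relay.op import op as reg

# Current situation: the pattern is attached once, at op registration time.
reg.register_pattern("nn.dense", reg.OpPattern.OUT_ELEMWISE_FUSABLE)

# What I would like, conceptually: pick OPAQUE instead when the target
# selects the external cblas implementation.
# reg.register_pattern("nn.dense", reg.OpPattern.OPAQUE)
```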

I have tested this and found that OPAQUE has better performance when using an external implementation of matmul.

It already seems possible to register a schedule based on the target.
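For example, the existing dense schedule registration gets the target passed in and forwards to a TOPI generic function (a sketch paraphrasing relay/op/nn/_nn.py, not the exact upstream code):

```python
import topi
from tvm.relay.op import op as reg

# The Relay-level schedule hook receives the target and dispatches to the
# TOPI generic function, which each backend overrides. The pattern, by
# contrast, has no such target hook.
@reg.register_schedule("nn.dense")
def schedule_dense(attrs, outputs, target):
    with target:
        return topi.generic.schedule_dense(outputs)
```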

No, an op pattern is fixed at module import time. But you can achieve what you want by appending -libs=cblas to your target string.
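The flag ends up in target.libs, which the TOPI computes and schedules can inspect (a small sketch; the -mcpu flag is just an illustration, and TVM has to be built with a BLAS library via USE_BLAS in config.cmake for the cblas path to link):

```python
import tvm

# -libs=cblas is parsed into target.libs; TOPI code checks this list to
# decide whether to call into the external library.
target = tvm.target.create("llvm -mcpu=core-avx2 -libs=cblas")
print(target.libs)  # ['cblas']
```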

You can have a look at our cuda convolution implementation for the same use case. If the target string is “cuda -libs=cudnn”, we use convolution kernels from cuDNN, and fusion will be disabled. You can keep the op pattern as OUT_ELEMWISE_FUSABLE.

Thanks! How is fusion disabled? Does it happen because the schedule is set to extern when cudnn is in target.libs?

That’s right. Setting the op pattern to OUT_ELEMWISE_FUSABLE makes the convolution, along with the elemwise ops that follow it, available for fusion, but it is up to the schedule generator whether to actually generate fused code.
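Concretely, the check sits in the cuda conv2d schedule; roughly like this (a paraphrase, not the exact upstream code):

```python
import tvm
from topi import generic

# When cuDNN is selected via -libs=cudnn, the convolution is an extern call,
# so the schedule bails out to schedule_extern and no fused kernel is
# generated, even though the op pattern is still OUT_ELEMWISE_FUSABLE.
def schedule_conv2d_nchw_cuda(cfg, outs):
    # cfg is the AutoTVM config in the real code; unused on the cuDNN path
    target = tvm.target.current_target()
    if "cudnn" in target.libs:
        return generic.schedule_extern(outs)
    # placeholder for the real hand-written / AutoTVM cuda schedule
    return tvm.create_schedule([x.op for x in outs])
```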

Great! Also, it seems that the -libs=cblas switch hasn’t been implemented for dense or batch_matmul yet. I can work on that.

@masahi I actually found that this doesn’t work as expected, and it seems to be because the TOPI schedule is only chosen after fusion has already taken place. For example, I am using cblas for dense and batch_matmul. Even though dense’s schedule is set to generic.extern, dense still gets fused with a lot of other ops. This is causing a huge perf regression on my side (> 2x with BERT base).

Do you have any suggestions for how to fix this?

I don’t understand what you mean by “the cblas op is fused with other ops”. How does codegen work in that case (extern ops cannot be fused with TVM-generated ops)?

I’m assuming our cudnn/cublas integration already works the way you are trying to achieve with cblas. Have you tried your model with our cuda backend and -libs=cublas?

I am not sure how the codegen actually works, but when I run the model with the debug graph runtime, I see ops named fused_nn_dense_add. On my machine, one of these ops was taking 5ms. After manually changing the op pattern of dense to OPAQUE in the Relay frontend, the ops are no longer fused and dense takes under 1ms.
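For reference, this is roughly how I am looking at it (a sketch: `mod`, `params`, the “data” input name, and the input shape come from my model import, and I’m on the API where relay.build returns a (graph, lib, params) triple):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime

target = "llvm -mcpu=core-avx2 -libs=cblas"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

# The debug runtime times every node individually and prints a breakdown,
# which is where fused_nn_dense_add shows up.
m = debug_runtime.create(graph, lib, tvm.cpu(0), dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.set_input("data", np.zeros((1, 128, 768), dtype="float32"))  # placeholder shape
m.run()
```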

Can you point me to where I should look in the code to debug?

Ok, it looks like topi x86 doesn’t have its own schedule_extern implementation. Have a look at the cuda schedule_extern and how it skips tvm.tensor.ExternOp; you need something similar for x86. Or we could make the cuda schedule_extern available to other backends, since there is nothing cuda-specific in it.
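Something along these lines could work for x86 (a rough sketch, not the cuda code verbatim and not necessarily what the final fix should look like; the function name and the simple parallel schedule for the trailing ops are just illustrative):

```python
import tvm
from topi import generic

# Schedule only the injective ops that follow the extern call and skip the
# ExternOp itself, since its body (the cblas call) is opaque to TVM.
@generic.schedule_extern.register(["cpu"])
def schedule_extern_x86(outs):
    outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
    s = tvm.create_schedule([x.op for x in outs])
    tvm.schedule.AutoInlineInjective(s)
    for out in outs:
        if isinstance(out.op, tvm.tensor.ExternOp):
            continue  # leave the external library call untouched
        s[out].parallel(s[out].op.axis[0])  # simple CPU schedule for the tail ops
    return s
```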


I just sent out a PR here: https://github.com/dmlc/tvm/pull/3983

Let me know what you think!