[Relay] Register op pattern based on target

Is it possible to register an op pattern based on the target in Relay?

For example, I have the TVM-style implementation of matmul and the cblas (external library) implementation of matmul. If my target says to use the TVM version, I want the pattern to be OUT_ELEMWISE_FUSABLE. However, if my target says to use cblas, I want my op pattern to be OPAQUE.
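Roughly, this is what the pattern registration looks like today (a sketch using the tvm.relay.op registration API, with nn.dense as a stand-in for my matmul op):

```python
from tvm.relay.op import op as reg

# Current situation: the pattern is attached once, at op registration time.
reg.register_pattern("nn.dense", reg.OpPattern.OUT_ELEMWISE_FUSABLE)

# What I would like, conceptually: pick OPAQUE instead when the target
# selects the external cblas implementation.
# reg.register_pattern("nn.dense", reg.OpPattern.OPAQUE)
```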

I have tested this and found that OPAQUE has better performance when using an external implementation of matmul.

It already seems possible to register a schedule based on the target.
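For example, the existing dense schedule registration gets the target passed in and forwards to a TOPI generic function (a sketch paraphrasing relay/op/nn/_nn.py, not the exact upstream code):

```python
import topi
from tvm.relay.op import op as reg

# The Relay-level schedule hook receives the target and dispatches to the
# TOPI generic function, which each backend overrides. The pattern, by
# contrast, has no such target hook.
@reg.register_schedule("nn.dense")
def schedule_dense(attrs, outputs, target):
    with target:
        return topi.generic.schedule_dense(outputs)
```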

No, an op pattern is fixed at module import time. But you can achieve what you want by appending -libs=cblas to your target string.
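The flag ends up in target.libs, which the TOPI computes and schedules can inspect (a small sketch; the -mcpu flag is just an illustration, and TVM has to be built with a BLAS library via USE_BLAS in config.cmake for the cblas path to link):

```python
import tvm

# -libs=cblas is parsed into target.libs; TOPI code checks this list to
# decide whether to call into the external library.
target = tvm.target.create("llvm -mcpu=core-avx2 -libs=cblas")
print(target.libs)  # ['cblas']
```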

You can have a look at our cuda convolution implementation for the same use case. If the target string is “cuda -libs=cudnn”, we use convolution kernels from cuDNN, and fusion will be disabled. You can keep the op pattern as OUT_ELEMWISE_FUSABLE.

Thanks! How is fusion disabled? Does it happen because the schedule is set to extern when cudnn is in target.libs?

That’s right. Setting the op pattern to OUT_ELEMWISE_FUSABLE makes the convolution, along with the elemwise ops that follow it, available for fusion, but it is up to the schedule generator whether to actually generate fused code.
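Concretely, the check sits in the cuda conv2d schedule; roughly like this (a paraphrase, not the exact upstream code):

```python
import tvm
from topi import generic

# When cuDNN is selected via -libs=cudnn, the convolution is an extern call,
# so the schedule bails out to schedule_extern and no fused kernel is
# generated, even though the op pattern is still OUT_ELEMWISE_FUSABLE.
def schedule_conv2d_nchw_cuda(cfg, outs):
    # cfg is the AutoTVM config in the real code; unused on the cuDNN path
    target = tvm.target.current_target()
    if "cudnn" in target.libs:
        return generic.schedule_extern(outs)
    # placeholder for the real hand-written / AutoTVM cuda schedule
    return tvm.create_schedule([x.op for x in outs])
```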

Great! Also, it seems that the -libs=cblas switch hasn’t been implemented for dense or batch_matmul yet. I can work on that.

@masahi I actually found that this doesn’t work as expected, and it seems to be because the TOPI schedule is only chosen after fusion has already taken place. For example, I am using cblas for dense and batch_matmul. Even though dense’s schedule is set to generic.extern, dense still gets fused with a lot of other ops. This is causing a huge perf regression on my side (> 2x with BERT base).

Do you have any suggestions for how to fix this?

I don’t understand what you mean by “the cblas op is fused with other ops”. How does codegen work in that case (extern ops cannot be fused with TVM-generated ops)?

I’m assuming our cudnn/cublas integration already works the way you are trying to achieve with cblas. Have you tried your model with our cuda backend and -libs=cublas?

I am not sure how the codegen actually works, but when I run the model with the debug graph runtime, I see ops named fused_nn_dense_add. On my machine, one of these ops was taking 5ms. After manually changing the op pattern of dense to OPAQUE in the Relay frontend, the ops are no longer fused and dense takes under 1ms.
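For reference, this is roughly how I am looking at it (a sketch: `mod`, `params`, the “data” input name, and the input shape come from my model import, and I’m on the API where relay.build returns a (graph, lib, params) triple):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime

target = "llvm -mcpu=core-avx2 -libs=cblas"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

# The debug runtime times every node individually and prints a breakdown,
# which is where fused_nn_dense_add shows up.
m = debug_runtime.create(graph, lib, tvm.cpu(0), dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.set_input("data", np.zeros((1, 128, 768), dtype="float32"))  # placeholder shape
m.run()
```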

Can you point me to where I should look in the code to debug?

Ok, it looks like topi x86 doesn’t have its own schedule_extern implementation. Have a look at the cuda schedule_extern and how it skips tvm.tensor.ExternOp; you need something similar for x86. Or we could make the cuda schedule_extern available to other backends, since there is nothing cuda-specific in it.
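Something along these lines could work for x86 (a rough sketch, not the cuda code verbatim and not necessarily what the final fix should look like; the function name and the simple parallel schedule for the trailing ops are just illustrative):

```python
import tvm
from topi import generic

# Schedule only the injective ops that follow the extern call and skip the
# ExternOp itself, since its body (the cblas call) is opaque to TVM.
@generic.schedule_extern.register(["cpu"])
def schedule_extern_x86(outs):
    outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
    s = tvm.create_schedule([x.op for x in outs])
    tvm.schedule.AutoInlineInjective(s)
    for out in outs:
        if isinstance(out.op, tvm.tensor.ExternOp):
            continue  # leave the external library call untouched
        s[out].parallel(s[out].op.axis[0])  # simple CPU schedule for the tail ops
    return s
```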


I just sent out a PR here: https://github.com/dmlc/tvm/pull/3983

Let me know what you think!