[Auto-tune] Error occurs during inference when using auto-tuned schedule

Dear TVM community,

I found that the auto-tuned schedule for conv2d [in_channels=80, out_channels=192, kernel_size=(3,3), strides=(1,1), padding=(0,0)] makes inference fail, while the default schedule works well.

The relay program is:

import tvm
from tvm import relay

def get_network(name, batch_size):
    """Get the symbol definition and random weight of a network"""
    weight_name = 'weight'
    input_shape = (1, 80, 73, 73)
    output_shape = (1, 192, 71, 71)
    params = {}
    weight = tvm.nd.empty(shape=(192, 80, 3, 3))
    params[weight_name] = weight
    v = relay.var('data', shape=input_shape)  # 1 x 80 x 73 x 73
    v = relay.nn.conv2d(v,
            weight=relay.var(weight_name, shape=weight.shape),
            strides=(1, 1), padding=(0, 0),
            channels=192, kernel_size=(3, 3))  # 1 x 192 x 71 x 71
    fn = relay.Function(relay.analysis.free_vars(v), v)
    mod = relay.Module.from_expr(fn)
    return mod, params, input_shape, output_shape
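As a quick sanity check on the shapes above, the expected output size follows from the standard convolution arithmetic (plain Python; the helper name is mine):

```python
def conv2d_out_hw(in_hw, kernel, stride=1, padding=0):
    """Output spatial size of a conv2d: (H + 2*P - K) // S + 1."""
    return (in_hw + 2 * padding - kernel) // stride + 1

h = conv2d_out_hw(73, 3)  # stride=1, padding=0
print((1, 192, h, h))     # matches output_shape: (1, 192, 71, 71)
```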

Running the reproduction script produces the following output and error message:

Extract tasks...
Tuning...
[Task  1/ 1]  Current/Best: 2233.51/3996.47 GFLOPS | Progress: (100/100) | 634.76 s Done.
Compile...
Evaluate inference time cost...
Traceback (most recent call last):

  File "/home/yaoyao/PycharmProjects/play_ground/demo.py", line 154, in <module>
    tune_and_evaluate(tuning_option)

  File "/home/yaoyao/PycharmProjects/play_ground/demo.py", line 148, in tune_and_evaluate
    prof_res = np.array(ftimer().results) * 1000  # convert to millisecond

  File "/home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/module.py", line 198, in evaluator
    blob = feval(*args)

  File "tvm/_ffi/_cython/./function.pxi", line 310, in tvm._ffi._cy3.core.FunctionBase.__call__

  File "tvm/_ffi/_cython/./function.pxi", line 245, in tvm._ffi._cy3.core.FuncCall

  File "tvm/_ffi/_cython/./function.pxi", line 234, in tvm._ffi._cy3.core.FuncCall3

  File "tvm/_ffi/_cython/./base.pxi", line 171, in tvm._ffi._cy3.core.CALL

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x65) [0x7f2de73be845]
  [bt] (4) /home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(+0xc26c07) [0x7f2de7416c07]
  [bt] (3) /home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(+0xc268ba) [0x7f2de74168ba]
  [bt] (2) /home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(tvm::runtime::GraphRuntime::Run()+0x37) [0x7f2de7426557]
  [bt] (1) /home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(+0xc38727) [0x7f2de7428727]
  [bt] (0) /home/yaoyao/anaconda3/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(+0xbe5125) [0x7f2de73d5125]
  File "/home/yaoyao/repos/tvm/src/runtime/module_util.cc", line 73
TVMError: Check failed: ret == 0 (-1 vs. 0) : Assert fail: (73 == int32(arg2.shape[2])), Argument arg2.shape[2] has an unsatisfied constraint

Any help would be greatly appreciated!

UPDATE:
I tried the same code on an NVIDIA TITAN XP and it works fine there, but the error occurs when I run it on a GTX 1070, GTX 1080, or RTX 2080 Ti.

When the model is compiled with the tuned schedule at opt_level = 3, the input shapes are
(1, 80, 73, 73) and (4, 4, 80, 192), and the above error occurs.

When the model is compiled with the tuned schedule at opt_level = 2, the input shapes are
(1, 80, 73, 73) and (192, 80, 3, 3), and the module works well.

So the problem might come from some optimization related to data layout?
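One hedged guess at what the (4, 4, 80, 192) kernel shape is: it matches the pre-transformed kernel layout of a Winograd F(2x2, 3x3) schedule, (alpha, alpha, CI, CO) with alpha = m + r - 1. A small sketch of that arithmetic (the helper name is mine, not a TVM API):

```python
def winograd_kernel_shape(co, ci, r=3, m=2):
    """Pre-transformed Winograd kernel layout (alpha, alpha, CI, CO)
    for F(m x m, r x r), where alpha = m + r - 1."""
    alpha = m + r - 1
    return (alpha, alpha, ci, co)

print(winograd_kernel_shape(co=192, ci=80))  # (4, 4, 80, 192)
```

If that reading is right, the layout-altering pass at opt_level = 3 rewrites the weight into this transformed layout, and the shape check fails because the runtime is still fed the original (192, 80, 3, 3) weight.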

I second this issue. For me it happened when compiling for the Jetson TX2 after auto-tuning. Setting opt_level=2 also resolved it for me, but opt_level=3 with the fallback schedule was more performant than the auto-tuned network at opt_level=2.

Thanks for your reply.

I have not run it on a Jetson TX2, but in my experiment on a Tesla V100, the tuned model with opt_level = 2 is faster than the model under the default schedule with opt_level = 3.

It seems to me that there is a problem in the schedule after _alter_conv2d_layout is applied, which only happens when opt_level=3.

cc @merrymercy @vinx13

PR filed: https://github.com/apache/incubator-tvm/pull/4260
