CUDA_ERROR_INVALID_PTX when trying to run TensorFlow DeeplabV3+ model

Hi,

I imported the DeepLabV3+ (Xception) model 'xception65_coco_voc_trainval', downloaded from the TF model zoo (https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md). It runs fine on CPU but fails with an error on GPU. Here is the script:

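import numpy as np
import tensorflow as tf
from PIL import Image

import tvm
from tvm import relay
from tvm.contrib import graph_runtime as runtime
import tvm.relay.testing.tf as tf_testing

# Compile for and run on the first CUDA device.
target = tvm.target.cuda()
ctx = tvm.gpu(0)
model_path = '/tensorflow/deeplabv3_pascal_train_aug/frozen_inference_graph.pb'
image_url = "/deeplab/test_dataset/image3.jpg"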

with tf.gfile.GFile(model_path, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    graph = tf.import_graph_def(graph_def, name='')
    graph_def = tf_testing.ProcessGraphDefParam(graph_def)


image = Image.open(image_url)
image_resize = image.convert('RGB').resize((513, 513))
x = np.array(image_resize)
x = np.expand_dims(x, 0)

# Convert the frozen TensorFlow graph to a Relay module and build it for CUDA.
mod, params = relay.frontend.from_tensorflow(graph_def, layout='NHWC', shape=x.shape)
with relay.build_config(opt_level=3):
    graph_json, lib, params = relay.build(mod,
                                          target=target,
                                          params=params)

module = runtime.create(graph_json, lib, ctx)
input_name = "ImageTensor"  # name of the model's input node
module.set_input(key=input_name, value=x, **params)
module.run()  # the CUDA error below is raised here
out_0 = module.get_output(0).asnumpy()

The error is shown below:

[17:36:18] /home/incubator-tvm/src/te/schedule/bound.cc:119: not in feed graph consumer = compute(placeholder_red_temp.repl, 0x11f290c0)
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 256), 'float32'), ('TENSOR', (1, 1, 256, 21), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 256), 'float32'), ('TENSOR', (1, 1, 256, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 304), 'float32'), ('TENSOR', (1, 1, 304, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 1280), 'float32'), ('TENSOR', (1, 1, 1280, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 1, 1, 2048), 'float32'), ('TENSOR', (1, 1, 2048, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 1536), 'float32'), ('TENSOR', (1, 1, 1536, 2048), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 1536), 'float32'), ('TENSOR', (1, 1, 1536, 1536), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 1024), 'float32'), ('TENSOR', (1, 1, 1024, 1536), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 1024), 'float32'), ('TENSOR', (1, 1, 1024, 1024), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 728), 'float32'), ('TENSOR', (1, 1, 728, 1024), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 728), 'float32'), ('TENSOR', (1, 1, 728, 728), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 256), 'float32'), ('TENSOR', (1, 1, 256, 728), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 256), 'float32'), ('TENSOR', (1, 1, 256, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 128), 'float32'), ('TENSOR', (1, 1, 128, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 128), 'float32'), ('TENSOR', (1, 1, 128, 128), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 257, 257, 128), 'float32'), ('TENSOR', (1, 1, 128, 128), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 257, 257, 64), 'float32'), ('TENSOR', (1, 1, 64, 128), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 257, 257, 32), 'float32'), ('TENSOR', (3, 3, 32, 64), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc_winograd_direct.cuda', ('TENSOR', (1, 257, 257, 32), 'float32'), ('TENSOR', (3, 3, 32, 64), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 515, 515, 3), 'float32'), ('TENSOR', (3, 3, 3, 32), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 257, 257, 64), 'float32'), ('TENSOR', (1, 1, 64, 128), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 128), 'float32'), ('TENSOR', (1, 1, 128, 256), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 2048), 'float32'), ('TENSOR', (1, 1, 2048, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 129, 129, 256), 'float32'), ('TENSOR', (1, 1, 256, 48), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
running TVM deeplab on image ...
Traceback (most recent call last):
  File "/home/incubator-tvm/tutorials/test_tensorflow/deeplab/deeplabv3_plus_tvm.py", line 168, in <module>
    module.run()
  File "/home/incubator-tvm/python/tvm/contrib/graph_runtime.py", line 177, in run
    self._run()
  File "/home/incubator-tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 225, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (3) /home/incubator-tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f5094810c65]
  [bt] (2) /home/incubator-tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xb6) [0x7f50948aac56]
  [bt] (1) /home/incubator-tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x9df) [0x7f50948aaa5f]
  [bt] (0) /home/incubator-tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x82) [0x7f5093e640b2]
  File "/home/incubator-tvm/src/runtime/cuda/cuda_module.cc", line 105
  File "/home/incubator-tvm/src/runtime/library_module.cc", line 78
CUDAError: Check failed: ret == 0 (-1 vs. 0) : cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX

Process finished with exit code 1
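
Side note: every fallback warning above embeds the workload tuple, so the log can be summarised to see which conv2d shapes have no tuned config. Below is a minimal sketch of such a helper. It is not part of TVM; the function name `parse_fallback_workloads` is made up for illustration, and it assumes the warning text has exactly the format shown in the log above.

```python
import ast
import re

# Captures the workload tuple printed inside an AutoTVM fallback warning.
# Assumes the message format seen in the log above.
_WORKLOAD_RE = re.compile(r"workload=(\(.*\))\. A fallback configuration")


def parse_fallback_workloads(log_text):
    """Return (task_name, data_shape, kernel_shape, strides) per warning line."""
    workloads = []
    for line in log_text.splitlines():
        match = _WORKLOAD_RE.search(line)
        if match is None:
            continue  # not a fallback warning
        # The workload is printed as a Python literal, so it can be parsed safely.
        workload = ast.literal_eval(match.group(1))
        task_name, data, kernel, strides = workload[:4]
        # data and kernel look like ('TENSOR', shape, dtype); keep only the shape.
        workloads.append((task_name, data[1], kernel[1], strides))
    return workloads


sample = (
    "WARNING:autotvm:Cannot find config for target=cuda -model=unknown, "
    "workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 65, 65, 256), 'float32'), "
    "('TENSOR', (1, 1, 256, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). "
    "A fallback configuration is used, which may bring great performance regression."
)
for name, data_shape, kernel_shape, strides in parse_fallback_workloads(sample):
    print(name, data_shape, kernel_shape, strides)
```

Run on the full log, this shows that all the untuned tasks here are `conv2d_nhwc.cuda` (plus one `conv2d_nhwc_winograd_direct.cuda`), i.e. the NHWC CUDA schedules.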

I tried tuning it with AutoTVM, but for some conv ops the reported GFLOPS is always 0.00:

[Task 17/26]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (132/1296) | 50.07 s
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode

The problem has been solved: replace 'NHWC' with 'NCHW' in the from_tensorflow call, then tune the conv2d tasks with AutoTVM.

mod, params = relay.frontend.from_tensorflow(graph_def, layout='NCHW', shape=x.shape)
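
For anyone landing here later, this is roughly what the "tune conv2d" step could look like. It is only a sketch, assuming an AutoTVM API from the same era as the traceback above (`relay.build_config`, `relay.op.get`). The function name `tune_conv2d_tasks`, the log file name, and the trial and measurement settings are illustrative assumptions, not the exact values used for the fix.

```python
def tune_conv2d_tasks(mod, params, target, log_file="deeplab_conv2d.log", n_trial=1000):
    """Extract the nn.conv2d tasks from a Relay module and tune each with AutoTVM.

    Hypothetical helper; returns the path of the tuning log.
    """
    # Imported lazily so the helper can be defined on a machine without TVM.
    from tvm import autotvm, relay
    from tvm.autotvm.tuner import XGBTuner

    # Only conv2d is tuned; the other ops keep their default schedules.
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params,
        ops=(relay.op.get("nn.conv2d"),))

    # Build and measure candidates on the local GPU (settings are assumptions).
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150))

    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        # Never ask for more trials than the config space contains.
        trials = min(n_trial, len(task.config_space))
        tuner = XGBTuner(task, loss_type="rank")
        tuner.tune(
            n_trial=trials,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(trials, prefix=prefix),
                autotvm.callback.log_to_file(log_file),
            ])
    return log_file
```

The resulting log would then be applied by wrapping the `relay.build` call in `with autotvm.apply_history_best(log_file):`, so the tuned configs are picked up instead of the fallback ones.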