0.7.dev1 cudaErrorCudartUnloading

0.7.dev1 produces a very odd error that doesn’t happen in 0.6.0:

CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: the launch timed out and was terminated

  import tvm
  from tvm import relay
  from tvm.contrib import graph_runtime
  from gluoncv import model_zoo
  import numpy as np

  target = 'cuda -libs=cudnn,cublas'
  ctx = tvm.gpu(0)
  block = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
  mod, params = relay.frontend.from_mxnet(block, shape={'data': (1,3,512,512)}, dtype='float32')
  net = mod["main"]
  net = relay.Function(net.params, net.body, None, net.type_params, net.attrs)
  mod = tvm.IRModule.from_expr(net)

  with tvm.transform.PassContext(opt_level=3):
      graph, lib, params = relay.build_module.build(
          mod, target=target, params=params) 

  module = graph_runtime.create(graph, lib, ctx)
  module.set_input(**params)
  module.set_input('data', tvm.nd.array(np.random.randn(*(1,3,512,512)).astype('float32')))
  module.run()
  module.get_output(0)

This produces the following error:

TVMError: Traceback (most recent call last):
  [bt] (3) /home/mkrzus/github/tvm-latest/build/libtvm.so(TVMArrayCopyToBytes+0xa) [0x7f7078d5865a]
  [bt] (2) /home/mkrzus/github/tvm-latest/build/libtvm.so(tvm::runtime::ArrayCopyToBytes(DLTensor const*, void*, unsigned long)+0x189) [0x7f7078d585f9]
  [bt] (1) /home/mkrzus/github/tvm-latest/build/libtvm.so(tvm::runtime::CUDADeviceAPI::CopyDataFromTo(void const*, unsigned long, void*, unsigned long, unsigned long, DLContext, DLContext, DLDataType, void*)+0xee) [0x7f7078dc099e]
  [bt] (0) /home/mkrzus/github/tvm-latest/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x67) [0x7f707834fab7]
  File "../src/runtime/cuda/cuda_device_api.cc", line 213
CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: the launch timed out and was terminated

However, when reverting to 0.6.0, there are no issues with the same model:

  # same imports and ctx = tvm.gpu(0) as above
  target = 'cuda -libs=cudnn,cublas'
  block = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
  mod, params = relay.frontend.from_mxnet(block, shape={'data': (1,3,512,512)}, dtype='float32')

  with relay.build_config(opt_level=3):
      graph, lib, params = relay.build(
          mod, target=target, params=params) 

  module = graph_runtime.create(graph, lib, ctx)
  module.set_input(**params)
  module.set_input('data', tvm.nd.array(np.random.randn(*(1,3,512,512)).astype('float32')))
  module.run()
  module.get_output(0)
  # no error. 

This error happens on both a Titan X and a Jetson TX2.

I found that TVM took an unusually long time when running similar code, so it seems the generated CUDA kernel may have a problem. The GPU I used is a Tesla V100 with CUDA 10.0.

import tvm
from tvm import relay
from tvm.contrib import graph_runtime
from gluoncv import model_zoo
import numpy as np

#target = 'cuda -libs=cudnn,cublas'
target = 'cuda'
ctx = tvm.gpu(0)
block = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
mod, params = relay.frontend.from_mxnet(block, shape={'data': (1,3,512,512)}, dtype='float32')

with relay.build_config(opt_level=3):
    graph, lib, params = relay.build_module.build(
        mod, target=target, params=params)

module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input('data', tvm.nd.array(np.random.randn(*(1,3,512,512)).astype('float32')))
module.run()
print(module.get_output(0))
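
A rough end-to-end timing with the graph runtime's time_evaluator can confirm the slow run before reaching for a profiler. This is a minimal sketch, assuming the module and ctx from the script above:

  # Time a single end-to-end run of the compiled model (inputs were already
  # set above). Calling ftimer() returns a ProfileResult with timing statistics.
  ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=1)
  print("One inference took %.2f s" % ftimer().mean)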

The profiling suggests that fused_vision_non_max_suppression_kernel1 took a very long time to finish:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  251.085s         1  251.085s  251.085s  251.085s  fused_vision_non_max_suppression_kernel1
                    0.00%  11.873ms       170  69.842us  1.4720us  2.3369ms  [CUDA memcpy HtoD]
                    0.00%  1.3170ms         3  439.01us  436.61us  441.18us  fused_nn_conv2d_add_nn_leaky_relu_kernel0
                    0.00%  948.35us         3  316.12us  314.53us  318.33us  fused_nn_conv2d_add_nn_leaky_relu_2_kernel0
                    0.00%  822.33us         5  164.47us  163.07us  166.02us  fused_nn_conv2d_add_nn_relu_4_kernel0
                    0.00%  781.53us         3  260.51us  258.62us  262.43us  fused_nn_conv2d_add_nn_leaky_relu_6_kernel0
                    0.00%  626.24us         3  208.75us  208.29us  208.99us  fused_nn_conv2d_add_nn_leaky_relu_1_kernel0
                    0.00%  242.85us         2  121.42us  121.22us  121.63us  fused_nn_conv2d_add_nn_leaky_relu_3_kernel0
...
      API calls:   99.55%  251.100s       171  1.46842s  8.7930us  251.085s  cudaMemcpy
                    0.33%  840.11ms         1  840.11ms  840.11ms  840.11ms  cuModuleLoadData
                    0.11%  287.13ms       188  1.5273ms  2.9390us  282.34ms  cudaMalloc
                    0.01%  12.865ms       188  68.432us  4.5370us  6.3585ms  cudaFree
                    0.00%  1.8782ms         1  1.8782ms  1.8782ms  1.8782ms  cuModuleUnload
...

I asked some TVM folks, and it seems you need to turn on Thrust in config.cmake by setting set(USE_THRUST ON) and rebuilding. Otherwise TVM falls back to the brute-force argsort, which makes the NMS algorithm extremely slow and likely explains both the long run time and the launch-timeout error above. cc @KayneWest
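
As a quick sanity check, you can verify whether your TVM build actually picked up Thrust by looking for the packed function the Thrust runtime registers. This is just a sketch; the registry name tvm.contrib.thrust.sort is taken from the TVM sources and may differ between versions:

  import tvm

  # If TVM was compiled with set(USE_THRUST ON), the Thrust-based sort is
  # exposed as a registered global function; otherwise the lookup returns None.
  # Note: the name "tvm.contrib.thrust.sort" is an assumption based on the TVM
  # source tree and may change between releases.
  has_thrust = tvm.get_global_func("tvm.contrib.thrust.sort", allow_missing=True) is not None
  print("TVM built with Thrust:", has_thrust)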


Thanks for the reply. I turned on Thrust and it runs perfectly. It's just odd that the previous non-Thrust build, which used to run ridiculously fast, suddenly stopped performing well. But it is what it is.