Radix_sort: failed on 2nd step: cudaErrorInvalidValue

Hi all, I’m testing a C++ deployment of GluonCV’s YOLO MobileNet model (yolo3_mobilenet1.0_coco).

This is a fairly standard deployment that follows this basic strategy:

        // Retrieve the module handle and look up the packed functions
        tvm::runtime::Module *mod = (tvm::runtime::Module *) detector_handle.get();
        tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
        set_input("data", input);
        tvm::runtime::PackedFunc run = mod->GetFunction("run");
        run();
        tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
        tvm::runtime::NDArray output = get_output(0);  // fetch the first output tensor

This deployment works fine on a Jetson TX2 and a GTX 1080 Ti; however, on a Turing-based NVIDIA GTX 1660, all sorts of problems happen.

  1. During the standard compile process, simply using Relay with no tuning, the forward pass is extremely slow (I understand why this happens, so we move on to number 2).

  2. During a tuning process similar to this one (https://docs.tvm.ai/tutorials/autotvm/tune_relay_x86.html#sphx-glr-tutorials-autotvm-tune-relay-x86-py), the GPU model takes ~10 minutes per forward pass even after it’s been ‘tuned’, while the CPU takes fractions of a second. I’m not sure why, even after weeks of changing and augmenting almost every variable we’re allowed to.

  3. In order to get around that problem on this specific GPU (the 1660), I’ve installed Thrust, following suggestions from this community, but I continually run into:

    terminate called after throwing an instance of 'dmlc::Error'
          what():  [11:13:29] /opt/src/tvm/src/runtime/library_module.cc:78: Check failed: ret == 0 (-1 vs. 0) : radix_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument
    
    Stack trace:
             [bt] (0) /opt/catkin_ws/devel/lib/recognition/debug_pose_model(dmlc::LogMessageFatal::~LogMessageFatal()+0x4e) [0x55f62b298912]
          [bt] (1) /usr/local/lib/libtvm_runtime.so(+0x7675d) [0x7f60a068775d]
          [bt] (2) /usr/local/lib/libtvm_runtime.so(+0xec957) [0x7f60a06fd957]
          [bt] (3) /usr/local/lib/libtvm_runtime.so(tvm::runtime::GraphRuntime::Run()+0x37) [0x7f60a06fd9d7]
          [bt] (4) /opt/catkin_ws/devel/lib/recognition/debug_pose_model(std::function<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x5a) [0x55f62b2b81aa]
          [bt] (5) /opt/catkin_ws/devel/lib/recognition/debug_pose_model(tvm::runtime::TVMRetValue tvm::runtime::PackedFunc::operator()<>() const+0x96) [0x55f62b2c220c]
          [bt] (6) /opt/catkin_ws/devel/lib/recognition/debug_pose_model(PoseFromConfig::forward_full(cv::Mat, float)+0x94f) [0x55f62b2aa9f7]
          [bt] (7) /opt/catkin_ws/devel/lib/recognition/debug_pose_model(TVMPoseNode::callback(boost::shared_ptr<sensor_msgs::Image_<std::allocator<void> > const> const&, boost::shared_ptr<sensor_msgs::Image_<std::allocator<void> > const> const&, boost::shared_ptr<pcl::PointCloud<pcl::PointXYZRGB> const> const&, nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >)+0x1344) [0x55f62b2b3e7e]
          [bt] (8) /opt/catkin_ws/devel/lib/recognition/debug_pose_model(boost::_mfi::mf4<void, TVMPoseNode, boost::shared_ptr<sensor_msgs::Image_<std::a
    

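For reference, this is a minimal sketch of how I understand Thrust support is enabled in TVM’s `config.cmake` before rebuilding the runtime (assuming a CUDA-enabled build; flag names may differ across TVM versions):

```cmake
# Enable CUDA code generation (required before Thrust can be used)
set(USE_CUDA ON)
# Use Thrust for sort/scan kernels, e.g. the radix sort invoked by NMS
set(USE_THRUST ON)
```

This is the build configuration I believe is relevant to the radix_sort path above, in case something in that step matters.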
If anyone has experience tuning models with baked-in non-max suppression, like the GluonCV YOLO models, I’d love any help you could provide in getting past these 1660 errors.

Thanks

-Matt