[SOLVED] NMS very slow on CUDA

I’m currently benchmarking a model that uses NMS with the CUDA target. After auto-tuning, 98% of inference time is spent in the NMS operator (1.8s on an NVIDIA 1050 Ti). Does TVM use an optimized CUDA implementation of NMS?

Set USE_THRUST when building TVM. This enables CUDA Thrust, which hugely accelerates NMS.


I didn’t know about this, I’ll give it a try, thanks!

@kevinthesun it worked, thank you! Now the same NMS call takes around 2ms.

To make Thrust work, I had to modify a few things though. I’ll document the process here in case someone else faces a similar issue:

  • To build with Thrust, I had to upgrade CMake to >= 3.13, and this seems to break CUDA in TVM.
  • Whenever I use CMake 3.13 or newer, with or without USE_THRUST, I get the following error when trying to run inference on a previously compiled model:
CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: unknown error

To fix this error, you need access to the NVIDIA driver, which I didn’t have from within the Docker container I was using. Once I ran inference from an environment that has access to the driver (you can check whether the nvidia-smi command works), inference worked fine.
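For anyone who wants to script that driver check before running inference, here is a minimal sketch in Python. It only tests whether nvidia-smi is on the PATH and exits successfully; the helper name driver_is_visible is my own and is not part of TVM, and a passing check is a hint rather than proof that TVM's CUDA runtime will work.

```python
import shutil
import subprocess


def driver_is_visible(command="nvidia-smi", timeout=10):
    """Return True if `command` is on PATH and exits with status 0.

    Hypothetical helper (not a TVM API): a successful nvidia-smi run
    suggests the NVIDIA driver is reachable from this environment.
    """
    path = shutil.which(command)
    if path is None:
        # Command not installed or not exposed inside the container.
        return False
    try:
        result = subprocess.run(
            [path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Could not launch the command, or it hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if driver_is_visible():
        print("nvidia-smi works: the driver looks reachable")
    else:
        print("nvidia-smi failed: check GPU/driver access in this environment")
```

If this reports a failure inside a container, the container was probably started without GPU access.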

  • Or I get the following error when trying to compile a model using Relay:
ValueError: arch(sm_xy) is not passed, and we cannot detect it from env

To fix this error, you can refer to [SOLVED] Compile error related to autotvm.
