[Solved] Incorrect libcuda.so when multiple versions of CUDA exist -- Problem and Solution

junrushao · October 7, 2018, 12:45am

TL;DR The current FindCUDA.cmake may find the wrong libcuda.so when multiple CUDA versions are installed, even if USE_CUDA is set properly. This can cause the TVM build to fail.

The issue.

# There is a default libcuda under `/usr/lib64/`
$ ll /usr/lib64/ | grep libcuda.so
lrwxrwxrwx  1 root root       12 Apr 17 15:21 libcuda.so -> libcuda.so.1
lrwxrwxrwx  1 root root       17 Apr 17 15:21 libcuda.so.1 -> libcuda.so.390.48
-rwxr-xr-x  1 root root 10033592 Apr 17 15:21 libcuda.so.390.48

# Another libcuda.so under `/opt/cuda/9.0`
$ ll /opt/cuda/9.0/lib64/stubs/ | grep libcuda.so
-rwxr-xr-x 1 root root  42176 Feb  6  2018 libcuda.so

# `USE_CUDA` is set properly
$ cat config.cmake | grep USE_CUDA
set(USE_CUDA /opt/cuda/9.0/)

# And `LD_LIBRARY_PATH` and ldconfig are correct
$ ldconfig -p | grep libcuda.so
        libcuda.so.1 (libc6,x86-64) => /opt/cuda/9.0/lib64/stubs/libcuda.so.1

$ echo $LD_LIBRARY_PATH
/opt/cuda/9.0/lib64/stubs/:/opt/cuda/9.0/lib64

In this case, if we print out all CUDA-related variables in FindCUDA.cmake, we get

-- Custom CUDA_PATH=/opt/cuda/9.0/
-- CUDA_FOUND=TRUE
-- CUDA_INCLUDE_DIRS=/opt/cuda/9.0//include
-- CUDA_TOOLKIT_ROOT_DIR=/opt/cuda/9.0/
-- CUDA_CUDA_LIBRARY=/usr/lib64/libcuda.so ######### Incorrect #########
-- CUDA_CUDART_LIBRARY=/opt/cuda/9.0/lib64/libcudart.so
-- CUDA_NVRTC_LIBRARY=/opt/cuda/9.0/lib64/libnvrtc.so
-- CUDA_CUDNN_LIBRARY=CUDA_CUDNN_LIBRARY-NOTFOUND
-- CUDA_CUBLAS_LIBRARY=/opt/cuda/9.0/lib64/libcublas.so

It causes a link-time error when building TVM:

/my/own/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
make[2]: *** [libtvm_runtime.so] Error 1
make[1]: *** [CMakeFiles/tvm_runtime.dir/all] Error 2

Cause of the issue.
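find_library searches CMake's default system locations, which include /usr/lib64, before the directories listed after PATHS. As a result, /usr/lib64/libcuda.so is picked up first and the customized path is ignored in the following cmake command: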

find_library(_CUDA_CUDA_LIBRARY cuda
  PATHS ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs)

Proposed solution.

When a customized USE_CUDA is provided, I suggest not including the default paths in the search path (and adding lib64/stubs to the suffixes so the stub library can be found), which should look like:

find_library(_CUDA_CUDA_LIBRARY cuda
  PATHS ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES lib lib64 lib64/stubs targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs
  NO_DEFAULT_PATH)  ###### disable default path here  ######
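
If falling back to the system libcuda.so is still wanted when nothing is found under the custom path, one option is the two-pass pattern below. This is only a sketch of a possible variant, not necessarily what TVM does. It assumes CUDA_TOOLKIT_ROOT_DIR has already been set from USE_CUDA, and relies on find_library skipping the search when the result variable already holds a found path.

```cmake
# Sketch only: CUDA_TOOLKIT_ROOT_DIR is assumed to be set from USE_CUDA earlier.
# Pass 1: look only under the user-supplied toolkit root.
find_library(_CUDA_CUDA_LIBRARY cuda
  PATHS ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES lib lib64 lib64/stubs targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs
  NO_DEFAULT_PATH)

# Pass 2: performs a real search only if pass 1 left the variable as *-NOTFOUND;
# otherwise the result from pass 1 is kept and this call does nothing.
find_library(_CUDA_CUDA_LIBRARY cuda)

# A *-NOTFOUND value evaluates to false, so this fires only if both passes failed.
if(NOT _CUDA_CUDA_LIBRARY)
  message(WARNING "libcuda not found under ${CUDA_TOOLKIT_ROOT_DIR} or the default paths")
endif()
```

With this ordering the custom toolkit always wins when it contains libcuda.so, and the system copy is used only as a last resort.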

Discussion.
Many university clusters have multiple CUDA versions installed, so this may come up from time to time, though I don't think it is a big deal.

Also, why not print out all the paths found by the FindXXXX.cmake modules? For example, when running cmake .. to generate the Makefile, it could print the information below so that users know which dependencies are being used.

# For FindCUDA.cmake
- CUDA_FOUND=
- CUDA_INCLUDE_DIRS=
- CUDA_TOOLKIT_ROOT_DIR=
- CUDA_CUDA_LIBRARY=
- CUDA_CUDART_LIBRARY=
- CUDA_NVRTC_LIBRARY=
- CUDA_CUDNN_LIBRARY=
- CUDA_CUBLAS_LIBRARY=

# For FindLLVM.cmake
- LLVM_INCLUDE_DIRS=
- LLVM_LIBS=
- LLVM_DEFINITIONS=
- TVM_LLVM_VERSION=
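
A minimal sketch of how such a listing could be produced is shown below. The helper name print_found_vars is hypothetical and not part of TVM; the variable names are the ones listed above and are assumed to have been set by the corresponding Find modules before the helper is called.

```cmake
# Hypothetical helper, not part of TVM: print each named variable as NAME=value.
function(print_found_vars)
  foreach(var IN LISTS ARGN)
    # ${${var}} expands to the value of the variable whose name is stored in var.
    message(STATUS "${var}=${${var}}")
  endforeach()
endfunction()

# Call after the Find modules have run, e.g. for FindCUDA.cmake:
print_found_vars(
  CUDA_FOUND
  CUDA_INCLUDE_DIRS
  CUDA_TOOLKIT_ROOT_DIR
  CUDA_CUDA_LIBRARY
  CUDA_CUDART_LIBRARY
  CUDA_NVRTC_LIBRARY
  CUDA_CUDNN_LIBRARY
  CUDA_CUBLAS_LIBRARY)
```

message(STATUS ...) lines appear in the normal cmake .. output with a leading "-- ", which matches the format of the variable dump earlier in this post.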

junrushao · October 7, 2018, 12:49am

This issue is solved by #1788.

Another issue is that cuDNN is currently assumed to be inside ${CUDA_HOME}, which is not always true. Many university clusters have several versions of cuDNN installed, and we should be able to choose which version to link against, either directly in the configuration or through environment variables.

This is not a big deal, because for now it is easy to work around in the CMake files. I am going to send a PR to fix this when I have time, probably next month.
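
As a rough illustration of the cuDNN point, a local workaround might look like the sketch below. The names USE_CUDNN_PATH (a CMake variable) and CUDNN_HOME (an environment variable) are hypothetical and chosen only for this example; they are not existing TVM options.

```cmake
# Sketch only: USE_CUDNN_PATH and the CUDNN_HOME environment variable are
# hypothetical names chosen for illustration, not existing TVM options.
# Pass 1: look in an explicitly chosen cuDNN location, if one was given.
find_library(CUDA_CUDNN_LIBRARY cudnn
  PATHS ${USE_CUDNN_PATH} ENV CUDNN_HOME
  PATH_SUFFIXES lib lib64
  NO_DEFAULT_PATH)

# Pass 2: fall back to the CUDA toolkit directory when pass 1 found nothing.
find_library(CUDA_CUDNN_LIBRARY cudnn
  PATHS ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES lib lib64
  NO_DEFAULT_PATH)

message(STATUS "CUDA_CUDNN_LIBRARY=${CUDA_CUDNN_LIBRARY}")
```

Because both calls use NO_DEFAULT_PATH, a cuDNN copy in a system directory is never picked up by accident; the library comes either from the chosen location or from the toolkit directory.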