Seg fault when reusing TVM-generated assembly code: TVM built with OpenMP option 'intel'

Hello! I'm trying to reuse the assembly code generated by TVM from my own C++ code, and I get a segmentation fault. I used 'intel' as the OpenMP option when building TVM. The commands I use to build my own code are:

as kernel.asm -o kernel.o
g++ kernel.o -shared -fPIC -m64 -o kernel.so
g++ -std=c++11 -O2 -fPIC \
    -I/home/moderato/Documents/incubator-tvm/include \
    -I/home/moderato/Documents/incubator-tvm/3rdparty/dmlc-core/include \
    -I/home/moderato/Documents/incubator-tvm/3rdparty/dlpack/include \
    -L/home/moderato/Documents/incubator-tvm/build \
    -L/usr/local/lib/ -liomp5 \
    cpu_bench.cpp \
    -o cpu_bench \
    -ldl -pthread -lcnpy -lz -ltvm_runtime

When I traced the error with gdb, I got output like this:

#0  0x00007fb27c628293 in ?? () from kernel.so
#1  0x00007fb299d3a7e2 in TVMBackendParallelLaunch._omp_fn.0 () from /home/moderato/Documents/incubator-tvm/build/libtvm_runtime.so
#2  0x00007fb27d85d638 in __kmp_api_GOMP_parallel_40_alias () from /usr/local/lib/libiomp5.so
#3  0x00007fb299d3a935 in TVMBackendParallelLaunch () from /home/moderato/Documents/incubator-tvm/build/libtvm_runtime.so
#4  0x00007fb27c627e71 in ?? () from kernel.so
#5  0x00007fb299d1fea0 in tvm::runtime::WrapPackedFunc(int (*)(void*, int*, int), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const [clone .isra.88] () from /home/moderato/Documents/incubator-tvm/build/libtvm_runtime.so
#6  0x0000000000402762 in benchmark_generated_cpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int, int, int, int, int, bool, int, int, int, int, bool, int, bool, bool) ()
#7  0x00000000004032cd in benchmarkWithWorkloadString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) ()
#8  0x0000000000401c3e in main ()

It looks like the error occurs when TVMBackendParallelLaunch tries to launch some function, but I have no idea how to go one step deeper. Can anyone help? Did I miss anything when compiling and linking the code?

Thanks in advance!

@haichen Can you take a look at this question? Thanks!

Have you tried this with TVM’s own thread pool instead of OpenMP?

Yes, that's what I originally tried (both TVM's thread pool and the 'gnu' option). However, the CPU kernel doesn't have a stable execution time: in one of my test cases it's sometimes ~400us and sometimes ~700us, even after I set up the env vars following this common practice:

export KMP_BLOCKTIME=1
export KMP_AFFINITY=verbose,granularity=fine,compact,1,0
export OMP_NUM_THREADS=4
export TVM_NUM_THREADS=4

So I simply switched to 'intel' for OpenMP, and with the settings above the execution time stabilizes at ~400us. But then I hit this bug. Any ideas? Did I miss anything here?

I don't have a good idea. Maybe you can compile TVM in Debug mode; that might provide more stack information.

About the thread pool:

  1. Have you tried the latest TVM? Previously, we fixed an issue with CPU affinity in the thread pool.

If option 1 doesn't work:

  2. Could you also try export TVM_BIND_THREADS=4?

I just updated to the latest TVM, and the thread pool runs quite well, with a slightly better execution time than either "gnu" or "intel". My current env var settings are:

export KMP_BLOCKTIME=1
export KMP_AFFINITY=verbose,granularity=fine,compact,1,0
export OMP_NUM_THREADS=4
export TVM_BIND_THREADS=4

In one of my test cases, the execution time with the TVM thread pool is ~250us, while "gnu" and "intel" are both ~260us. The runtime fluctuation is also somewhat relieved now that I measure the "delta" execution time instead: run the kernel for N iterations to get T1, and for 2N iterations to get T2, so that the actual per-iteration time is (T2 - T1) / N, which is less likely to be affected by any warm-up effects. The following link claims that warm-up effects exist for the AVX instruction set, although this isn't officially documented:

https://www.agner.org/optimize/blog/read.php?i=165

(It’s under the title “AVX instruction set” on the page.)
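
To be concrete, the delta measurement is just the sketch below (a minimal illustration, not my actual benchmark code; kernel stands for the packed function call):

#include <chrono>
#include <functional>

// Total wall time, in seconds, for `iters` back-to-back invocations of `run`.
double TimeTotal(const std::function<void()>& run, int iters) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) run();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}

// Warm-up cost shows up in both T1 and T2, so it cancels in the difference:
//   double t1 = TimeTotal(kernel, N);
//   double t2 = TimeTotal(kernel, 2 * N);
//   double per_call = (t2 - t1) / N;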

This brings up a few questions:

  1. Does the current env var setting look good to you? Some other posts mention the variable TVM_NUM_THREADS. Does it have the same effect as TVM_BIND_THREADS?

  2. What's the best practice for AutoTVM with multithreading? Currently I set the above env vars before starting the RPC tracker and RPC server, and I also put numactl -l -C 0-3 in front of the tracker, server, and AutoTVM commands, e.g. numactl -l -C 0-3 python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190. Does this look like best practice?

  3. Back to the question I initially asked: I still get the same bug when not using 'intel' for multithreading, though the gdb error messages are a bit different:

#0  0x00007f320c45216b in ?? () from kernel.so
#1  0x00007f322b2a545e in tvm::runtime::ThreadPool::Launch(int (*)(int, TVMParallelGroupEnv*, void*), void*, int, int) ()
   from /home/moderato/Documents/incubator-tvm/build/libtvm_runtime.so
#2  0x00007f322b2a2d53 in TVMBackendParallelLaunch () from /home/moderato/Documents/incubator-tvm/build/libtvm_runtime.so
#3  0x00007f320c451e71 in ?? () from kernel.so
#4  0x00007f322b280aef in tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const [clone .isra.116] ()
   from /home/moderato/Documents/incubator-tvm/build/libtvm_runtime.so
#5  0x00000000004030fa in benchmark_generated_cpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int, int, int, int, int, bool, int, int, int, int, bool, int, bool, bool) ()
#6  0x0000000000403f1d in benchmarkWithWorkloadString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) ()
#7  0x0000000000401cce in main ()

Any idea how to dig deeper and pinpoint the problem?

Thank you for your time reading through such a looong post!

Does the current env var setting look good to you? Some other posts mention the variable TVM_NUM_THREADS. Does it have the same effect as TVM_BIND_THREADS?

Could you try removing TVM_BIND_THREADS and observing the performance again? It does not have the same effect as TVM_NUM_THREADS. If TVM_BIND_THREADS is not set, or is set to something other than 1, we never enter the SetAffinity function. In fact, the fix I mentioned in my previous post was to the SetAffinity logic, and it is in the latest TVM master.
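
Roughly speaking, the rule above can be paraphrased as the check below (just a paraphrase for clarity, not the actual TVM source):

#include <cstdlib>
#include <cstring>

// Paraphrase of the rule described above, not the actual TVM code:
// SetAffinity is entered only when TVM_BIND_THREADS is set to exactly "1";
// the worker count itself is controlled by TVM_NUM_THREADS.
bool ShouldEnterSetAffinity() {
  const char* bind = std::getenv("TVM_BIND_THREADS");
  return bind != nullptr && std::strcmp(bind, "1") == 0;
}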

What's the best practice for AutoTVM with multithreading? Currently I set the above env vars before starting the RPC tracker and RPC server, and I also put numactl -l -C 0-3 in front of the tracker, server, and AutoTVM commands, e.g. numactl -l -C 0-3 python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190. Does this look like best practice?

In fact, I just use python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190 and don't observe any issue with it. I think you can use this command directly if you are on the latest TVM master, which should fix the affinity issue between child and parent processes when using AutoTVM.

Back to the question I initially asked: I still get the same bug when not using 'intel' for multithreading, though the gdb error messages are a bit different:

From the stack trace we can't glean much beyond the fact that the crash happened inside TVMBackendParallelLaunch. I would encourage you to use the TVM thread pool unless you have other reasons to use OpenMP (for example, you need to interact with other components that already use OpenMP).

Removing TVM_BIND_THREADS results in a very slight perf drop (~10us in my ~260us case). I don't see any big difference whether or not I set affinity as I did before. I suppose TVM works quite well here.

OK, I'm fine with your suggestion given the preliminary result above. Previously my major concern was that the multithreaded execution time was not stable, but that no longer seems to be a problem with the latest TVM master.

Sure, I’ll try other ways to debug.

Thank you very much for your help on this topic!

Problem solved. The seg fault was actually caused by the way I passed input data from cnpy objects to DLTensors. Using memcpy to copy the data into the DLTensor buffers solves the problem.
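
For anyone who hits the same issue, here is a minimal sketch of the fix (the file name and shape are placeholders): copy the npy payload into the buffer owned by the DLTensor.

#include <cstring>
#include <cnpy.h>
#include <tvm/runtime/c_runtime_api.h>

int main() {
  // Placeholder file and shape; adjust to the real workload.
  cnpy::NpyArray arr = cnpy::npy_load("input.npy");

  tvm_index_t shape[4] = {1, 64, 56, 56};
  DLTensor* input = nullptr;
  TVMArrayAlloc(shape, 4, kDLFloat, 32, 1, kDLCPU, 0, &input);

  // Copy the data into the DLTensor's own storage with memcpy.
  std::size_t num_elems = 1;
  for (std::size_t d : arr.shape) num_elems *= d;
  std::memcpy(input->data, arr.data<float>(), num_elems * sizeof(float));

  // ... pass input to the packed function as before ...
  TVMArrayFree(input);
  return 0;
}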