[AutoTVM] XGBTuner Segmentation Fault

jwfromm · July 16, 2019, 7:24pm

I’ve been autotuning with the XGBTuner and previously had no problems. After a rebase, however, I consistently hit a segfault after 60 iterations of the first task. Other tuners, such as the GATuner, do not have this problem. The full backtrace of the fault is:

[Task  1/42]  Current/Best:   15.97/ 145.79 GFLOPS | Progress: (60/200) | 88.46 s
[Thread 0x7ffec0ffd700 (LWP 11925) exited]
[Thread 0x7ffe557fa700 (LWP 11926) exited]
[Thread 0x7ffe55ffb700 (LWP 11927) exited]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff9179ad32 in simplequeue_dealloc () at /build/python3.7-4r_osw/python3.7-3.7.4/Modules/_queuemodule.c:29
29	/build/python3.7-4r_osw/python3.7-3.7.4/Modules/_queuemodule.c: No such file or directory.
(gdb) bt
#0  0x00007fff9179ad32 in simplequeue_dealloc () at /build/python3.7-4r_osw/python3.7-3.7.4/Modules/_queuemodule.c:29
#1  0x000000000050da43 in ?? ()
#2  0x000000000052d4d7 in ?? ()
#3  0x0000000000610511 in ?? ()
#4  0x000000000050c43d in _PyObjectDict_SetItem ()
#5  0x000000000061348a in _PyObject_GenericSetAttrWithDict ()
#6  0x0000000000519822 in PyObject_SetAttr ()
#7  0x000000000055ba1a in _PyEval_EvalFrameDefault ()
#8  0x00000000004e9eda in _PyFunction_FastCallKeywords ()
#9  0x000000000055b330 in _PyEval_EvalFrameDefault ()
#10 0x00000000004e9eda in _PyFunction_FastCallKeywords ()
#11 0x000000000055b330 in _PyEval_EvalFrameDefault ()
#12 0x00000000004e9eda in _PyFunction_FastCallKeywords ()
#13 0x000000000055b330 in _PyEval_EvalFrameDefault ()
#14 0x000000000055a4a8 in _PyEval_EvalCodeWithName ()
#15 0x00000000004e9fb5 in _PyFunction_FastCallKeywords ()
#16 0x000000000055b330 in _PyEval_EvalFrameDefault ()
#17 0x000000000055a4a8 in _PyEval_EvalCodeWithName ()
#18 0x00000000004eab44 in _PyObject_Call_Prepend ()
#19 0x00000000004eadae in PyObject_Call ()
#20 0x000000000055c8b8 in _PyEval_EvalFrameDefault ()
#21 0x000000000055a4a8 in _PyEval_EvalCodeWithName ()
#22 0x00000000004e9fb5 in _PyFunction_FastCallKeywords ()
#23 0x000000000055bf52 in _PyEval_EvalFrameDefault ()
#24 0x000000000055a140 in _PyEval_EvalCodeWithName ()
#25 0x0000000000429457 in _PyFunction_FastCallDict ()
#26 0x000000000055c8b8 in _PyEval_EvalFrameDefault ()
#27 0x00000000004e9eda in _PyFunction_FastCallKeywords ()
#28 0x000000000055b0af in _PyEval_EvalFrameDefault ()
#29 0x000000000055a140 in _PyEval_EvalCodeWithName ()
#30 0x0000000000559ec3 in PyEval_EvalCode ()
#31 0x000000000062a7f2 in ?? ()
#32 0x000000000062ac5a in PyRun_FileExFlags ()
#33 0x000000000062aa17 in PyRun_SimpleFileExFlags ()
#34 0x00000000006040b5 in ?? ()
#35 0x0000000000603d3a in _Py_UnixMain ()
#36 0x00007ffff6cbd830 in __libc_start_main (main=0x4e4f30 <main>, argc=2, argv=0x7fffffffd618, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffd608) at ../csu/libc-start.c:291
#37 0x0000000000603c29 in _start ()

Does anyone know why this is happening and how to resolve it?

Wheest · March 16, 2020, 7:13pm

I have encountered a similar issue.

I’m running the HEAD (c6f8c23c3) of v0.6.

I was trying to do some autotuning, and found that one of my scripts that had worked previously was segfaulting.

No obvious debug info, just:

Tuning...                                                                                      
/tmp/[redacted].log.tmp                                                                          
/tmp/[redacted].log                                                                              
[Task  1/15]  Current/Best:   23.87/  23.87 GFLOPS | Progress: (1/1) | 3.27 sSegmentation fault

This is the only information that I got.
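
When a segfault leaves no Python-level information like this, the standard-library `faulthandler` module can at least show which Python frames were active when the signal arrived. Below is a minimal sketch, assuming it is called at the very top of the tuning script; the helper name `enable_crash_traceback` and the log path are hypothetical and not part of TVM.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Dump Python tracebacks for all threads on SIGSEGV and similar signals.

    Hypothetical helper: with no log_path the dump goes to stderr;
    otherwise it goes to the given file. The file object is returned
    because it must stay open for as long as faulthandler is enabled.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback()` before tuning starts, or running the script with `python -X faulthandler`, should print the Python stack at the point of the crash. It does not fix anything, but it can narrow down which Python call triggered the fault.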

I recalled that a few days ago I had recompiled with the CMake flag USE_GRAPH_RUNTIME_DEBUG set to True. I tried a fresh compile without it.

No change, despite it being the same config that worked before.

Then I found this post.

I switched from the XGBTuner to the GATuner, and was able to run.
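
For anyone wanting to try the same workaround, switching tuners can be kept to a one-word change by building the tuner from a name. This is a rough sketch, assuming the `autotvm.tuner` classes and constructor arguments used in the AutoTVM tutorials of that era; `make_tuner` is a hypothetical helper, and `task` is assumed to be one of the tasks already extracted by the tuning script.

```python
def make_tuner(name, task):
    """Build an AutoTVM tuner by name (hypothetical helper, not a TVM API)."""
    known = ("xgb", "ga", "random", "gridsearch")
    if name not in known:
        raise ValueError("unknown tuner %r, expected one of %s" % (name, known))

    # Imported lazily so the name check works even without TVM installed.
    from tvm import autotvm

    if name == "xgb":
        # The XGBoost-backed tuner that segfaults in this thread.
        return autotvm.tuner.XGBTuner(task, loss_type="rank")
    if name == "ga":
        # Genetic-algorithm tuner; it does not use XGBoost.
        return autotvm.tuner.GATuner(task, pop_size=50)
    if name == "random":
        return autotvm.tuner.RandomTuner(task)
    return autotvm.tuner.GridSearchTuner(task)
```

With this in place, `make_tuner("ga", task)` replaces the `XGBTuner(...)` construction. The non-XGBoost tuners do not train a cost model, so they may need more trials to reach comparable results.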

Unsure what other solutions to try. Right now, perhaps I’ll try using another system as my host machine to see if I can reproduce.

Unsure if one of my system packages has changed, or something else…

jmorrill · March 16, 2020, 7:29pm

What version of XGBoost are you using? IIRC I had problems with v1. See if v0.9 works.

Wheest · March 17, 2020, 10:28am

Thanks, though it seems I’m on v0.9 as well:

>>> import xgboost
>>> print(xgboost.__version__)
0.90

I have got it running on another machine (so, alas, not my main workhorse), with the same TVM commit and config.cmake. That system has xgboost==1.0.2. I should really learn how to use Nix…
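
Since the same commit works on one machine and not on another, it may help to compare installed package versions side by side. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8) and never imports the packages themselves, so it cannot trigger the crash. The helper names and the package list are assumptions; adjust the list to whatever the tuning script depends on.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def snapshot(packages):
    """Map each package name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


def format_snapshot(versions):
    """Render a snapshot as sorted 'name==version' lines for easy diffing."""
    lines = []
    for name in sorted(versions):
        lines.append("%s==%s" % (name, versions[name] or "not installed"))
    return "\n".join(lines)
```

Running `print(format_snapshot(snapshot(["xgboost", "numpy", "scipy"])))` on both machines and diffing the output gives a quick view of what differs. A TVM build installed from source may not show up under a distribution name, so it is left out of the example list.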