AutoTVM task extraction from a BERT model causes a segmentation fault

Hi, I'm trying to tune the matmul ops in a BERT model, but I hit a segmentation fault when I run the following line:

tasks = autotvm.task.extract_from_program(sym["main"], target=target, params=params, ops=(relay.op.nn.batch_matmul,))

Running it under gdb gives the following backtrace:

#0  0x00007fffe73872cf in tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#1  0x00007fffe751e2ab in tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#2  0x00007fffe74755ea in tvm::relay::WellFormedChecker::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#3  0x00007fffe751aee0 in tvm::relay::ExprVisitor::VisitExpr_(tvm::relay::CallNode const*) () from /media/disk/DL/tvm/build/libtvm.so
#4  0x00007fffe73872a2 in tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#5  0x00007fffe751e2ab in tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#6  0x00007fffe747654e in tvm::relay::WellFormedChecker::VisitExpr_(tvm::relay::FunctionNode const*) () from /media/disk/DL/tvm/build/libtvm.so
#7  0x00007fffe73872a2 in tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#8  0x00007fffe751e2ab in tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#9  0x00007fffe74755ea in tvm::relay::WellFormedChecker::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#10 0x00007fffe751aee0 in tvm::relay::ExprVisitor::VisitExpr_(tvm::relay::CallNode const*) () from /media/disk/DL/tvm/build/libtvm.so
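For reference, here is a minimal sketch of how I invoke the extraction. The ONNX import path, file name, and input shapes below are placeholders for what my actual script uses:

```python
import onnx
import tvm
from tvm import relay, autotvm

# Placeholder model file and input shape; substitute your own.
onnx_model = onnx.load("bert.onnx")
shape_dict = {"input_ids": (1, 128)}

target = tvm.target.create("llvm -mcpu=core-avx2")
sym, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# This is the call that segfaults.
tasks = autotvm.task.extract_from_program(
    sym["main"],
    target=target,
    params=params,
    ops=(relay.op.nn.batch_matmul,),
)
```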

I also see that the repo https://github.com/icemelon9/bert-benchmark provides AutoTVM logs for matmul on AVX2, so even though the author didn't provide the tuning code, tuning BERT should be possible.

@boood15 Did you solve the problem? I ran into it too.

In my case it's InceptionV3, though: it worked on one machine (my Mac notebook) but errors out on the Linux server.

I suspect my Linux server environment is the problem, but I haven't found what it is yet.

The log you posted doesn't show the root cause. One known issue that causes a seg fault when extracting tasks from a model is stack overflow (ref: Stack overflow in task extraction). You may want to check for that first.

I never made a PR for it because I assumed it was a Windows-only problem, since others on Linux couldn't repro it.

Here is my change, lines 162-166, that fixed the stack overflow problem.

If this fixes the poster’s issue, I’ll gladly throw in a PR.

We actually observed the same problem on Linux platforms, but changing the stack size did not seem to work, at least on our side. Our workaround was to fall back to the graph runtime for task extraction.

@jmorrill @comaniac I changed the stack size, but it doesn't fix the problem on Linux.

Using ipdb, the segmentation fault occurs at build_thread.start() in python/tvm/autotvm/task/relay_integration.py. When I step into the function with 's', it gets stuck there and just keeps waiting…

How do I make sure it's a stack overflow? The referenced thread (Stack overflow in task extraction) reports the error 'OSError: exception: stack overflow', but that never happens for me.

I don't remember seeing any message other than the segmentation fault itself when a stack overflow happens on Linux. I think your case is a stack overflow because the partial stack dump you posted is all recursive calls.

cc @haichen

I think so too; see the 'bt' output from gdb above.

Is there a solution to this problem?

---

For the time being I've rolled back to version 0.6, which doesn't use the Relay VM, and it's working…

Actually, I imported the model from ONNX before; after running onnx-simplifier, which reduced the number of layers in BERT, the problem seems solved. Another benefit is that the Relay import speed also improved.
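In case it helps others, a minimal sketch of the onnx-simplifier step I mean (file names are placeholders):

```python
import onnx
from onnxsim import simplify

model = onnx.load("bert.onnx")  # placeholder path

# simplify() folds constants and strips redundant nodes; `ok` reports
# whether the simplified model still matches the original on random
# test inputs.
model_simplified, ok = simplify(model)
assert ok, "simplified model does not match the original"

onnx.save(model_simplified, "bert_simplified.onnx")
```

A smaller expression tree from the simplified graph also means less recursion during task extraction, which fits the stack overflow theory above.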

I'll try converting the model from TensorFlow to ONNX later and running onnx-simplifier on it. I don't know whether the number of layers can be reduced for InceptionV3, though.

Is there a simplifier like onnx-simplifier for TensorFlow?

I ran into the same problem, changed the threading stack_size from (1024*1024*3) to (1024*1024*128), and then it worked. Perhaps you can try a bigger stack_size.
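For anyone else trying this, a sketch of the workaround as a standalone helper (`_build_in_thread` is a hypothetical name; in TVM itself the change goes where relay_integration.py spawns its build thread):

```python
import threading

def _build_in_thread(fn, *args):
    """Run fn in a worker thread with an enlarged stack.

    Task extraction recurses deeply over the Relay expression tree, so
    give the thread a 128 MB stack (the value that worked for me)
    instead of the 3 MB used before. threading.stack_size() must be
    called before the Thread object is created.
    """
    threading.stack_size(1024 * 1024 * 128)
    build_thread = threading.Thread(target=fn, args=args)
    build_thread.start()
    build_thread.join()
```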

Is your env Windows? Mine is Linux, and it doesn't work there. I can use ulimit to work around the problem on some machines, but not on others.
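For the ulimit route, the in-process equivalent looks like this. Note that on glibc the default stack size of new threads is derived from this limit, but an explicit threading.stack_size() call takes precedence, which may be why it helps on some machines and not others:

```python
import resource

# Equivalent of raising `ulimit -s`: lift the soft stack limit to the
# hard limit for this process before running task extraction.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
```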

I use Ubuntu 16.04, which is Linux.

I tried changing stack_size to a bigger value and it worked really well.

The PR under review should fix the stack overflow problem for models without control flow: https://github.com/apache/incubator-tvm/pull/5019