AutoTVM task extraction from a BERT model causes a segmentation fault

Hi, I'm trying to tune the matmul ops in a BERT model, but I hit a segmentation fault when I run the following line:

tasks = autotvm.task.extract_from_program(sym["main"], target=target, params=params, ops=(relay.op.nn.batch_matmul,))

Running it under gdb gives the following backtrace:

#0  0x00007fffe73872cf in tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#1  0x00007fffe751e2ab in tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#2  0x00007fffe74755ea in tvm::relay::WellFormedChecker::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#3  0x00007fffe751aee0 in tvm::relay::ExprVisitor::VisitExpr_(tvm::relay::CallNode const*) () from /media/disk/DL/tvm/build/libtvm.so
#4  0x00007fffe73872a2 in tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#5  0x00007fffe751e2ab in tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#6  0x00007fffe747654e in tvm::relay::WellFormedChecker::VisitExpr_(tvm::relay::FunctionNode const*) () from /media/disk/DL/tvm/build/libtvm.so
#7  0x00007fffe73872a2 in tvm::relay::ExprFunctor<void (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#8  0x00007fffe751e2ab in tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#9  0x00007fffe74755ea in tvm::relay::WellFormedChecker::VisitExpr(tvm::RelayExpr const&) () from /media/disk/DL/tvm/build/libtvm.so
#10 0x00007fffe751aee0 in tvm::relay::ExprVisitor::VisitExpr_(tvm::relay::CallNode const*) () from /media/disk/DL/tvm/build/libtvm.so
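For reference, here is a minimal sketch of how I invoke the extraction. The ONNX import path, file name, and input shapes below are placeholders for what my actual script uses:

```python
import onnx
import tvm
from tvm import relay, autotvm

# Placeholder model file and input shape; substitute your own.
onnx_model = onnx.load("bert.onnx")
shape_dict = {"input_ids": (1, 128)}

target = tvm.target.create("llvm -mcpu=core-avx2")
sym, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# This is the call that segfaults.
tasks = autotvm.task.extract_from_program(
    sym["main"],
    target=target,
    params=params,
    ops=(relay.op.nn.batch_matmul,),
)
```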

I also see that the repo https://github.com/icemelon9/bert-benchmark provides AutoTVM logs for matmul on AVX2, so even though the author didn't provide the tuning code, tuning BERT should be possible.

@boood15 Did you solve the problem? I ran into it too.

In my case it's InceptionV3, though: it worked on one machine (my Mac notebook) but errors out on the Linux server.

I suspect my Linux server environment is the problem, but I haven't found what it is yet.

The log you posted doesn't show the root cause. One known issue that causes a seg fault when extracting tasks from a model is stack overflow (ref: Stack overflow in task extraction). You may want to check for that first.

I never made a PR for it because I assumed it was a Windows-only problem, since others on Linux couldn't repro it.

Here is my change, lines 162-166, that fixed the stack overflow problem.

If this fixes the poster’s issue, I’ll gladly throw in a PR.

We actually observed the same problem on Linux platforms, but changing the stack size did not seem to work, at least on our side. Our workaround was to fall back to the graph runtime for task extraction.

@jmorrill @comaniac I changed the stack size, but it doesn't fix the problem on Linux.

Using ipdb, the segmentation fault occurs at build_thread.start() in python/tvm/autotvm/task/relay_integration.py. When I step into the function with 's', it gets stuck there and just keeps waiting…

How do I make sure it's a stack overflow? The referenced thread (Stack overflow in task extraction) reports the error 'OSError: exception: stack overflow', but that never happens for me.

I don't remember seeing any message other than the segmentation fault itself when a stack overflow happens on Linux. I think your case is a stack overflow because the partial stack dump you posted is all recursive calls.

cc @haichen

I think so too; see the 'bt' output from gdb above.

Is there a solution to this problem?

---

For the time being I've rolled back to version 0.6, which doesn't use the Relay VM, and it's working…

Actually, I imported the model from ONNX before; after running onnx-simplifier, which reduced the number of layers in BERT, the problem seems solved. Another benefit is that the Relay import speed also improved.
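In case it helps others, a minimal sketch of the onnx-simplifier step I mean (file names are placeholders):

```python
import onnx
from onnxsim import simplify

model = onnx.load("bert.onnx")  # placeholder path

# simplify() folds constants and strips redundant nodes; `ok` reports
# whether the simplified model still matches the original on random
# test inputs.
model_simplified, ok = simplify(model)
assert ok, "simplified model does not match the original"

onnx.save(model_simplified, "bert_simplified.onnx")
```

A smaller expression tree from the simplified graph also means less recursion during task extraction, which fits the stack overflow theory above.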

I'll try converting the model from TensorFlow to ONNX later and running onnx-simplifier on it. I don't know whether the number of layers can be reduced for InceptionV3, though.

Is there a simplifier like onnx-simplifier for TensorFlow?

I ran into the same problem, changed the threading stack_size from (1024*1024*3) to (1024*1024*128), and then it worked. Perhaps you can try a bigger stack_size.
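For anyone else trying this, a sketch of the workaround as a standalone helper (`_build_in_thread` is a hypothetical name; in TVM itself the change goes where relay_integration.py spawns its build thread):

```python
import threading

def _build_in_thread(fn, *args):
    """Run fn in a worker thread with an enlarged stack.

    Task extraction recurses deeply over the Relay expression tree, so
    give the thread a 128 MB stack (the value that worked for me)
    instead of the 3 MB used before. threading.stack_size() must be
    called before the Thread object is created.
    """
    threading.stack_size(1024 * 1024 * 128)
    build_thread = threading.Thread(target=fn, args=args)
    build_thread.start()
    build_thread.join()
```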

Is your env Windows? Mine is Linux, and it doesn't work there. I can use ulimit to work around the problem on some machines, but not on others.
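For the ulimit route, the in-process equivalent looks like this. Note that on glibc the default stack size of new threads is derived from this limit, but an explicit threading.stack_size() call takes precedence, which may be why it helps on some machines and not others:

```python
import resource

# Equivalent of raising `ulimit -s`: lift the soft stack limit to the
# hard limit for this process before running task extraction.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
```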

I use Ubuntu 16.04, which is Linux.

I tried changing stack_size to a bigger value and it worked really well.

The PR under review should fix the stack overflow problem for models without control flow: https://github.com/apache/incubator-tvm/pull/5019