Stack overflow in task extraction

In keeping my fork updated, I've found that the latest from master gives me a stack overflow in autotvm.task.extract_from_program.

I believe it came from this check-in, but I'm not 100% sure:

I’m more than happy to debug and give more info, I just need a hint of where to start putting the breakpoints :slight_smile:

Here is the execution path:

[13:07:43] C:\Jenkins\workspace\mxnet-tag\mxnet\src\nnvm\legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[13:07:43] C:\Jenkins\workspace\mxnet-tag\mxnet\src\nnvm\legacy_json_util.cc:217: Symbol successfully upgraded!
extract_from_program
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\jeremiah.morrill\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\jeremiah.morrill\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\src\mediatubes\ml\tensortubes\src\third_party\tvm\.site-packages\python\tvm\autotvm\task\relay_integration.py", line 54, in _lower
    compiler.lower(mod, target=target)
  File "C:\src\mediatubes\ml\tensortubes\src\third_party\tvm\.site-packages\python\tvm\relay\backend\vm.py", line 455, in lower
    self._lower(mod, target, target_host)
  File "C:\src\mediatubes\ml\tensortubes\src\third_party\tvm\.site-packages\python\tvm\_ffi\_ctypes\function.py", line 206, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)) != 0:
OSError: exception: stack overflow

@jmorrill Thanks for reporting the error. Would you mind sharing your script to reproduce the error?

I changed to a different model and no longer get the stack overflow exception. The failing model did work with a prior commit.

The model that stack-overflows is here (mxnet). My get_network below should work with this model; hopefully you can reproduce it.

The script is the basic one from here with these parameters/changes:

import mxnet as mx
from tvm import relay

def get_network(name, batch_size):
    input_shape = (batch_size, 3, 112, 112)
    output_shape = (batch_size, 1000)
    # load the pretrained arcface checkpoint
    prefix, epoch = "../../arcface/model", 0
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    dtype = 'float32'
    # convert the MXNet graph (with weights) to a Relay module
    mod, params = relay.frontend.from_mxnet(sym, shape={'data': input_shape}, dtype=dtype,
                                            arg_params=arg_params, aux_params=aux_params)

    return mod, params, input_shape, output_shape
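
Task extraction is then invoked on this network just as in the tutorial. A minimal sketch of my setup (the cuda target is what I use; the op list follows the tutorial of that era, newer TVM spells it relay.op.get('nn.conv2d'), and the exact extract_from_program signature may vary by version):

from tvm import autotvm, relay

mod, params, input_shape, _ = get_network('arcface', batch_size=1)
# this is the call that overflows the stack for me
tasks = autotvm.task.extract_from_program(mod['main'], target='cuda',
                                          params=params,
                                          ops=(relay.op.nn.conv2d,))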

from tvm import autotvm
from tvm.contrib import cc

# log_file is defined earlier in the script, as in the tutorial
tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',
    'n_trial': 8000,
    'early_stopping': 600,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(n_parallel=18, timeout=25, build_func=cc.create_shared),
        # runner=autotvm.LocalRunner(number=5, repeat=2, timeout=5, min_repeat_ms=100),
        runner=autotvm.RPCRunner(
            '1080ti',  # change the device key to your key
            '10.1.30.5', 9190,
            n_parallel=1,
            number=20, repeat=3, timeout=4, min_repeat_ms=150)
    ),
}
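
The rest of the script is unchanged; each extracted task is tuned with these options, roughly as in the tutorial's tune_tasks helper (a sketch, using the parameter names above):

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

for i, task in enumerate(tasks):
    # one XGBoost-based tuner per extracted task
    tuner_obj = XGBTuner(task, loss_type='rank')
    tuner_obj.tune(n_trial=min(tuning_option['n_trial'], len(task.config_space)),
                   early_stopping=tuning_option['early_stopping'],
                   measure_option=tuning_option['measure_option'],
                   callbacks=[autotvm.callback.log_to_file(tuning_option['log_filename'])])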

@jmorrill I didn’t get the stack overflow error. I tried both the llvm and cuda targets, but neither hit the error. It's probably because I'm on a Linux environment; I don't have a working Windows environment. Would you mind debugging this a little and checking which pass runs into the stack overflow?

Thank you so much for looking into this. Seeing as you determined it to be a platform difference (and probably not a code error), I tried this hack in relay_integration.py. Threads on Windows get a 1 MB stack by default, far less than the typical 8 MB on Linux, so deep recursion in the Relay passes hits the limit on Windows first:

# bump the stack size before starting the thread; threads started
# after this call pick up the new value
threading.stack_size(1024 * 1024 * 3)
build_thread.start()
threading.stack_size(0)  # Zero is platform default?
build_thread.join()
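
For context, build_thread here is the thread that relay_integration.py starts to run _lower (it is the Thread-1 at the top of the traceback above); paraphrased, not the exact TVM source:

build_thread = threading.Thread(target=_lower, args=(mod, target, params))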

This seems to fix my issue! :slight_smile: Any suggestions on something better? I'd make it a PR if so.

Thanks again for your help!!

Thanks for debugging this. Could you elaborate on it a bit more? So you manually change the stack size before starting the thread. But why do you change the stack size back to 0 before the join?

But why do you change the stack size back to 0 before the join?

I could be wrong, but threading.stack_size(...) seems to be a global function/value… in which case it may be rude to change this value globally.

If I’m reading the docs correctly, “0” means “use some default”.

Return the thread stack size used when creating new threads. The optional size argument specifies the stack size to be used for subsequently created threads, and must be 0 (use platform or configured default) or a positive integer value of at least 32,768 (32 KiB). If size is not specified, 0 is used.
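
In other words, calling it with a size sets the value used for threads started afterwards and returns the setting that was in effect before. A quick standalone check of that behavior (the values here match what I see in my test):

import threading

print(threading.stack_size())                # 0, i.e. the platform default is in effect
old = threading.stack_size(1024 * 1024 * 3)  # set 3 MiB; returns the previous setting
print(old)                                   # 0 again, since nothing had changed it
threading.stack_size(old)                    # put the default back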

Maybe a more robust change would be this, in case someone changed it previously:

# stack_size(...) returns the value that was in effect before the call
old_stack_size = threading.stack_size(1024 * 1024 * 3)
build_thread.start()
threading.stack_size(old_stack_size)  # restore the previous setting
build_thread.join()

In my test, old_stack_size returns as 0.

Yes, that looks good to me. Could you send a PR to fix this?

Also, I'd suggest moving the line that restores the old stack size to after the thread is joined.
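
That is, something like:

old_stack_size = threading.stack_size(1024 * 1024 * 3)
build_thread.start()
build_thread.join()
# restore the previous setting once the thread has been joined
threading.stack_size(old_stack_size)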