TVM deadlocking when calling a Python PackedFunc from generated code

I’m running into an interesting bug. In the custom datatypes framework, I’ve made it possible to implement your custom datatype operators in Python by passing TVM some Python PackedFuncs, which are then called in place of the operators. However, I’ve found that programs which use this functionality end up waiting on synchronization primitives when there is more than one thread:

(lldb) thread info all
thread #2: tid = 0x86e10, 0x00007fff6b9fd26e libsystem_kernel.dylib`swtch_pri + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
thread #3: tid = 0x86e29, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #4: tid = 0x86e2a, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #5: tid = 0x86e2b, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #6: tid = 0x86e2c, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #7: tid = 0x86e2d, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #8: tid = 0x86e84, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #9: tid = 0x86e85, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #10: tid = 0x86e86, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #11: tid = 0x86e87, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #12: tid = 0x86e88, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10
thread #13: tid = 0x86ecb, 0x00007fff6ba00916 libsystem_kernel.dylib`__psynch_cvwait + 10

The interesting thing is that it gets through most of the work before hanging: for example, if I’m casting a vector of size 8 to a custom datatype, it gets through 7 of the casts before locking up.
I’m trying to debug this. Since it seems to be stuck in synchronization primitives, my assumption is that the problem isn’t in the generated code itself, but in the surrounding runtime system that waits for the code to complete. Any guidance or suggestions on debugging this would be much appreciated!

For background: the custom datatype framework I’m working on lets users register new datatypes and then register operator implementations over those datatypes. The functionality I’m describing above specifically lets users implement those operators in Python. For example, I might create a mytype datatype and write its add operator in Python:

def mytypeAdd(a, b):
  return a + b

This function will get wrapped into a PackedFunc. When TVM compiles the code and sees an add of two mytypes, it will compile this into a call_packed_lowered intrinsic which calls the Python PackedFunc implementing the add.
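For concreteness, the wrapping goes through TVM’s normal global function registry; here is a minimal sketch, with a global name I picked purely for illustration:

import tvm

# Registering a Python function under a global name wraps it in a
# PackedFunc that can later be looked up (and called) by that name,
# including from compiled code.
@tvm.register_func("mytype.add")
def mytypeAdd(a, b):
  return a + b

# The wrapped PackedFunc behaves like any other global function:
f = tvm.get_global_func("mytype.add")
assert f(1.0, 2.0) == 3.0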

It might have to do with a current limitation: TVM’s PackedFunc call in the generated code is not thread-safe, because the temporary arrays for the call stack and return arguments are allocated at the beginning of the function body instead of at the location of the parallel loop. https://github.com/apache/incubator-tvm/blob/master/src/pass/lower_tvm_builtin.cc

See what happens when you disable the parallel construct. We should also look into this and try to fix the codegen issue with the PackedFunc call.
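Something along these lines would show the difference; this is only a rough sketch against the tvm.compute / call_packed API, using a placeholder packed function and an int32 result (tvm.call_packed defaults to int32), not the real custom-datatype lowering:

import numpy as np
import tvm

# Placeholder packed function standing in for a custom-datatype
# operator; it returns an int because tvm.call_packed produces an
# int32 expression by default.
@tvm.register_func("mytype.add")
def mytypeAdd(a, b):
  return int(a + b)

n = 8
A = tvm.placeholder((n,), name="A", dtype="float32")
B = tvm.placeholder((n,), name="B", dtype="float32")
C = tvm.compute((n,), lambda i: tvm.call_packed("mytype.add", A[i], B[i]), name="C")

def run(parallel):
  s = tvm.create_schedule(C.op)
  if parallel:
    # The suspect case: a parallel loop whose body calls a PackedFunc.
    s[C].parallel(C.op.axis[0])
  f = tvm.build(s, [A, B, C], "llvm")
  ctx = tvm.cpu(0)
  a = tvm.nd.array(np.arange(n, dtype="float32"), ctx)
  b = tvm.nd.array(np.ones(n, dtype="float32"), ctx)
  c = tvm.nd.array(np.zeros(n, dtype="int32"), ctx)
  f(a, b, c)
  return c.asnumpy()

print(run(parallel=False))  # serial loop
print(run(parallel=True))   # parallel loop: the case to compare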

Yeah, I should have noted that this does NOT happen when I set TVM_NUM_THREADS to 1. Seems like you’re probably right.
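For reference, this is how I pin it to a single thread (exporting the variable in the shell before launching the script works just as well):

import os

# Set before importing tvm so the runtime thread pool only ever
# uses one worker thread.
os.environ["TVM_NUM_THREADS"] = "1"

import tvm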

I can write this up as an issue on GitHub so we can open a discussion about how the codegen issue might be fixed? Or is there already a relevant issue?

An issue would be great