Autotuner fails for CNN in new dev branch

I’ve just moved from v0.6 to the development branch (HEAD at the time of writing).

I’m trying to autotune one of my benchmarks, a CNN. It autotunes fine in v0.6; however, with the dev branch it crashes after only a couple of iterations with the error:

[XX:XX:XX] ../src/printer/doc.cc:55: text node: ' an internal invariant was violated while typechecking your program
[XX:XX:XX] ../src/relay/op/nn/convolution.cc:561: Check failed: param->kernel_size.defined() && param->channels.defined(): The kernel size and channels of a Conv must be set or infered by previous pass

along with other possibly relevant information:

  %15 = nn.contrib_conv2d_winograd_without_weight_transform(%11, %14, tile_size=4, padding=[1, 1, 1, 1], kernel_size=[3, 3]) an internal invariant was violated while typechecking your program
  [XX:XX:XX] ../src/relay/op/nn/convolution.cc:561: Check failed: param->kernel_size.defined() && param->channels.defined(): The kernel size and channels of a Conv must be set or infered by previous pass

The model runs fine when not autotuning, and a couple of other models I’ve tried seem to work. The debug output doesn’t mention any other layers failing, but it could be that this is just the first layer to fail.

My uninformed guess from the message is that somewhere along the way the kernel size or channels information for nn.contrib_conv2d_winograd_without_weight_transform is being lost. However, I’m not yet familiar enough with the C++ backend of TVM to know a good place to start tracing.
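
I suppose one place to start from the Python side would be to dump the Relay IR after AlterOpLayout (which, as I understand it, is the pass that rewrites conv2d into the winograd op on arm_cpu) and check whether kernel_size/channels are still set. A rough sketch, assuming mod comes from relay.frontend.from_onnx and log_file is an autotvm tuning log (the exact target/pass API may differ slightly between commits):

import tvm
from tvm import relay, autotvm

def dump_after_alter(mod, target, log_file):
    # Replay the tuned configs so the arm_cpu alter rule picks the winograd
    # template where it did during tuning, then run AlterOpLayout and re-infer
    # types; printing the module shows whether kernel_size/channels survive.
    with autotvm.apply_history_best(log_file):
        with tvm.target.create(target):  # newer commits use tvm.target.Target(target)
            seq = tvm.transform.Sequential([
                relay.transform.InferType(),
                relay.transform.AlterOpLayout(),
                relay.transform.InferType(),
            ])
            with tvm.transform.PassContext(opt_level=3):
                mod = seq(mod)
    print(mod)  # inspect the nn.contrib_conv2d_winograd_without_weight_transform calls
    return mod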

I can provide more information about my setup if needed.

I have not jumped backwards through the commit history to see if there’s an earlier commit on the dev branch that works.

(Off-topic, but does anyone have a recommended automated way of doing this beyond bash scripting? I guess doing some sort of binary search over a range of commits?)

Could you share a script that reproduces the error? I think it’s likely because you didn’t specify the weight shape for the depthwise conv2d op.
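
To illustrate what I mean, a minimal sketch with made-up shapes: if the weight var has no shape and kernel_size/channels are not given either, later passes cannot infer them, which is the kind of situation that check complains about.

from tvm import relay

data = relay.var("data", shape=(1, 32, 56, 56))

# Problematic: weight shape unknown and no kernel_size/channels given,
# so nothing downstream can infer them.
w_unknown = relay.var("weight")
bad = relay.nn.conv2d(data, w_unknown, padding=(1, 1), groups=32)

# Fine: give the weight an explicit shape...
w = relay.var("weight", shape=(32, 1, 3, 3))
good = relay.nn.conv2d(data, w, padding=(1, 1), groups=32)

# ...or pass kernel_size/channels explicitly so they never need to be inferred.
also_good = relay.nn.conv2d(data, w_unknown, padding=(1, 1), groups=32,
                            kernel_size=(3, 3), channels=32)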

An update, with a reproducible example: it seems to run okay on x86 platforms, but on ARM CPU platforms I get the error described above.

Have tried again with a more recent commit, and the same issue occurs.

I’ve found it with two of my test models, which are in the repo as ONNX models. See here.

The two models are ResNet34 and the smaller WRN-40-2. As far as I’m aware, they don’t have any depthwise conv2d ops.
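
(For what it’s worth, a quick way to double-check that is to walk the main function and look for conv2d calls with groups > 1; a sketch, assuming mod is what relay.frontend.from_onnx returns:)

from tvm import relay

def find_depthwise_convs(mod):
    # Collect conv2d calls whose groups attribute is > 1 (grouped/depthwise convs).
    hits = []
    def visit(expr):
        if isinstance(expr, relay.Call) and getattr(expr.op, "name", None) == "nn.conv2d":
            if int(expr.attrs.groups) > 1:
                hits.append(expr)
    relay.analysis.post_order_visit(mod["main"], visit)
    return hits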

You don’t need many trials to see the behaviour.

A couple of example invocations are below:

python tvm_autotune.py --target_type remote --device_key hikey --target_string 'llvm -device=arm_cpu -target=aarch64-linux-gnu' --output_path /tmp/ --host_port 9190  --trials 5

python tvm_autotune.py --target_type local  --target_string 'llvm -target=x86_64-linux-gnu -mcpu=core-avx2' --output_path /tmp/ --trials 5

Could you try applying this patch and see if it fixes the problem?

Thanks for getting back to me on this.

I’ve since updated to a newer commit (95e06b3). However, the ARM CPU autotuning problem now seems to affect all of my autotuning benchmarks.

I’ve tried applying your patch to 95e06b3, but it does not seem to change the behavior.

The autotuning process runs successfully.

However, when evaluating the final model, I get the following error on the host side:

Traceback (most recent call last):
  File "/home/[redacted]/tools/tvm/python/tvm/_ffi/_ctypes/ndarray.py", line 80, in __del__
    check_call(_LIB.TVMArrayFree(self.handle))
  File "/home/[redacted]/tools/tvm/python/tvm/_ffi/base.py", line 329, in check_call
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (7) /home/[redacted]/tools/tvm/build/libtvm.so(TVMArrayFree+0x23) [0x7fe4e0857263]
  [bt] (6) /home/[redacted]/tools/tvm/build/libtvm.so(tvm::runtime::RPCWrappedFunc::RemoteNDArrayDeleter(tvm::runtime::Object*)+0x1f) [0x7fe4e0887cbf]
  [bt] (5) /home/[redacted]/tools/tvm/build/libtvm.so(tvm::runtime::RPCClientSession::FreeHandle(void*, int)+0x7f) [0x7fe4e0881b1f]
  [bt] (4) /home/[redacted]/tools/tvm/build/libtvm.so(+0xd977c1) [0x7fe4e087c7c1]
  [bt] (3) /home/[redacted]/tools/tvm/build/libtvm.so(tvm::runtime::RPCEndpoint::HandleUntilReturnEvent(bool, std::function<void (tvm::runtime::TVMArgs)>)+0x177) [0x7fe4e0879fe7]
  [bt] (2) /home/[redacted]/tools/tvm/build/libtvm.so(tvm::runtime::SockChannel::Send(void const*, unsigned long)+0x20) [0x7fe4e088f010]
  [bt] (1) /home/[redacted]/tools/tvm/build/libtvm.so(tvm::support::Socket::Error(char const*)+0xe6) [0x7fe4e088eef6]
  [bt] (0) /home/[redacted]/tools/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fe4dfeff472]
  File "../src/runtime/rpc/../../support/socket.h", line 362
TVMError: Socket SockChannel::Send Error:Broken pipe

On the device side, I get the error:

malloc(): invalid size (unsorted)

I have tested this on several ARM devices and network benchmarks. The autotuner works fine on the x86 platform running locally.
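
For reference, the evaluation step that triggers this is the usual tune-then-upload-over-RPC flow, roughly as below (paraphrased from my script; the names are mine and the exact API calls may differ slightly between TVM commits):

import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_runtime
from tvm.contrib.util import tempdir

def evaluate_remote(mod, params, target, log_file, device_key, host, port):
    # Compile with the tuned schedules applied.
    # (On newer commits relay.build returns a module factory instead of a 3-tuple.)
    with autotvm.apply_history_best(log_file):
        with tvm.transform.PassContext(opt_level=3):
            graph, lib, params = relay.build(mod, target=target, params=params)

    # Export the compiled library and push it to the board over RPC.
    tmp = tempdir()
    lib.export_library(tmp.relpath("net.tar"))
    remote = autotvm.measure.request_remote(device_key, host, port, timeout=300)
    remote.upload(tmp.relpath("net.tar"))
    rlib = remote.load_module("net.tar")

    # Create the graph runtime on the remote context and time it; this is
    # where the broken pipe / device-side malloc error shows up for me.
    ctx = remote.cpu(0)
    module = graph_runtime.create(graph, rlib, ctx)
    module.set_input(**params)
    ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
    print(ftimer().mean)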

Searching the forum, the only related issue I can find is this one, but it has no resolution.

For completeness, I have also tried the patch on the original commit from this issue (fdc8b0dd1763).

I can confirm that in this case the patch works. It seems I have inadvertently walked into another issue with the newer commit.

Should I keep this thread for that issue, or create a new one?

Did you update TVM on the device side as well? It might be because the newer version is not compatible with older ones.

Good check, yes. I should have clarified that I had done this.

I ensure that both device and host sides are compiled on the same commit (I’ve learned that lesson).

I’ve now applied the patch to an even more recent commit (0e877521f). I guess an error was introduced somewhere leading up to (and including) 95e06b3, which was later corrected.

So to confirm: your patch works, but only on earlier and later versions of the repo that don’t have this other issue.

I’ve not delved too deeply into TVM’s testing infrastructure. Would this have been caught by a CI autotuning test for a conv net on an ARM instance? Are such tests currently in place, and if so, could you point me towards them?