Unable to compile ResNet-18 with different input shape for CUDA target

I am trying to run a ResNet-18 model on a Jetson TX2 with an input data shape of (1, 3, 228, 304) instead of (1, 3, 224, 224). I am cross-compiling and using target="cuda" and target_host="llvm -target=aarch64-linux-gnu". Both the host machine on which I’m compiling and the TX2 are running cuda-8.0 and llvm-4.0.
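For reference, here is roughly how I am building the model. This is a minimal sketch that uses the nnvm.testing ResNet-18 workload as a stand-in for my actual model and weights; the workload helper and the input name "data" are illustrative assumptions, not my exact script.

import nnvm.compiler
import nnvm.testing

# ResNet-18 test workload as a stand-in; my real weights are loaded separately
batch_size = 1
image_shape = (3, 228, 304)
net, params = nnvm.testing.resnet.get_workload(
    num_layers=18, batch_size=batch_size, image_shape=image_shape)

target = "cuda"
target_host = "llvm -target=aarch64-linux-gnu"
shape_dict = {"data": (batch_size,) + image_shape}

# the error below is raised during this call
graph, lib, params = nnvm.compiler.build(
    net, target=target, shape=shape_dict, params=params, target_host=target_host)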

When I try to compile the ResNet-18 model with the input data shape of (1, 3, 228, 304), I get the following error:

tvm._ffi.base.TVMError: [16:00:46] /home/dwofk/tvm/src/schedule/message_passing.cc:36: Check failed: match iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x) domain already inferred, cannot prove their extents are the same 5 vs 8

Error during compile graph
--------------------------
Graph(%input0,
      %input1,
      %input2,
      %input3,
      %input4) {
  %input0, shape=[1,128,29,38]
  %input1, shape=[128,128,3,3]
  %input2, shape=[128,1,1]
  %input3, shape=[128,1,1]
  %input4, shape=[1,128,29,38]
  %2 = conv2d(%input0, %input1, channels='128', use_bias='False', strides='(1, 1)', kernel_size='(3, 3)', dilation='(1, 1)', groups='1', padding='(1, 1)'), shape=[1,128,29,38]
  %4 = broadcast_mul(%2, %input2), shape=[1,128,29,38]
  %6 = broadcast_add(%4, %input3), shape=[1,128,29,38]
  %8 = elemwise_add(%6, %input4), shape=[1,128,29,38]
  %9 = relu(%8), shape=[1,128,29,38]
  ret %9
}
graph_attr_keys = [shape, shape_num_unknown_nodes, dtype, dtype_num_unknown_nodes]

I noticed that this issue is similar to the one in https://github.com/dmlc/nnvm/issues/239. That issue was resolved by adding an extra elif condition to the conv2d_56_64_128 schedule. I attempted to resolve my own issue by adding a similar extra elif condition directly below the existing ones:

# existing cases in the conv2d_56_64_128 schedule
if mark % 8 == 0 and mark % 7 == 0:
    num_thread_x = 8
    vthread_x = 7
elif mark % 4 == 0 and mark % 7 == 0:
    num_thread_x = 4
    vthread_x = 7
# extra case added for my workload
elif mark % 2 == 0 and mark % 19 == 0:
    num_thread_x = 2
    vthread_x = 19

This seems to resolve my issue. I am wondering whether this was the correct approach, or whether there is a better solution.

As @merrymercy noted in the previous thread, this is due to the current static nature of our CUDA schedules; we will soon add templates that address this. As for your workaround: it works, but the performance may not be great.