Relay Alter OP Layout Pass Regression


#1

The x86 autotvm tutorial now has lower inference performance than when using the nnvm IR. The reason is that some incorrect workloads are generated when calling the alter_op_layout pass:

WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake-avx512, workload=('conv2d', (1, 64, 56, 56, 'float32'), (320, 64, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake-avx512, workload=('conv2d', (1, 256, 56, 56, 'float32'), (640, 256, 1, 1, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake-avx512, workload=('conv2d', (1, 512, 28, 28, 'float32'), (1280, 512, 1, 1, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake-avx512, workload=('conv2d', (1, 1024, 14, 14, 'float32'), (2560, 1024, 1, 1, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.

These workloads shouldn’t appear in resnet50.
In https://github.com/dmlc/tvm/blob/master/topi/python/topi/x86/conv2d.py#L289, the kernel has an incorrect shape for these four workloads. It looks like tinfo is not generated correctly in the alter_op_layout pass?

@merrymercy @yzhliu


#2

@kevinthesun This is weird. Could you help debug a little bit more?


When ref_call is a convolution, what are the shapes of these new_args?


#3

Sure. I’ll take a look at it.


#4

Is it the CombineParallelConv2D pass? If this caused the performance regression, we can disable it in the two-branch case.


#5

@vinx13 Yes. I just confirmed that CombineParallelConv2D causes this problem. Can we disable it from the frontend?


#6

Before disabling things by default, it would be really nice to see if we can find a compatible way to make these optimizations play along with each other.


#7

I think we can add a check to disable it only for the resnet case.


#8

I think adding such a check might not be a long-term solution. I haven’t tested other networks, but they could have issues similar to resnet’s. If the issue comes from CombineParallelConv2D itself, it would be really nice to debug and fix it.


#9

CombineParallelConv2D indeed introduces new workloads that need to be tuned. So the question is whether combining conv2d is beneficial. It is helpful for inception. Resnet might not benefit from this pass because it has two branches and doesn’t have combinable subsequent elemwise ops. A possible solution is to check the number of branches and only apply this pass if #branches > 2.
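To make it concrete where the unexpected kernel shapes in the warnings come from, here is a minimal sketch in plain Python. It is a toy model, not TVM code: `combine_kernels` and the branch tuples are hypothetical names for illustration. It assumes the pass concatenates the kernels of parallel conv2d ops along the output-channel axis, and that the two branches in each resnet50 case are the bottleneck 1x1 conv and the 1x1 projection shortcut reading the same input.

```python
# Toy model, not TVM API. Each kernel is (out_channels, in_channels, kh, kw).
def combine_kernels(branches, min_num_branches=2):
    """Return the kernel shape after concatenating the branches along the
    output-channel axis, or None if the branches are left uncombined."""
    if len(branches) < min_num_branches:
        return None
    rest = branches[0][1:]
    # Only kernels with matching input channels and spatial size can be merged.
    if any(b[1:] != rest for b in branches):
        return None
    return (sum(b[0] for b in branches),) + rest


# Assumed resnet50 branch pairs: (bottleneck 1x1 conv, projection shortcut).
resnet50_pairs = [
    [(64, 64, 1, 1), (256, 64, 1, 1)],
    [(128, 256, 1, 1), (512, 256, 1, 1)],
    [(256, 512, 1, 1), (1024, 512, 1, 1)],
    [(512, 1024, 1, 1), (2048, 1024, 1, 1)],
]

# Combining every pair yields the 320/640/1280/2560 output channels
# seen in the "Cannot find config" warnings.
combined = [combine_kernels(pair) for pair in resnet50_pairs]
print(combined)

# With the proposed check (#branches > 2), the two-branch cases are skipped.
gated = [combine_kernels(pair, min_num_branches=3) for pair in resnet50_pairs]
print(gated)
```

Under these assumptions, the four combined shapes match the kernels in the warnings, and requiring more than two branches leaves all four resnet50 cases untouched.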


#10

OK. We might want to solve two potential problems here:

  1. Check when to apply this pass depending on the input graph. (Number of branches?)
  2. As @tqchen mentioned, we need to consider how this pass can cooperate with other passes, such as AlterOpLayout. Both of these passes substitute some subgraphs. One possible solution that comes to mind is to manually apply the CombineParallelConv2D pass and get the modified graph before we do any tensor/graph tuning. That way, we guarantee that all new workloads are tuned and graph tuning can be executed correctly.
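The ordering idea in point 2 can be sketched with a toy model in plain Python. This is not TVM code: `combine_parallel`, `extract_workloads`, and the graph dictionary are hypothetical stand-ins, and the graph contents are made up for illustration. The sketch only shows why extracting tuning tasks from the already-combined graph avoids missing configs at build time.

```python
# Toy model, not TVM API. A "graph" maps an input tensor name to the list of
# conv2d kernel shapes (out_c, in_c, kh, kw) that consume that tensor.
def combine_parallel(graph):
    """Merge parallel convs on the same input when their kernels are compatible."""
    combined = {}
    for inp, kernels in graph.items():
        if len(kernels) > 1 and len({k[1:] for k in kernels}) == 1:
            total_out = sum(k[0] for k in kernels)
            combined[inp] = [(total_out,) + kernels[0][1:]]
        else:
            combined[inp] = list(kernels)
    return combined


def extract_workloads(graph):
    """Collect the set of distinct conv2d workloads that would need tuning."""
    return {k for kernels in graph.values() for k in kernels}


graph = {
    "pool1": [(64, 64, 1, 1), (256, 64, 1, 1)],  # two parallel 1x1 convs
    "relu1": [(64, 64, 3, 3)],                   # a single 3x3 conv
}

# What the compiler sees after the combine pass runs during build.
compiled = combine_parallel(graph)

# Tuning on the original graph misses the combined workload.
tuned_before = extract_workloads(graph)
missing_before = extract_workloads(compiled) - tuned_before

# Tuning on the already-combined graph covers everything the build needs.
tuned_after = extract_workloads(compiled)
missing_after = extract_workloads(compiled) - tuned_after

print(missing_before, missing_after)
```

In this toy setup, tuning before the pass leaves the combined 320-channel workload without a config, while tuning after it leaves nothing missing.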