Grouped convolution performance penalty

Hello there. I am looking at grouped convolution, and am incurring a massive performance penalty when using it.

In the Relay interface, there are different implementations for conv layers when groups==1, and when groups>1.

Ideally, we would hope for a ~2x speedup when switching from a normal convolutional layer to a grouped convolution with groups==2. However, on several platforms I have tried, there is a ~4x slowdown instead.
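(For context, the 2x expectation comes from the FLOP count: with groups=G, each output channel only reads C_in/G input channels, so the multiply-accumulates drop by a factor of G. A rough back-of-envelope check, with a made-up layer shape:)

```python
# Back-of-envelope MAC count for a stride-1, 'same'-padded conv2d layer.
# The shape below is hypothetical, just to illustrate the factor-of-G reduction.
N, C_in, C_out, H, W, K = 1, 128, 128, 56, 56, 3

def conv2d_macs(groups):
    # each output element accumulates over (C_in / groups) * K * K inputs
    return N * H * W * C_out * (C_in // groups) * K * K

print(conv2d_macs(1), conv2d_macs(2))  # groups=2 needs exactly half the MACs
```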

I have been looking into the tvm implementation, but am not yet familiar enough with the design.

Any insights into why the penalty is happening?

You can see this notebook which demonstrates the slowdown.
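For reference, a minimal sketch of the kind of comparison the notebook makes (the layer shape is made up, and the API spellings assume a recent TVM; older versions use tvm.contrib.graph_runtime and tvm.context instead):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor  # older TVM: tvm.contrib.graph_runtime

def bench(groups, target="llvm"):
    data = relay.var("data", shape=(1, 128, 56, 56))
    weight = relay.var("weight", shape=(128, 128 // groups, 3, 3))
    out = relay.nn.conv2d(data, weight, kernel_size=(3, 3),
                          padding=(1, 1), groups=groups)
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    dev = tvm.device(target, 0)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", np.random.rand(1, 128, 56, 56).astype("float32"))
    m.set_input("weight", np.random.rand(128, 128 // groups, 3, 3).astype("float32"))
    return m.module.time_evaluator("run", dev, number=100)().mean

print("groups=1:", bench(1))
print("groups=2:", bench(2))  # ideally ~2x faster; in practice much slower here
```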

Group conv on the llvm target only has a default schedule. It can be improved by implementing a new AutoTVM schedule template.

Thank you for the pointer.

As I understand it, schedules are stored in topi/python/topi/. And the schedule for conv2d on x86 is in topi/python/topi/x86/conv2d.py.

I've been looking at the tutorial Introduction to TOPI, but I'm still trying to understand how the various tvm Python decorators are used to hook up schedules that I write.

I’m looking to see if I can add x86 and arm_cpu schedules for group_conv2d_nchw.

I imagine I should add a group_conv2d.py to both topi/python/topi/x86 and topi/python/topi/arm_cpu. However, is there anything else that is essential, or any docs that might be helpful?

Should my decorators in those group_conv2d.py files be:

@autotvm.register_topi_compute(group_conv2d_NCHW, 'cpu', 'direct')

@autotvm.register_topi_compute(group_conv2d_NCHW, 'arm_cpu', 'direct')

And is that sufficient for those schedules to be used automatically by tvm?

Or, perhaps a simpler question:

In my explorations, I have tried to force usage of topi.nn.group_conv2d_nchw when groups==1 (by commenting out the first if statement case in python/tvm/relay/op/nn/_nn.py:compute_conv2d).

However, the creation of the tvm.compute definition in topi/python/topi/nn/conv2d.py:group_conv2d_nchw fails with RuntimeError("Cannot find workload in attribute of this schedule").

I can't see anything in this function that would cause this, unless the tag='group_conv2d_nchw' in the tvm.compute definition is getting picked up somewhere.

Any suggestions or intuitions of how this works?

For the compute part, add autotvm.register_topi_compute(nn.group_conv2d_nchw, ['cpu'], 'direct', nn.group_conv2d_nchw.fdefault), or @autotvm.register_topi_compute(group_conv2d_NCHW, 'cpu', 'direct') if you have a custom compute function.

For the schedule part, add
@autotvm.register_topi_schedule(generic.schedule_group_conv2d_nchw, ['cpu'], ['direct']) to your schedule function.
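Putting both decorators together, a minimal topi/python/topi/x86/group_conv2d.py might look roughly like this; the schedule body below is just a naive placeholder I made up (fuse and parallelize the outer loops), not the upstream template:

```python
import tvm
from tvm import autotvm
from .. import nn, generic
from ..util import traverse_inline

# Compute: reuse the default TOPI definition of group conv2d on CPU targets.
autotvm.register_topi_compute(nn.group_conv2d_nchw, ['cpu'], 'direct',
                              nn.group_conv2d_nchw.fdefault)

# Schedule: register something for 'cpu' so relay can find it; this is only a
# baseline, not a tuned template.
@autotvm.register_topi_schedule(generic.schedule_group_conv2d_nchw, ['cpu'], ['direct'])
def schedule_group_conv2d_nchw(cfg, outs):
    s = tvm.create_schedule([x.op for x in outs])

    def _callback(op):
        if 'group_conv2d_nchw' in op.tag:
            n, co, h, w = s[op].op.axis
            s[op].parallel(s[op].fuse(n, co))

    traverse_inline(s, outs[0].op, _callback)
    return s
```

You would also need to import the new file from topi/python/topi/x86/__init__.py so the registrations actually run when the x86 backend is loaded.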

The error comes from https://github.com/dmlc/tvm/blob/f2ddb1961c2dc4181b399093ce698cbf0dd6536d/python/tvm/autotvm/task/topi_integration.py#L433
In AutoTVM, both compute and schedule are decorated functions. The schedule function will try to find the workload from its input. The workload is registered in https://github.com/dmlc/tvm/blob/f2ddb1961c2dc4181b399093ce698cbf0dd6536d/python/tvm/autotvm/task/topi_integration.py#L351
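My reading of those two spots (treat the snippet as an illustration, not the real code): the decorated compute re-creates the output op with a 'workload' entry in its attrs, and the decorated schedule walks its input tensors looking for that attribute. Calling topi.nn.group_conv2d_nchw directly bypasses the decorated path, so the attribute is never attached and the lookup raises exactly that RuntimeError:

```python
# Simplified stand-in for what the AutoTVM topi integration does internally.
def _find_workload(outs):
    for t in outs:                      # the schedule receives the output tensors
        if 'workload' in t.op.attrs:    # attached by the decorated compute
            return t.op.attrs['workload']
    raise RuntimeError("Cannot find workload in attribute of this schedule")
```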

How can I improve the grouped convolution performance? Right now it is much slower than mxnet inference: tvm takes 570ms while mxnet takes 45ms for a resnet50 using group conv on x86.

You can copy the schedule for conv2d; this will at least bring some improvement. You need to write a specific schedule template for group_conv2d if you want further improvement.

How can I use the schedule of conv2d for group_conv2d?

The schedule for conv2d is here: https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d_direct.py

The group_conv2d schedule is here: https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/group_conv2d_nchw.py
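Once a schedule template is registered for group_conv2d_nchw (as in the CUDA file above), the standard AutoTVM tuning flow applies. A rough sketch, assuming you already have a relay module mod and params from a frontend import (exact argument spellings vary a bit between TVM versions):

```python
import tvm
from tvm import autotvm, relay

# Extract tunable tasks (grouped convolution goes through nn.conv2d in relay).
tasks = autotvm.task.extract_from_program(mod["main"], target="llvm",
                                          params=params,
                                          ops=(relay.op.get("nn.conv2d"),))

measure = autotvm.measure_option(builder=autotvm.LocalBuilder(),
                                 runner=autotvm.LocalRunner(number=10))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=200, measure_option=measure,
               callbacks=[autotvm.callback.log_to_file("group_conv2d.log")])

# Compile with the best configs found during tuning applied.
with autotvm.apply_history_best("group_conv2d.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)
```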

Did you solve the problem? I have the same issue: when I set groups=2 for group convolution, the speed is much slower than with groups=1.