How to fuse conv2d and following elemwise op?


#1
import topi
import tvm

# feature format: NCHW
feature = tvm.placeholder((16, 3, 244, 244), name="feature")

# kernel format: output_channel, input_channel, kernel_height, kernel_width
kernel = tvm.placeholder((10, 3, 16, 16), name="kernel")

output_data = topi.nn.conv2d(feature, kernel, 1, 0)

relu_result = topi.nn.relu(output_data)

# schedule the final op so that both stages appear in the lowered code
s = tvm.create_schedule(relu_result.op)

print(tvm.lower(s, [feature, kernel, relu_result], simple_mode=True))

This code generates two ‘produce’ blocks; is it possible to fuse them through the schedule primitive API?


#2

Yes, the key is to use compute_at(…). For example, the x86 schedule uses it here to fuse convolution with the following operation (bias add, batch norm, relu).
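Conceptually, compute_at moves the producer's loop nest inside a chosen axis of the consumer, so each intermediate value is produced right where it is consumed. A minimal NumPy sketch of the before/after semantics (hypothetical small single-channel shapes, not the TVM API):

```python
import numpy as np

# Hypothetical shapes, much smaller than the 244x244 example above.
H, W, K = 8, 8, 3
feature = np.random.rand(H, W)
kernel = np.random.rand(K, K)
OH, OW = H - K + 1, W - K + 1

# Unfused: two separate "produce" loops with a full intermediate buffer.
conv = np.empty((OH, OW))
for i in range(OH):
    for j in range(OW):
        conv[i, j] = np.sum(feature[i:i+K, j:j+K] * kernel)
relu_unfused = np.maximum(conv, 0.0)

# Fused (what compute_at achieves): the conv value is computed "at" the
# consumer's loop axis, so no full intermediate buffer is materialized.
relu_fused = np.empty((OH, OW))
for i in range(OH):
    for j in range(OW):
        val = np.sum(feature[i:i+K, j:j+K] * kernel)
        relu_fused[i, j] = max(val, 0.0)

assert np.allclose(relu_unfused, relu_fused)
```

Both loops compute the same result; the fused version simply eliminates the intermediate buffer and the second pass over memory.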


#3

Thanks for the tip. I have figured out how to use compute_at() to fuse conv2d and an elemwise op.

Now I am trying to fuse the next conv2d as well, but TVM reports ‘Invalid schedule’:

# feature format: NCHW
feature = tvm.placeholder((16, 3, 244, 244), name="feature")

# kernel format: output_channel, input_channel, kernel_height, kernel_width
kernel = tvm.placeholder((10, 3, 16, 16), name="kernel")

output_data = topi.nn.conv2d(feature, kernel, 1, 0)

relu_result = topi.nn.relu(output_data)

# the second conv consumes 10 channels, so it needs its own kernel
kernel2 = tvm.placeholder((10, 10, 16, 16), name="kernel2")

conv2_result = topi.nn.conv2d(relu_result, kernel2, 1, 0)

ss = tvm.create_schedule(conv2_result.op)

ss[output_data].compute_at(ss[relu_result], relu_result.op.axis[3])

ss[relu_result].compute_at(ss[conv2_result], conv2_result.op.axis[0])

print(tvm.lower(ss, [feature, kernel, kernel2, conv2_result], simple_mode=True))

Is this feasible, or do I have to split the input feature manually?

Thanks a lot


#4

Fusing multiple convolutions is not possible.


#5

Why is fusing multiple convolutions impossible?


#6

Imagine how you would implement a fused convolution. Let’s say we target GPU. Before you can start the second convolution on a single pixel, you have to wait for the neighboring pixels to finish their first convolution. This requires a global sync at the shared-memory boundary. Since we then need to store the output of the first convolution in global memory anyway, we gain no benefit from fusing.

For other architectures it might be doable, but at least in TVM we don’t fuse consecutive convolutions.
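To make the dependency concrete, here is a 1-D NumPy analogue (hypothetical shapes, kernels of all ones for simplicity): one element of the second convolution's output reads a whole window of neighboring first-convolution outputs, which is what forces the global sync described above.

```python
import numpy as np

# Hypothetical 1-D analogue: each second-conv output element depends on
# K neighboring first-conv outputs, so a thread computing one element
# cannot proceed until its neighbors' first-conv results exist.
K = 3
x = np.arange(10, dtype=float)
w1 = np.ones(K)
w2 = np.ones(K)

conv1 = np.convolve(x, w1, mode="valid")       # length 8
conv2 = np.convolve(conv1, w2, mode="valid")   # length 6

# conv2[i] reads conv1[i : i+K]: a window of neighbors, not a single value.
i = 2
assert conv2[i] == np.dot(conv1[i:i+K], w2[::-1])
```

On a GPU, the elements of conv1 in that window may have been produced by different thread blocks, hence the need for a global synchronization point between the two convolutions.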


#7

I would probably differentiate between:

  1. NNVM (at least v1) had fusion rules which prevented automatic fusion (at the NNVM level) of two neighbouring convolution layers, so all automatically generated TVM “tasks” (i.e. compositions of stages) contained only one conv layer.
  2. It is not possible to generate TVM tasks which describe two neighbouring convs.
  3. It is not possible to use TVM scheduling primitives (i.e. compute_at) to fuse two convs.

AFAIK:

  1. Is true, but it is a limitation posed by how NNVM (v1?) was used during operator fusion.
  2. Is false. You can check by defining two tvm.compute stages which describe two conv2ds and using tvm.lower to get a printout:
produce conv1_res{
//code which implements conv2d goes here
}
produce conv2_res{
//code which implements conv2d goes here
}
  3. Is undefined (I haven’t tried it). Conceptually, I think it is possible, since there is an obvious producer-consumer relation and the tensor shape relations are also known.

#8

You mention that it requires a global sync; however, the sync is not necessary when we allow redundant computation.
There are many examples in the Halide papers.
In fact, where and when to compute the pixels brings different trade-offs between producer-consumer locality, input locality, and redundant computation.
In my opinion, fusing two convs can also open a larger exploration space for performance tuning.
I think TVM is capable of generating code that fuses two convs, but it cannot do so now.
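The redundant-computation alternative can be sketched in a 1-D NumPy analogue (hypothetical tile size and shapes): each output tile of the second convolution recomputes the overlapping halo of the first convolution from the raw input, so no tile has to wait on a globally shared intermediate.

```python
import numpy as np

K = 3
x = np.arange(16, dtype=float)
w1 = np.ones(K)
w2 = np.ones(K)

# Reference: two global passes (this is the version that needs a sync
# between producing conv1 and consuming it).
conv1 = np.convolve(x, w1, mode="valid")
ref = np.convolve(conv1, w2, mode="valid")   # length 16 - 2*(K-1) = 12

# Fused with redundant computation: each tile of conv2 output recomputes
# the conv1 halo it needs directly from the input, independently of
# every other tile -- no synchronization, at the cost of overlapping work.
tile = 4
out = np.empty_like(ref)
for t0 in range(0, len(ref), tile):
    t1 = min(t0 + tile, len(ref))
    # conv2[t0:t1] needs conv1[t0 : t1+K-1], which needs x[t0 : t1+2*(K-1)]
    local_in = x[t0 : t1 + 2 * (K - 1)]
    local_c1 = np.convolve(local_in, w1, mode="valid")   # recomputed halo
    out[t0:t1] = np.convolve(local_c1, w2, mode="valid")

assert np.allclose(out, ref)
```

The halo elements at tile boundaries are computed twice, which is exactly the locality-versus-redundancy trade-off discussed in the Halide work.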