Padded conv using tvm.select

I am trying to write a conv_padded operator, i.e., instead of padding the input image, we keep the image size the same and use 'if' conditions to compute the output.

Example of 1D padded conv

for (ow, 0, 16) {
  for (k, 0, 3) {
    if (ow + k - 1 >= 0 && ow + k - 1 < 16)
      conv_padded[ow] += data[ow + k - 1] * kernel[k];
  }
}

The problem is how to represent that 'if' condition. I don't want to use ir_builder because that prevents us from using schedule optimizations like tiling and split.

I tried using tvm.select, but it runs into an infinite loop.

import tvm

def test_padded_conv():
    data = tvm.placeholder((16, ), name='data')
    kernel = tvm.placeholder((3, ), name='kernel')
    k = tvm.reduce_axis((0, 3), name='kh')
    # Unpadded conv looks something like this
    # conv_padded = tvm.compute((16, ),
    #                           lambda oh: tvm.sum(data[oh + k - 1] * kernel[k],
    #                                              axis=[k]),
    #                           name="conv_padded")
    conv_padded = tvm.compute(
        (16, ),
        lambda oh: tvm.sum(tvm.select(oh + k - 1 >= 0,
                                      data[oh + k - 1] * kernel[k],
                                      0),
                           axis=[k]),
        name="conv_padded")
    s = tvm.create_schedule(conv_padded.op)
    print(tvm.lower(s, [data, kernel], simple_mode=True))

The error is:

Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <object repr() failed> ignored

Any ideas on how to solve this?

I think if you add explicit padding and do compute_inline on it, you get what you want. Have you tried that?

@masahi With explicit padding, we would be making the input matrix larger or creating an intermediate matrix with padded zeros. The use case I am interested in is something like VTA/FPGA, which does not have a way to explicitly insert zeros. Currently, padding is done on the host in the case of VTA. I am interested in writing a kernel that can do padded_conv on the FPGA itself.

Yes, but if you do compute_inline() on the padding stage, you don't allocate another, bigger matrix; the padding logic gets embedded inside the convolution's inner loop. Here is an example of its use in the CUDA backend.
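
For illustration, here is a minimal sketch of that pattern with the padding stage written out by hand instead of through a helper. The function name and the split factor are only illustrative, and it assumes the old 0.x-style API used elsewhere in this thread (tvm.compute, tvm.if_then_else, tvm.all):

import tvm

def padded_conv1d_inline():
    data = tvm.placeholder((16, ), name='data')
    kernel = tvm.placeholder((3, ), name='kernel')

    # Explicit padding stage: 18 logical elements, one zero at each end.
    data_pad = tvm.compute(
        (18, ),
        lambda i: tvm.if_then_else(tvm.all(i >= 1, i < 17),
                                   data[i - 1],
                                   tvm.const(0.0, data.dtype)),
        name='data_pad')

    k = tvm.reduce_axis((0, 3), name='kh')
    conv = tvm.compute(
        (16, ),
        lambda oh: tvm.sum(data_pad[oh + k] * kernel[k], axis=[k]),
        name='conv_padded')

    s = tvm.create_schedule(conv.op)
    # Inlining folds the if_then_else guard into the convolution's inner
    # loop, so the 18-element data_pad buffer is never allocated.
    s[data_pad].compute_inline()
    # Normal schedule primitives like split still apply to the remaining stage.
    oh_outer, oh_inner = s[conv].split(conv.op.axis[0], factor=4)
    print(tvm.lower(s, [data, kernel], simple_mode=True))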

@masahi This is awesome. Thanks!

Sharing the results.

Explicit padding without compute_inline

produce data_pad {
  for (i0, 0, 18) {
    data_pad[i0] = tvm_if_then_else(((1 <= i0) && (i0 < 17)), data[(i0 + -1)], 0.000000f)
  }
}
produce conv_padded {
  for (oh, 0, 16) {
    conv_padded[oh] = 0.000000f
    for (kh, 0, 3) {
      conv_padded[oh] = (conv_padded[oh] + (data_pad[(oh + kh)]*kernel[kh]))
    }
  }
}

With compute_inline

produce conv_padded {
  for (oh, 0, 16) {
    conv_padded[oh] = 0.000000f
    for (kh, 0, 3) {
      conv_padded[oh] = (conv_padded[oh] + (tvm_if_then_else((((1 - kh) <= oh) && (oh < (17 - kh))), data[((oh + kh) + -1)], 0.000000f)*kernel[kh]))
    }
  }
}

The Python schedule:

import tvm
from topi.nn import pad  # assuming pad here is TOPI's zero-padding helper

def test_padding():
    data = tvm.placeholder((16, ), name='data')
    kernel = tvm.placeholder((3, ), name='kernel')
    data_pad = pad(data, (1, ), name="data_pad")
    k = tvm.reduce_axis((0, 3), name='kh')
    conv_padded = tvm.compute((16, ),
                              lambda oh: tvm.sum(data_pad[oh + k] * kernel[k],
                                                 axis=[k]),
                              name="conv_padded")
    s = tvm.create_schedule(conv_padded.op)
    s[data_pad].compute_inline()
    print(tvm.lower(s, [data, kernel], simple_mode=True))
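
To sanity-check the inlined schedule, something like the following could be appended at the end of test_padding. This is a rough sketch assuming the old 0.x runtime API (tvm.build, tvm.nd.array, tvm.cpu); it compares the result against NumPy's 'same'-mode cross-correlation, which matches a length-3 kernel with one zero of padding on each side:

    # Appended inside test_padding(): build the schedule and check numerically.
    import numpy as np
    f = tvm.build(s, [data, kernel, conv_padded], target='llvm')
    ctx = tvm.cpu(0)
    a = tvm.nd.array(np.random.uniform(size=16).astype('float32'), ctx)
    w = tvm.nd.array(np.random.uniform(size=3).astype('float32'), ctx)
    out = tvm.nd.array(np.zeros(16, dtype='float32'), ctx)
    f(a, w, out)
    # Reference: zero-pad by one on each side and slide the kernel, i.e.
    # NumPy's 'same'-mode cross-correlation of data with the kernel.
    ref = np.correlate(a.asnumpy(), w.asnumpy(), mode='same')
    np.testing.assert_allclose(out.asnumpy(), ref, rtol=1e-5)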