Problem with trying to tensorize ROI Pooling


#1

Hi, everyone.

I’m trying to tensorize the ROI Pooling operator, using an intrin that merges vectors to find their element-wise maximum. But tensorize fails in InferTensorizeRegion with the following error:

  File "/home/jian/environments/tvm-official-env/tvm/src/op/tensorize.cc", line 105
TVMError: Check failed: r.defined(): cannot deduce region of tensorized scope for input Tensor(shape=[1, 32, 18, 256], op.name=cdata)

I think the error comes from the nested Load (a Call node) in the ROI Pooling compute body: EvalSet can’t evaluate the range of a Call node (BTW, it can’t evaluate a Cast node either). How can I solve this problem?

Details

The intrin declaration looks like this:

def vector_merge_max(dtype):
    in1 = tvm.placeholder((1, 32), dtype, 'in')
    k = tvm.reduce_axis((0, 1), 'k')
    out = tvm.compute((32,), lambda i: tvm.max(in1[k, i], axis=k), 'out')
    in_buf, out_buf = ......  # omitted
    def lower_func(ins, outs):
        ......  # omitted
    return tvm.decl_tensor_intrin(out.op, lower_func, name='vector_merge_max', binds={in1: in_buf, out: out_buf})

The intrin itself works fine: using it to tensorize ordinary max pooling generates correct code.

Then I used the ROI Pooling compute from topi/vision/rcnn/roi_pool.py, with small modifications (the data layout is changed to NHWC, and it uses the tvm.max reducer instead). After scheduling, the lowered IR looks like this:

produce compute {
  for (i, 0, 100) {
    for (ph, 0, 4) {
      for (pw, 0, 4) {
        for (c.outer.init, 0, 16) {
          for (c.inner.init, 0, 32) {
            roi_pool[((((((((i*4) + ph)*4) + pw)*16) + c.outer.init)*32) + c.inner.init)] = -65504.000000h
          }
        }
        for (rh, 0, (min(max((int32(ceil((((float32((ph + 1))*float32(max(((int32(round(rois[((i*5) + 4)])) - int32(round(rois[((i*5) + 2)]))) + 1), 1)))*0.250000f) + -0.000010f))) + int32(round(rois[((i*5) + 2)]))), 0), 32) - min(max((int32(floor((((float32(ph)*float32(max(((int32(round(rois[((i*5) + 4)])) - int32(round(rois[((i*5) + 2)]))) + 1), 1)))*0.250000f) + 0.000010f))) + int32(round(rois[((i*5) + 2)]))), 0), 32))) {
          for (rw.outer, 0, (min(max((int32(ceil((((float32((pw + 1))*float32(max(((int32(round(rois[((i*5) + 3)])) - int32(round(rois[((i*5) + 1)]))) + 1), 1)))*0.250000f) + -0.000010f))) + int32(round(rois[((i*5) + 1)]))), 0), 18) - min(max((int32(floor((((float32(pw)*float32(max(((int32(round(rois[((i*5) + 3)])) - int32(round(rois[((i*5) + 1)]))) + 1), 1)))*0.250000f) + 0.000010f))) + int32(round(rois[((i*5) + 1)]))), 0), 18))) {
            for (c.outer, 0, 16) {
              // for(rw.inner, 0, 1): this is a dummy reduce axis, and tensorize will be done on this axis.
              for (c.inner, 0, 32) {
                roi_pool[((((((((i*4) + ph)*4) + pw)*16) + c.outer)*32) + c.inner)] = max(roi_pool[((((((((i*4) + ph)*4) + pw)*16) + c.outer)*32) + c.inner)], data[((((((min(max((int32(floor((((float32(pw)*float32(max(((int32(round(rois[((i*5) + 3)])) - int32(round(rois[((i*5) + 1)]))) + 1), 1)))*0.250000f) + 0.000010f))) + int32(round(rois[((i*5) + 1)]))), 0), 18) + (((min(max((int32(floor((((float32(ph)*float32(max(((int32(round(rois[((i*5) + 4)])) - int32(round(rois[((i*5) + 2)]))) + 1), 1)))*0.250000f) + 0.000010f))) + int32(round(rois[((i*5) + 2)]))), 0), 32) + (int32(rois[(i*5)])*32)) + rh)*18)) + rw.outer)*16) + c.outer)*32) + c.inner)])
              }
            }
          }
        }
      }
    }
  }
}

As we can see, the body contains nested Loads, which is why EvalSet fails. But the inner Loads, such as rois[((i*5) + 3)], don’t use any variable of the tensorized scope (namely, rw.inner and c.inner), so within the tensorized scope each of them has a fixed value, i.e. its domain range is a single_point.
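To make the invariance argument concrete, here is a minimal sketch in plain Python. It does not use TVM at all: the Var, Const, BinOp and Load classes and the helper names free_vars and is_invariant are hypothetical stand-ins for the real IR nodes, and the index expressions are simplified. It only shows that a load is invariant in the tensorized scope when its index mentions none of the scope’s loop variables.

```python
from dataclasses import dataclass


# Toy expression nodes; hypothetical stand-ins, not TVM IR classes.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Load:
    buffer: str
    index: object


def free_vars(e):
    """Collect the names of all variables an expression depends on."""
    if isinstance(e, Var):
        return {e.name}
    if isinstance(e, Const):
        return set()
    if isinstance(e, BinOp):
        return free_vars(e.lhs) | free_vars(e.rhs)
    if isinstance(e, Load):
        return free_vars(e.index)
    raise TypeError(f"unsupported node: {e!r}")


def is_invariant(e, scope_vars):
    """True if the expression uses no loop variable of the given scope."""
    return free_vars(e).isdisjoint(scope_vars)


# Loop variables of the tensorized scope, as named in the post.
scope_vars = {"rw.inner", "c.inner"}

# rois[(i*5) + 3]: depends only on the outer variable i.
roi_load = Load("rois", BinOp("+", BinOp("*", Var("i"), Const(5)), Const(3)))

# Simplified data[...] load: a nested rois load plus c.inner.
data_load = Load("data", BinOp("+", roi_load, Var("c.inner")))

print(is_invariant(roi_load, scope_vars))   # the inner load is invariant
print(is_invariant(data_load, scope_vars))  # the outer load is not
```

In this toy model the inner load could be treated as a single point, while the outer load still needs a real region.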

My ideas

  1. Modify EvalSet so that it can handle loads (Call nodes) whose range is a single_point. I don’t know if this is reasonable.
  2. In InferTensorizeRegion, detect the invariant nested loads in the tensorized scope and replace them with new variables whose domain ranges are single_points. Those variables can be assigned outside the tensorized scope.
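The second idea can be sketched as a small rewrite pass. This is plain Python on a toy expression tree, not TVM code: Var, Add, Load and the function hoist_invariant_loads are hypothetical names used only for illustration, and a real pass would have to work on the actual IR nodes. The sketch replaces every load that is invariant in the tensorized scope with a fresh variable and records the binding so it could be assigned outside the scope.

```python
from dataclasses import dataclass


# Toy expression nodes; hypothetical stand-ins, not TVM IR classes.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Load:
    buffer: str
    index: object


def free_vars(e):
    """Collect the names of all variables an expression depends on."""
    if isinstance(e, Var):
        return {e.name}
    if isinstance(e, Add):
        return free_vars(e.lhs) | free_vars(e.rhs)
    if isinstance(e, Load):
        return free_vars(e.index)
    raise TypeError(f"unsupported node: {e!r}")


def hoist_invariant_loads(e, scope_vars, bindings):
    """Replace scope-invariant loads with fresh variables.

    bindings maps each hoisted load to the variable that replaces it,
    so identical loads share one variable.
    """
    if isinstance(e, Load):
        if free_vars(e).isdisjoint(scope_vars):
            if e not in bindings:
                bindings[e] = Var(f"inv{len(bindings)}")
            return bindings[e]
        return Load(e.buffer, hoist_invariant_loads(e.index, scope_vars, bindings))
    if isinstance(e, Add):
        return Add(
            hoist_invariant_loads(e.lhs, scope_vars, bindings),
            hoist_invariant_loads(e.rhs, scope_vars, bindings),
        )
    return e  # a plain variable stays as it is


scope_vars = {"rw.inner", "c.inner"}
roi = Load("rois", Var("i"))  # invariant: uses only the outer variable i
body = Load("data", Add(Add(roi, Var("c.inner")), roi))

bindings = {}
rewritten = hoist_invariant_loads(body, scope_vars, bindings)
print(rewritten)
print(bindings)
```

After the rewrite, the remaining data load has an index built only from variables, which is the form a range analysis can evaluate; whether this maps cleanly onto InferTensorizeRegion is exactly the open question.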

Thanks for any ideas!