Tensorizing Pool+Act+conv2d


Hello all,

After reading about TVM and VTA for quite a while, I decided to make my first real attempt at using them.
I seem to have overestimated my knowledge and am now stuck on a couple of things,
so I would appreciate the help :slight_smile:

Background Info

I am trying to do something similar to the VTA examples. So assume that we have an accelerator for CNNs (FYI: the accelerator doesn't exist as anything more than a concept). Like in the VTA example (and most of the TVM literature), we want to fuse the conv2ds and the activations (which are, up to now, only ReLUs). One thing we want to do differently is to also fuse popular pooling patterns (right now only 2x2 'max').

The TVM Schedule

Now, I know TVM's operator fusion will not do this fusion, and therefore I have written by hand one example of a TVM task in which conv2d, ReLU, and pooling are fused in the pattern that we expect.
I found it very easy to just compute_at() at the right levels to match our expected program flow.

from __future__ import absolute_import, print_function
import os
import tvm
import topi
#Description of Layer parameters
batch_size = 1
in_channel = 16
in_height = 8
in_width = 8
out_channel = 32
kernel_h = 3
kernel_w = 3
pad_h = 1
pad_w = 1
stride_h = 1
stride_w = 1
pool_h = 2
pool_w = 2
pool_type = 'max'
ofm_h_shape = (in_height + 2 * pad_h - kernel_h) // stride_h + 1
ofm_w_shape = (in_width + 2 * pad_w - kernel_w) // stride_w + 1
#Placeholders for IFM and Weights
ifm_t = tvm.placeholder((batch_size,in_channel,in_height,in_width), name='ifm_t')
kernel_t = tvm.placeholder((in_channel,kernel_h,kernel_w), name='kernel_t')
#Conv2d Op
dy = tvm.reduce_axis((0,kernel_h), name='dy')
dx = tvm.reduce_axis((0,kernel_w), name='dx')
ic = tvm.reduce_axis((0,in_channel), name='ic')
ofm_t = tvm.compute((batch_size,out_channel,ofm_h_shape,ofm_w_shape),
                    lambda b,co, h, w: tvm.sum(
                    ifm_t[b,ic, h*stride_h+dy, w*stride_w+dx] *
                    kernel_t[ic, dy, dx],
                    axis=[ic, dy, dx]), name='ofm_t')
#Relu Op
act_t = topi.nn.relu(ofm_t)
#Pooling Op
pool_t = topi.nn.pool(act_t, [pool_h, pool_w], stride=[pool_h, pool_w], padding=[0, 0, 0, 0], pool_type=pool_type)
#Scheduling Rule
sch = tvm.create_schedule(pool_t.op)
act_t_axis = sch[act_t].op.axis
pool_t_axis = sch[pool_t].op.axis
sch[ofm_t].compute_at(sch[act_t], act_t_axis[-1])
sch[act_t].compute_at(sch[pool_t], pool_t_axis[-1])
print(tvm.lower(sch,[ifm_t, kernel_t, pool_t],simple_mode=True))

The Output

// attr [tensor] storage_scope = "global"
allocate tensor[float32 * 1 * 32 * 4 * 4]
// attr [ofm_t] storage_scope = "global"
allocate ofm_t[float32 * 1 * 1 * 1 * 1]
produce tensor {
  for (ax1, 0, 32) { /*Output Channel dimension*/
    for (ax2, 0, 4) { /*Output Height AFTER pooling*/
      for (ax3, 0, 4) { /*Output Width AFTER pooling*/
        produce compute {
          for (i2, 0, 2) { /*Intermediate Height, which should be equal to the pooling kernel*/
            /*Question 5: BEGIN of tensorize()*/
            for (i3, 0, 2) { /*Intermediate Width, which should be equal to the pooling kernel*/
              produce ofm_t {
                ofm_t[0] = 0.000000f
                for (ic, 0, 16) { /*Input Channel*/
                  /*Question 4: BEGIN of tensorize()*/
                  for (dy, 0, 3) { /*Conv2D's Kernel Height*/
                    for (dx, 0, 3) { /*Conv2D's Kernel Width*/
                      ofm_t[0] = (ofm_t[0] + (ifm_t[((((((((ax2*8) + ax3) + (i2*4))*2) + i3) + (ic*64)) + (dy*8)) + dx)]*kernel_t[((((ic*3) + dy)*3) + dx)])) /*Convolution*/
                    }
                  }
                  /*Question 4: END of tensorize()*/
                }
              }
              compute[(((((((ax1*4) + ax2)*8) + ax3) + (i2*4))*2) + i3)] = max(ofm_t[0], 0.000000f) /*Relu*/
            }
          }
        }
        tensor[((((ax1*4) + ax2)*4) + ax3)] = -340282346638528859811704183484516925440.000000f /*Initialization for max pool*/
        for (rv, 0, 2) { /*Pooling Height Kernel*/
          for (rv, 0, 2) { /*Pooling Width Kernel*/
            tensor[((((ax1*4) + ax2)*4) + ax3)] = max(tensor[((((ax1*4) + ax2)*4) + ax3)], compute[(((((((ax1*4) + ax2)*8) + ax3) + (rv*4))*2) + rv)]) /*Max Pool*/
          }
        }
        /*Question 5: END of tensorize()*/
      }
    }
  }
}

The Questions

  1. How can we control the names of the tensors produced by calling topi.nn operators?
    I was somewhat confused by the automatic names some of the operators generate.
    In the snippet, compute[] holds the output of the activation and I would like it to be called act_t[], and tensor[] holds the output of the pooling and I would like it to be called pool_t[].

  2. Why is the array compute[] missing from the top part of the pseudo-code where all the other arrays are being allocated?
    I sort of understand that ifm_t and kernel_t are just placeholders and don't create allocations (is this correct?), but compute[] should be a tensor and should require a prior allocation.

  3. Also, I think compute[] should be at most of size (1,1,2,2), but that doesn't seem to be the case here… am I wrong?
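To make this concrete, here is a quick sanity check in plain Python (the index expression and loop extents are copied from the lowered output above; the helper name compute_index is mine) of how many elements the indexing into compute[] can actually touch:

```python
# Index expression used for compute[] in the lowered IR (the ReLU line),
# with the loop extents from the printed schedule:
#   ax1 in [0, 32), ax2 in [0, 4), ax3 in [0, 4), i2 in [0, 2), i3 in [0, 2)
def compute_index(ax1, ax2, ax3, i2, i3):
    return ((((ax1 * 4 + ax2) * 8 + ax3) + i2 * 4) * 2) + i3

# Largest index reachable with those extents:
max_index = compute_index(ax1=31, ax2=3, ax3=3, i2=1, i3=1)
print(max_index + 1)  # 2048 elements, far more than the 4 of a (1,1,2,2) buffer
```

So unless I am misreading the IR, the indexing spans the full un-fused intermediate, which is why I suspect something is off.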

(For the following questions I did some first trials and failed at tensorizing, but since it's my first time I just want to know if it's possible and get some hints rather than the full solution.)

  4. If I wanted to tensorize everything between /*Question 4: BEGIN*/ and /*Question 4: END*/,
    would I have to create an operation which slices ifm_t so that it has a size of (1,1,3,3), and also slices kernel_t to a size of (1,3,3), or is this not necessary?
    Also notice that the initialization ofm_t[0] = 0.000000f is outside of the "tensorization region". How should this be handled? (i.e. I don't want it inside the tensor intrinsic.)
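For reference, here is a plain-Python sketch of what I believe the Question 4 region computes per input channel (the names mac_3x3, ifm_patch, and kernel_slice are mine, not TVM's): a 3x3 multiply-accumulate whose initialization happens outside.

```python
def mac_3x3(acc, ifm_patch, kernel_slice):
    """One step of the Question 4 region: accumulate a 3x3 dot product.
    The ofm_t[0] = 0 initialization stays outside, as in the lowered IR."""
    for dy in range(3):
        for dx in range(3):
            acc += ifm_patch[dy][dx] * kernel_slice[dy][dx]
    return acc

# Example: an identity-like kernel picks out the patch diagonal.
patch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kern = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
acc = 0.0  # initialization performed outside the would-be intrinsic
acc = mac_3x3(acc, patch, kern)
print(acc)  # 15.0
```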

  5. Is it possible to tensorize the code between /*Question 5: BEGIN*/ and /*Question 5: END*/? I ask because most of the tensorize examples I remember seeing have only one computing rule, which is no longer the case here.
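As a plain-Python reference of the post-convolution part of the Question 5 region (two computing rules, ReLU and the max-pool reduction, collapsed into one loop for illustration; the function and variable names are mine):

```python
def relu_maxpool_2x2(conv_window):
    """Apply ReLU to each of the 2x2 intermediate conv outputs, then
    reduce them with max, as the Question 5 region does."""
    result = float('-inf')  # the -3.4e38 initialization in the IR
    for rv0 in range(2):
        for rv1 in range(2):
            relu_val = max(conv_window[rv0][rv1], 0.0)  # ReLU rule
            result = max(result, relu_val)              # max-pool rule
    return result

print(relu_maxpool_2x2([[-1.0, 0.5], [2.0, -3.0]]))  # 2.0
```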

  6. Also, what are the known limitations of tensorize?
    More specifically, are there examples of cases where tensorize cannot be used and direct manipulation of the AST is necessary? (I guess tensorize is already an AST manipulation; what I mean is the developer having to design the AST manipulation manually.)