Loop partitioning and Tensorization - Work on different IR levels

Designs like VTA, TPU, and many mobile NN accelerators lack control-flow support in hardware, e.g. the handling of if conditions. It therefore becomes the responsibility of the compiler to resolve these 'if' conditions at compile time and produce sizable chunks that the hardware can work on. These chunks can then be mapped to the hardware ISA using TVM's tensorize feature.


A simple motivating example:

Input variables - A[60], B[60]
Output variable - C[60]
Computation - C[i] = A[i] + B[i]
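
For reference, a minimal sketch of this compute in the old (pre-0.6) TVM Python API; printing the scheduled statement (before storage flattening) gives roughly the IR below:

import tvm

n = 60
A = tvm.placeholder((n,), name='A')
B = tvm.placeholder((n,), name='B')
C = tvm.compute((n,), lambda i: A[i] + B[i])  # default name is 'compute'
s = tvm.create_schedule(C.op)

# print the scheduled statement before storage flattening
bounds = tvm.schedule.InferBound(s)
stmt = tvm.schedule.ScheduleOps(s, bounds)
print(stmt)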

realize compute([0, 60]) {
  produce compute {
    for (i, 0, 60) {
      compute(i) =(A(i) + B(i))
    }
  }
}

Suppose the hardware can only work on 16 elements at a time. We can use the split primitive and get the following AST:
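
Continuing the sketch above, the split is a single schedule call (16 being the number of elements the hardware can handle at a time); re-printing the scheduled statement then gives roughly the AST below:

xo, xi = s[C].split(C.op.axis[0], factor=16)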

// attr [compute(compute, 0x1873900)] realize_scope = ""
realize compute([0, 60]) {
  produce compute {
    for (i.outer, 0, 4) {
      for (i.inner, 0, 16) {
        if (likely(((i.inner + (i.outer*16)) < 60))) {
          if (likely(((i.inner + (i.outer*16)) < (60 - 0)))) {
            compute((i.inner + (i.outer*16))) =(A((i.inner + (i.outer*16))) + B((i.inner + (i.outer*16))))
          }
        }
      }
    }
  }
}

Finally, we need to perform loop partitioning, resolving these 'likely' if conditions at compile time. The final output looks like this:

// attr [compute(compute, 0x1873900)] realize_scope = ""
realize compute([0, 60]) {
  produce compute {
    for (i.outer, 0, 3) {
      for (i.inner, 0, 16) {
        compute(((i.outer*16) + i.inner)) =(A(((i.outer*16) + i.inner)) + B(((i.outer*16) + i.inner)))
      }
    }
    for (i.inner, 0, 12) {
      compute((48 + i.inner)) =(A((48 + i.inner)) + B((48 + i.inner)))
    }
  }
}
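
For completeness, continuing the same sketch, the standard lowering pipeline can perform this partitioning; a sketch using the partition_const_loop option of build_config (the same flag discussed further down in this thread). The printed result is in the flattened buffer form, but the loop structure matches the dump above:

with tvm.build_config(partition_const_loop=True):
    print(tvm.lower(s, [A, B, C], simple_mode=True))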

Current support - There is a LoopPartition pass that supports this kind of partitioning to some extent, and we have tensorization to map a sub-compute loop nest to a HW instruction.

Problem - Here, we need to perform loop partitioning first and then tensorization. This is a big problem, because by the time loop partitioning runs, the IR has already been lowered to the AST, whereas tensorization works at a much higher IR level - the schedule data structures.

@tqchen @ziheng @thierry Do you have any ideas on how to solve this issue? I suspect VTA might have already encountered this type of problem.


We should have another codegen pass after loop partition.
The way we solve this issue is:

  1. Use a new tensorize API, which we name emit_insn, to replace tensorize in the schedule; it just puts a pragma at the head of the loops.
    Using tensorize looks like this:

     vec_intrin = cce.cce_intrin.intrin_factor()
     intrin_vlog_t = vec_intrin("elewise_single_intrin_cce")
     intrin_vmuls_t = vec_intrin("elewise_single_intrin_cce")
     intrin_vexp_t = vec_intrin("elewise_single_intrin_cce")
    
     s[vlog_tL].tensorize(s[vlog_tL].op.axis[1], intrin_vlog_t(((factor,)), "elewise_single_log", dtype, dtype, None))
     s[vmuls_tL].tensorize(s[vmuls_tL].op.axis[1], intrin_vmuls_t(((factor,)), "elewise_single_vs_mul", dtype, dtype, [power_num]))
     s[vexp_tL].tensorize(s[vexp_tL].op.axis[1], intrin_vexp_t(((factor,)), "elewise_single_exp", dtype, dtype, None))
    

Using emit_insn looks like this:

    s[vlog_tL].emit_insn(s[vlog_tL].op.axis[1], "vlog")
    s[vmuls_tL].emit_insn(s[vmuls_tL].op.axis[1], "vmuls")
    s[vexp_tL].emit_insn(s[vexp_tL].op.axis[1], "vexp")
  2. Peel the loops into a body and a tail; the loops in both of them carry the pragma.
  3. The emit insn pass will call the callback you've defined for the pragma. This pass gives the callback the loop IR, in which you have all the information needed for tensorization.
  4. In the predefined callbacks, we check the input IR and transform it into the instructions we want.

The emit insn pass looks like this:

class EmitInsns : public IRMutator {
 public:
  Stmt Mutate_(const AttrStmt* op, const Stmt& s) final {
    if(attr::IsPragmaKey(op->attr_key)
       && op->attr_key == "pragma_emit_insn"){
      std::string str = op->value.as<StringImm>()->value;
      const runtime::PackedFunc* f = runtime::Registry::Get("tvm.intrin.cce."+str);
       // if pattern exists, call f to emit insn
      if (f != nullptr) {
        Stmt r = (*f)(op->body);
        CHECK(r.defined()) << "intrinsic rule must always return valid Expr";
        if (!r.same_as(s)) { return this->Mutate(r); }
      } else {
        LOG(FATAL) << "No such intrinsic rule: " << "tvm.intrin.cce." + str;
      }
    }
    return IRMutator::Mutate_(op, s);
  }
};

And the predefined hooks look like this:

@tvm.register_func("tvm.intrin.cce.exp")
def elewise_unary_exp(op):
    """
        :param op: the stmt of for with exp
        :return: the intric stmt what we want
    """
    return vec_single_elewise(op, "vexp")
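
vec_single_elewise itself is not shown in the post; purely as an illustration, a heavily simplified, hypothetical version of such a callback could look like the sketch below (the extern call, the single-For assumption, and the names are all made up):

import tvm

def vec_single_elewise(stmt, intrin_name):
    # hypothetical sketch: expect a single elementwise For loop
    assert isinstance(stmt, tvm.stmt.For)
    # a real callback would also walk stmt.body to recover the
    # source/destination buffer accesses and their addresses
    call = tvm.call_extern("int32", intrin_name, stmt.extent)
    return tvm.make.Evaluate(call)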

The next step would be to eliminate the emit_insn call in the schedule for most operators like + - * / and for calls like exp and log, since patterns as simple as these can be recognized in the pass itself.

@xqdan So in your emit_insn approach, you don't match the (loop-partitioned) Stmt against a pre-defined tensorization compute node; instead you construct the tensorization given the Stmt, right?

Yes, that's true. You don't need to define the compute again.

Thanks @xqdan, just curious: is the reason you didn't do a sort of "auto-tensorization" at the Stmt level (basically matching two Stmts and replacing one with the other) that it is too complicated, or, even worse, not doable at all?

Glad to see the interesting discussion here. This is somewhat related to our previous discussions on this issue. In particular, there are a few things we can do:

    1. run tensorize + loop partition to split into scalar + tensor code
    2. pad the input so that the compute matches the tensor layout perfectly
    3. do a high-level transformation to enable that

We do have auto-tensorization code now, so it's doable, but for some complicated ops like unsorted_segment_max we do need the tensor op, since it's difficult to parse the information from the low-level IR; the tensor op is more straightforward.

BTW, auto-tensorization here means eliminating the emit_insn call in the schedule for most operators like + - * / or calls like exp and log.

If "auto-tensorization" here means the idea proposed by @derisavi-huawei in https://github.com/dmlc/tvm/issues/695, then IMO we need to make sure tensorize can work for all ops first, and then we can make it automatic. However, for some complicated ops tensorize has matching issues, and it's hard to have a complete fix for this; see https://github.com/dmlc/tvm/pull/1053

But @derisavi-huawei's proposal works at the Schedule/Stage level, while after loop partitioning the program is in the lower-level AST IR - as @janimesh described.

Correct me if I didn’t understand correctly.

Yes, this is another reason we came up with the low-level tensorize idea to solve the issue @janimesh mentioned.
@yzhliu

This makes sense. Thanks!

There is a pass named partitionLoops in Halide; why does TVM not use this pass, but instead write a new loop partition pass?

I am sorry to revive this topic, but:
I also have a problem like the original one from @janimesh and do not really understand the solution posted by @xqdan.

Conceptually my problem is the same as the OP's.
I have a copy operation (basically to move data from DRAM into an accelerator's SRAM):

taskDataIn = tvm.placeholder((in_b,in_c,in_h,in_w), name='taskDataInput')
bufDataFM = tvm.compute(taskDataIn.shape, lambda *i:taskDataIn(*i), name="featureMapBuf")

Now, because of the tiling I require and my decision of where to compute bufDataFM, the generated code has an if (likely(...)) statement:

produce featureMapBuf {
      for (i2, 0, 62) {
        for (i3, 0, 34) {
          if (likely(((i3.outer*28) < (66 - i3)))) {
            featureMapBuf[((i2*34) + i3)] = taskDataInput[(((i3.outer*28) + (i2*66)) + i3)]
          }
        }
      }
}

Trying to tensorize it using something like this

def intrin_data(n,c,h,w):
    srcData = tvm.placeholder((n,c,h,w),name='srcData')
    destData = tvm.compute(srcData.shape, lambda n_,c_,h_,w_: srcData[n_][c_][h_][w_] , name='destData')

    srcBuf = tvm.decl_buffer(srcData.shape, name='bufSrc', offset_factor=1)
    destBuf = tvm.decl_buffer(destData.shape, name= 'bufDest', offset_factor=1)

    def intrin_func(ins,outs):
        #code omitted
    with tvm.build_config(offset_factor=1):
        return tvm.decl_tensor_intrin(destData.op, intrin_func, binds={srcData: srcBuf, destData: destBuf} )

Throws the error:
Tensorize failed, split condition likely(((i3 + (i3.outer*28)) < 66)) relies on var defined inside tensorize scope

This is the kind of error the OP would have gotten.
So, according to the OP, something like this would be possible if LoopPartition ran before tensorization, which it currently does not.

The workaround proposed was basically:

  1. Create a pragma where you would have called tensorize (i.e. at the right level of the corresponding operator).
  2. Add an IR pass right before code generation, which detects these pragmas and peels the loops, so that each peeled loop again carries the pragma.
  3. (I don't really understand this point.)
  4. "Do" (in quotes because we already made it happen by construction) the pattern matching and collapse the loop structure into the call.

My questions are:

  • Are there any IR routines to support loop peeling in the TVM IR pass catalogue?
  • Could someone explain point 3. in some more detail?
  • Is the above workaround equivalent/comparable to what the VTA does?
    On that last question what I mean is the following:
    While building the schedules for conv2d in VTA, they annotate the DMA copies with a corresponding env.dma_copy pragma. In the IR pass list, they inject a couple of custom passes, one of which is the inject_dma_intrin pass, which transforms code blocks like this
produce data_buf {
          // attr [iter_var(i0, )] pragma_dma_copy = 1
  for (i2, 0, 9) {
    for (i3, 0, 16) {
      for (i5, 0, 16) {
        data_buf[((((i2*16) + i3)*16) + i5)] = tvm_if_then_else((((((1 - i2) <= (i2.outer*7)) && ((i2.outer*7) < (15 - i2))) && (1 <= i3)) && (i3 < 15)), data[((((((((i2.outer + (ic.outer*2))*7) + i2)*14) + i3)*16) + i5) + -240)], (int8)0)
        data_buf[(((((i2*16) + i3)*16) + i5) + 2304)] = tvm_if_then_else((((((1 - i2) <= (i2.outer*7)) && ((i2.outer*7) < (15 - i2))) && (1 <= i3)) && (i3 < 15)), data[((((((((i2.outer + (ic.outer*2))*7) + i2)*14) + i3)*16) + i5) + -240)], (int8)0)
      }
    }
  }
}

into this

produce data_buf {
    VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), data, ((((i2.outer + (ic.outer*2))*7) - min((i2.outer*7), 1))*14), 14, ((min((i2.outer*7), 1) - max(((i2.outer*7) + -6), 0)) + 8), 14, 1, (1 - min((i2.outer*7), 1)), 1, max(((i2.outer*7) + -6), 0), 144, 2)
}

Obviously in this VTA example they don't do loop peeling, because VTALoadBuffer2D is not a "real" VTA instruction.
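
For reference, the annotation on the VTA side looks roughly like this in its schedules and tutorials (simplified sketch; vta.get_env() provides the scope and pragma keys):

env = vta.get_env()
# stage the data into the accelerator's input buffer and tag the copy
# loops so the inject_dma_intrin pass can rewrite them into VTALoadBuffer2D
s[data_buf].set_scope(env.inp_scope)
s[data_buf].pragma(s[data_buf].op.axis[0], env.dma_copy)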

@aca88
I ran into this problem too. Unfortunately, inject_dma_intrin also fails to convert the code into an accelerator instruction if the code contains an if (likely(...)) statement. In my case, the if (likely(...)) statement was introduced by padding.

@yzhliu should be able to answer this.

  3. The emit insn pass will call the callback you've defined for the pragma …

In tvm/build_module.py you call stmt = ir_pass.EmitInsns(stmt). When it executes Stmt r = (*f)(op->body);, f is the registered function elewise_unary_exp, so in elewise_unary_exp you can work with the statement directly - it is an AST and you can traverse the if-else conditions.

BTW, ir_pass.LoopPartition allows you to eliminate the if-conditions in the loop, if that is what you were looking for.
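
To make that concrete, a sketch in the style of the old loop-partition tests (assuming the 0.x ir_pass API; the second argument asks the pass to also partition constant-extent loops):

# s: your schedule, e.g. the one from the examples earlier in the thread
bounds = tvm.schedule.InferBound(s)
stmt = tvm.schedule.ScheduleOps(s, bounds)
stmt = tvm.ir_pass.LoopPartition(stmt, True)  # True: also partition const loops
stmt = tvm.ir_pass.Simplify(stmt)
print(stmt)  # the likely(...) conditions are gone; body and tail loops remain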

Thanks for sharing your experience. To be honest, I find it kind of odd that it does not work in your case.
Looking at the kind of code TVM generates for the DMA copy in the VTA example I posted, there is a tvm_if_then_else statement which actually deals with padding. In my experience with the out-of-the-box ResNet example for VTA, I never had a problem with the tvm_if_then_else.
Also looking at the VTA Hardware Guide

“In addition, the load module can insert 2D padding on the fly, which is useful when blocking 2D convolution.”

It would be helpful if you could post more information about your specific example.

@yzhliu Thanks for explaining step 3 in more detail. I understand at a high level what you mean, but I would like to ask 2 more questions.

As you said

But looking at the code from xqdan, the if (!r.same_as(s)) { return this->Mutate(r); } still puzzles me.

  1. If r is the new sub-AST produced by elewise_unary_exp (which I guess would only be a function call), how can it evaluate as same_as the original AST? (Or does same_as only check the class type and not the equivalence of the ASTs?)
  2. Would you mind explaining a little how this IR pass works and what its limitations are?
    (Example: I changed the configuration in tvm/build_module.py to have "partition_const_loop": True and I set tvm.lower(..., simple_mode=False) when I lower the functions. This does not generate partitioned loops in my example.)
    EDIT: Apparently changing the config at the line of code I linked was not the right way. I used a with tvm.build_config(...) before my tvm.lower(...) and at least the loop partitioning seemed to work. I do still get if(true) statements, but these get deleted by ir_pass.Simplify(). I would still be interested in knowing what the limitations are in general, though.

@aca88 A quick question about the same issue: do you mean that after you enable loop partitioning with build_config, tensorization works without throwing the previous error "split condition relies on var defined inside tensorize scope"? I am asking because I tried adding the loop partitioning: without tensorization it lowers into partitioned loops, but with tensorization it still throws the same error.

Please allow me to bring up the question again, because I have exactly the same problem as @janimesh: tensorize does not work on loops that carry a split condition.
After going through the discussion, I am a bit confused.
Is the low-level tensorize idea posted by @xqdan a proposal, or is it ready to use?
If it is the latter, I wonder if we could have an example of how to use it.
Let's take an example from the tensorize tutorial:

N, M, L = 1024, 512, 64
A = tvm.placeholder((N, L), name='A')
B = tvm.placeholder((M, L), name='B')
k = tvm.reduce_axis((0, L), name='k')
C = tvm.compute((N, M), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k), name='C')
s = tvm.create_schedule(C.op)
factor = 10
x, y = C.op.axis
z, = C.op.reduce_axis
yo, yi = s[C].split(y, factor=factor)
s[C].reorder(x, yo, yi, z)
def intrin_gemv(m, l):
    ...  # omitted
gemv = intrin_gemv(factor, L)
s[C].tensorize(yi, gemv)
print(tvm.lower(s, [A, B, C]))

With this split factor, the generated loop will have a tail condition,
and I still want to tensorize yi.

With the proposed approach, how do I handle this case,
so as to avoid "Tensorize failed, split condition relies on var defined inside tensorize scope"?

Please let me know if possible. @janimesh @xqdan @yzhliu

Thanks,
-Wenlei