[Tensor Expression] Add hardware-specific schedules

Currently, many AI chip companies define large ops, such as conv, and build deep learning algorithms on top of them. Do you think it would be better to add schedules to decide whether large operators should be separated into loops? This would avoid the complication of tensorization and also narrow the scheduling space, by manually disabling the separation of some of the large operators. TVM may also need schedules that help split a large conv into smaller ones by tiling according to the local memory restrictions.

Can you elaborate a bit more using examples?

Yeah. As far as I know, many AI chip companies do it this way. Some companies write code to detect TVM's for loops and fold them back into a convolution, and finally call their AI intrinsic. A very interesting approach.

In Shanghai there are more than 1000 AI chip companies. The competition is brutal, so using a large operator like Conv2D makes sense. Hahahaha… :rofl:

@tqchen, here is a simple example. I am from one of the largest IT companies in China, and I found that almost every large AI chip company provides conv instructions in some way. I think it is necessary to at least give users the option not to expand some of the operators into loops.

E.g.
conv with image size (2048x2048), kernel size (3x3), padding (1)
= padding; 4 convs with image size (1026x1026), kernel size (3x3)
= padding; load1; conv1; store1; load2; conv2; store2; load3; conv3; store3; load4; conv4; store4
= padding; load1; load2; conv1; load3; conv2; store1; load4; conv3; store2; conv4; store3; store4

For an ASIC with conv instructions, I think this is a simple and good-enough schedule.
(Sometimes padding may be implemented within the load instructions to avoid useless loads/stores.)
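
Written as a TVM schedule, the four-tile decomposition would look roughly like the toy sketch below (single channel only, padding assumed to be done upstream, and "local" just standing in for the accelerator's on-chip memory); the cache stages play the roles of the load_i / store_i above:

    import tvm
    from tvm import te

    # toy sketch, not real accelerator code: single-channel 2048x2048 conv,
    # 3x3 kernel, padding done upstream, output tiled into four 1024x1024
    # blocks so that each block reads a 1026x1026 input tile
    H = W = 2048
    Apad = te.placeholder((H + 2, W + 2), name="Apad")
    K = te.placeholder((3, 3), name="K")
    rh = te.reduce_axis((0, 3), name="rh")
    rw = te.reduce_axis((0, 3), name="rw")
    C = te.compute(
        (H, W),
        lambda i, j: te.sum(Apad[i + rh, j + rw] * K[rh, rw], axis=[rh, rw]),
        name="C",
    )

    s = te.create_schedule(C.op)
    CC = s.cache_write(C, "local")                  # "local" stands in for on-chip memory
    ho, wo, hi, wi = s[C].tile(C.op.axis[0], C.op.axis[1], 1024, 1024)
    s[CC].compute_at(s[C], wo)                      # conv_i: one output tile per (ho, wo)
    AA = s.cache_read(Apad, "local", [CC])
    s[AA].compute_at(s[C], wo)                      # load_i: bound inference gives 1026x1026
    # the copy-out of C inside the (ho, wo) loop plays the role of store_i

    print(tvm.lower(s, [Apad, K, C], simple_mode=True))

What this sketch does not give you is the mapping of each inner 1024x1024 conv onto the HW conv instruction; that is the tensorization part discussed below.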

I see. You can override the declaration of conv2d as an extern op and then directly specify how to generate the intrinsic in the low-level IR. I do recommend breaking things down as much as possible so we can benefit from automatic code generation and gain more flexibility.
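
A minimal sketch of that extern route (the packed function name "my_accel.conv2d" and the shapes are made up here; the accelerator runtime would have to register such a function):

    import tvm
    from tvm import te

    # sketch only: shapes are arbitrary and "my_accel.conv2d" is a hypothetical
    # packed function that the accelerator runtime would need to provide
    N, CI, H, W, CO = 1, 64, 2048, 2048, 64
    data = te.placeholder((N, CI, H, W), name="data")
    weight = te.placeholder((CO, CI, 3, 3), name="weight")

    conv = te.extern(
        (N, CO, H, W),
        [data, weight],
        lambda ins, outs: tvm.tir.call_packed(
            "my_accel.conv2d", ins[0], ins[1], outs[0], 3, 3, 1  # kh, kw, pad
        ),
        name="conv2d_extern",
    )

    s = te.create_schedule(conv.op)
    # the body of an extern op is opaque to TVM: no loop-level scheduling and
    # no AutoTVM search inside it, which is the trade-off discussed below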

I agree with @tqchen. Generally speaking, the fact that TVM breaks conv2d into small operations seems to be a pain for ASICs at first glance, but on the other hand it provides opportunities for the compiler to analyze loops, figure out dependencies (think about polyhedral analysis, for example) and generate better "tensorized" instructions. It is just that such a post-analysis & tensorize approach is fairly weak in TVM at the moment.

@tqchen, we also considered extern operations, but then the power of AutoTVM cannot be applied, and in the end we felt we could not benefit from TVM anymore. I suspect that AutoTVM can sometimes find a better solution than a user-defined library, for example when there are sequential conv operators that can be combined (and pipelined) together. On the other hand, if AutoTVM can reach the same performance as the library, it can save a lot of human effort.

@yzhliu, if we clearly define the memory attributes of the intrinsics, I think it will not be a problem for dependency analysis and software pipelining. I agree that this will require some work in the framework.

Hi there,

I was lurking, and I think the question itself is interesting but the discussion was too short (for me). I would love to expand further on your problems (questions 1 and 2 in my post) and to hear the overall opinions of @tqchen and @yzhliu on my comments (specifically, answers to questions 3 and 4).

So some AI accelerator chips have a native conv2d instruction. Your problem is that the TVM operators (and here I really mean TVM not NNVM/Relay) are too fine-grained.

  1. Or put another way, the accelerator's instruction (for conv) has the same level of abstraction as the NNVM/Relay node definition. Does this summarize your problem?

I will refer to your first ordering (= padding; load1; conv1; store1; …; load4; conv4; store4) as Sched1.

In the case of Sched1, the workload (2048x2048) is larger than the largest possible workload the accelerator can handle (I am assuming 1026x1026). This actually contradicts "the level of abstraction is the same as the NNVM/Relay nodes", since NNVM/Relay ops are not constrained!

So the question now becomes: how does YOUR accelerator manage a problem size larger than what a single native instruction can handle?
Now, again you say "Do you guys think it is better to add schedules to decide whether large operators should be separated into loops?"
In this case I think yes, because the "vanilla" 6-dimensional loop nest of a conv2d (WITH the (2048x2048) problem size) needs to be broken into (in this case) 8 loops. BUT the iteration space needs to stay the same in both cases, meaning that those 2 extra loops are derived from (in this case) 2 of the original 6 dimensions (namely H and W).
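
In TVM schedule terms that split looks roughly like the following (purely illustrative; the compute definition, shapes and names are placeholders, not anyone's real code):

    import tvm
    from tvm import te

    # placeholder NCHW conv compute, just to show the split (padding ignored)
    N, CI, CO, H, W = 1, 16, 16, 2048, 2048
    data = te.placeholder((N, CI, H + 2, W + 2), name="data")
    kernel = te.placeholder((CO, CI, 3, 3), name="kernel")
    rc = te.reduce_axis((0, CI), "rc")
    ry = te.reduce_axis((0, 3), "ry")
    rx = te.reduce_axis((0, 3), "rx")
    conv = te.compute(
        (N, CO, H, W),
        lambda n, f, y, x: te.sum(
            data[n, rc, y + ry, x + rx] * kernel[f, rc, ry, rx], axis=[rc, ry, rx]
        ),
        name="conv",
    )

    s = te.create_schedule(conv.op)
    n, f, y, x = s[conv].op.axis
    yo, yi = s[conv].split(y, factor=1024)  # WHICH axis to split and WHICH factor
    xo, xi = s[conv].split(x, factor=1024)  # ...is exactly the missing HW knowledge
    s[conv].reorder(n, f, yo, xo, yi, xi)   # ...and so is the order between yo and xo
    print(tvm.lower(s, [data, kernel, conv], simple_mode=True))  # two extra loops, same iteration space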

  2. Is your difficulty basically that "for coarse-grained instructions with some constraint (here INPUT memory size), it is obvious that given a larger workload the output needs to be a blocked version of the original workload", and that you would expect TVM to recognize this automatically?
    • My problem with this expectation is as follows:
      • Remember I said that the problem went from a 6-dimensional loop to an 8-dimensional loop where 2 of the original dimensions were split.
      • Without any other knowledge:
        • there is no way to select WHICH of the 6 original dimensions are to be split (and with which factor). The only reason I knew it's H and W was because you gave the solution to the problem!
        • there is no way to define the ORDER BETWEEN the 2 new loops (because it's blocking, these 2 loops should be the outermost, but their ordering relative to one another is based on HW information). In your description this is, for example, not specified (I don't know which sub-block conv2 is w.r.t. conv1: is it the block to the left/diagonal/below/right of it?).
      • So people wanting to target accelerators without (and probably with?) LLVM backends have to (at least?) create the scheduling rules which fit the accelerator, which is nothing else than transforming a compute rule into a specific schedule for that compute rule.
        • A very obvious way to see this is by examining your definition of the blocked variant. Your blocked variant does not reuse the overlap regions between blocks (that is why 1026 != 2048/2). IF the HW could reuse the overlap region (at least in one dimension), then you could expect load2 to load an input image (HxW) of size (1026x1024)! (assuming load2 is the block to the right of load1)
        • In other words, even for the "obvious blocking" the fine details are HW-specific!

Then there is your second ordering (= padding; load1; load2; conv1; load3; conv2; store1; …). I will refer to this as Sched2.

In the case of Sched2, you have done a small optimization w.r.t. Sched1.
Again, this optimization is not a generally valid one; it is a HW-specific one.
I think even this would be possible, but only if the scheduling rule is designed that way (again by the people wanting to target TVM to a new backend).
A general solution will require you to have an explicit SW-pipeline way of dispatching the instructions. In other words, you will have some scheduling rule which outputs something like:

Prologue: load1; load2; conv1;
Body:
    for (i = 2; i < 4; i++)
        load(i+1); conv(i); store(i-1);
    end
Epilogue: conv4; store3; store4;

// this is only one way... there is actually one with a smaller prologue and epilogue

But again, this is a HW-specific thing and implicitly requires HW info while developing the scheduling rule.

  • In your specific case, you don't have SuperScalar-like HW but VLIW-like HW, and therefore you require the program stream (the compiler's output) to have these optimizations done explicitly.
  3. Are SW pipelines possible with TVM scheduling primitives (including a combination of them)?

Now the only thing I haven't addressed is the AutoTVM part of your earlier comment.

I am by no means an expert on AutoTVM, so I will respond based on my understanding.
At least for the way of defining Sched1, I can see AutoTVM being usable here.
If you define the splitting factors of the vanilla (6-loop) nest as tunable "knobs" and then tensorize the 6 innermost loops (out of the 8!) with your HW intrinsic, then AutoTVM should be able to pick the splitting factors that are best for a specific workload on your accelerator.
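
As a hedged sketch of what such a template could look like (the template name, the make_conv_schedule helper and conv_intrin are placeholders I am inventing for illustration, not existing TVM code):

    from tvm import autotvm

    @autotvm.template("my_accel/conv2d")               # registered name is made up
    def conv2d_template(H, W):
        # make_conv_schedule is a hypothetical helper that builds the vanilla
        # compute + schedule; conv_intrin would come from te.decl_tensor_intrin
        data, kernel, conv, s, conv_intrin = make_conv_schedule(H, W)

        cfg = autotvm.get_config()
        n, f, y, x = s[conv].op.axis
        cfg.define_split("tile_y", y, num_outputs=2)   # splitting factors as knobs
        cfg.define_split("tile_x", x, num_outputs=2)
        yo, yi = cfg["tile_y"].apply(s, conv, y)
        xo, xi = cfg["tile_x"].apply(s, conv, x)
        s[conv].reorder(n, f, yo, xo, yi, xi)
        s[conv].tensorize(yi, conv_intrin)             # map the inner loops to the HW conv
        return s, [data, kernel, conv]

The hard part in practice is writing the decl_tensor_intrin pattern for the conv instruction itself, which ties back to the tensorization weakness mentioned above.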

Obviously the best blocking factors will be those which:

  • utilize the highest parallelism of the HW (minimizing computation runtime), while
  • creating the least amount of data-movement cycles (minimizing data-shuffling time), while
  • satisfying some other constraints (like generating the least amount of control-code overhead).

The more deterministic each of these things is, the less you need AutoTVM to randomly search the space.
AND
The more deterministic your system is, the less you need AutoTVM at all.

  4. If it is possible to generate a SW pipeline structure, how could the "find best SW pipeline" problem be posed as an AutoTVM knob?

Thanks, and I hope the big wall of text was at least interesting (and hopefully accurate).

@aca88, thanks for your detailed analysis.

@yzhliu, after carefully reading @aca88's comment, I realized that even though tensorization is hard in general for arbitrary schedules, the commonly used hardware acceleration tricks are only a few, and those can be tensorized easily. I think I should not be too entangled with the difficulty of tensorization.

@aca88, I think one way to do this is to add a hardware-specific pass to reuse the loaded data after TVM gives a schedule.

I agree that tensorization should be the general direction for how we handle ASICs, as we did for VTA, and we should improve its features along the way.

The data reuse can happen automatically, as in the case of VTA, by properly scheduling a larger load and a few tensor instructions. See more at https://docs.tvm.ai/vta/tutorials/index.html

@tqchen, choosing a larger tile according to the size of local memory can reuse data automatically. But I guess @aca88 means that some ASICs can reuse data between two consecutive tiles. Does TVM or VTA support reusing data between consecutive tiles somehow?

This is exactly what I meant.
AFAIK VTA does not do this optimization.
I am also unsure if this is a TVM or an accelerator thing.
If I remember correctly from the Halide thesis, Halide works from the outputs towards the inputs to infer which part of the input tensor is necessary to compute all scheduled outputs. So in conv2d, based on some output tensor computation, you get a somewhat larger input tensor region as a "must have". Obviously with tiling this leads to overlaps being necessary. So I think TVM works in a similar way and is therefore unaware of such an optimization (and might even fail to compile? @tqchen).

In the VTA example, there is an added advantage of having the VTA runtime. I usually interpret this as the last chance to do some optimizations which are foreign to TVM.
So maybe this could be the right target for such an optimization, because it takes the TVM code as input (which requires reloading the overlaps) and transforms it into real VTA instructions (which could implement the smaller DMA transfer).
What is missing in the VTA HW is that the buffers should be implemented as ring buffers to really profit from this optimization, because the overlap region of load1 is at the end of the memory allocation, while for conv2 it would be required to be at the beginning.
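
To make the ring-buffer point concrete, here is a toy 1-D (row-wise) version of the addressing, with the tile sizes taken from the example earlier in the thread and the buffer capacity simply assumed:

    # toy sketch: rows only, sizes from the earlier example, BUF_ROWS assumed
    BUF_ROWS = 2048     # on-chip buffer capacity in rows (a multiple of NEW_ROWS)
    NEW_ROWS = 1024     # rows that actually change from one tile to the next
    TILE_ROWS = 1026    # rows a 3x3 conv needs per 1024-row output tile

    def tile_base(i):
        """Start row of tile i inside the ring buffer."""
        return (i * NEW_ROWS) % BUF_ROWS

    # tile i occupies rows [tile_base(i), tile_base(i) + TILE_ROWS); its last two
    # rows sit exactly at tile_base(i + 1), so only NEW_ROWS fresh rows need to
    # be DMA'd per tile instead of TILE_ROWS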

I get what you mean. You want something like sliding-window based fetching. This is something that VTA does not yet take advantage of. However, the benefit of an open-source accelerator and compiler stack is that we can introduce new primitives, and this is something we might be able to look into.

I also want to say that in most conv2d(3x3) workloads, such tile reuse could really be minimal, and negligible if you have a big tile size, so it may not be necessary after all.
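
To put a rough number on "negligible" using the tile sizes from the example earlier in the thread: a 1024x1024 output tile with a 3x3 kernel loads a 1026x1026 input tile, and the halo a neighbouring tile could reuse is only 2 rows (or columns) of it, i.e. roughly

    2 * 1026 / (1026 * 1026) ≈ 0.2%

of that tile's load traffic per shared edge.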

Sliding-window based fetching will be helpful when the kernel size is large (e.g. 7x7) and the local buffer is small; usually it only saves power rather than cycles, since load/store is usually hidden by computation. I think this should be done in a machine-specific optimization layer after the TVM flow.

@tqchen, do you think VTA can be made more general, serving not just as an example but as a machine IR for (almost) all ASIC backends?

I am not sure if we could distill a global machine IR, but we do intend to make VTA a class of accelerators so the TVM stack can be used to target any accelerator people might build. @thierry might have more thoughts on this.

Ring-buffer optimization is common and critical on DSPs (and other VLIW-like ASICs). I hope the TVM framework can support this.