About the tensorization interface


#1

These days I am working on some tensorization stuff, and I found several things that make the current tensorize interface insufficient.

First, the tensorization declaration interface requires a TVM op. Originally I supposed it served the purpose of software emulation (when the underlying hardware has no corresponding intrinsic support, we can use this op in place of the code segment to at least guarantee correctness).

However, after using this interface, I realized that the true purpose of this parameter is to indicate the shapes of the input/output data, and what we compute in the op does not actually matter. This is a little counterintuitive for developers, I suppose. Can we just have an OpaqueOp that only accepts input shapes and output shapes, and does nothing?
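To make the proposal concrete, here is a minimal pure-Python sketch of what a shape-only declaration might carry. This is not TVM code: `OpaqueOp`, its fields, and `matches` are all hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class OpaqueOp:
    """Hypothetical shape-only declaration: it has no compute body at all."""
    input_shapes: Tuple[Shape, ...]
    output_shape: Shape

    def matches(self, input_shapes, output_shape):
        # Matching looks only at shapes, since there is no computation to compare.
        same_inputs = tuple(tuple(s) for s in input_shapes) == self.input_shapes
        return same_inputs and tuple(output_shape) == self.output_shape


# A GEMV-like intrinsic described purely by its operand shapes (assumed sizes).
gemv = OpaqueOp(input_shapes=((16, 4), (4,)), output_shape=(16,))
```

With a declaration like this, the developer states only what the intrinsic consumes and produces, and nothing suggests that the body of the op is meaningful.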

Second, another thing I noticed is that tensorization is essentially a “primitive sugar” or “code transformation sugar” that offloads the IR under a certain loop level. This interface is not aware of whether the loop body is perfectly tiled or not. Thus, this primitive cannot be applied when the loop tiling is imperfect.

Can we work around these two issues?


#2

What do you mean exactly by “imperfect loop tiling”?

On the first issue, tensorization essentially lets us inline high-performance code that implements the inner-loop body of a matrix-matrix or matrix-vector multiplication. This is very useful when targeting special hardware intrinsics, like performing AVX512-based GEMV, invoking an accelerator’s tensor core ISA, or performing neat tricks like bit-serial operations with vectorized popcount on ARM CPUs.


#3

What do you mean exactly by “imperfect loop tiling”?

I guess it means the indivisible case, where the loop extent is not a multiple of the tile size.
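A small pure-Python illustration of the indivisible case (not TVM code; `split_extent` is a hypothetical helper): splitting a loop of extent 10 by a factor of 4 leaves a last tile with only 2 valid elements, so an intrinsic declared for a fixed shape of 4 cannot replace the inner loop body there.

```python
def split_extent(extent, factor):
    """Return the number of valid inner iterations for each outer iteration."""
    outer = (extent + factor - 1) // factor  # ceiling division, as a split produces
    return [min(factor, extent - o * factor) for o in range(outer)]


tiles = split_extent(10, 4)
print(tiles)  # [4, 4, 2]
```

When the extent is divisible (for example 8 split by 4), every tile is full and the fixed-shape intrinsic fits everywhere.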


#4

That’s another problem: AVX512 instructions are mostly 1-D, so they often do not care about the shape. (I hope my assertion is correct.)

The offloaded intrin still requires the shape of a small tensor, which makes the intrin definition ad hoc. Sometimes, as with NCHWxc, it is a cross-dimension op. Sometimes it is a simple 1-D operation. It is hard to find one definition that fits all cases once shape is introduced.


#5

Regarding the imperfect tiling: I think the cause of that problem is that tensorization happens as part of ScheduleOps, before the LoopPartition pass.
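As a rough sketch of why the ordering matters, assume the loop were partitioned first: the full tiles and the tail then become separate loops, and only the full-tile loop needs the intrinsic. This is plain Python for illustration, not how TVM implements LoopPartition; `vadd4` and `LANES` are hypothetical stand-ins for a fixed-width hardware intrinsic.

```python
LANES = 4  # assumed width of the hypothetical intrinsic


def vadd4(a, b, out, base):
    """Stand-in for an intrinsic that always processes exactly LANES elements."""
    for i in range(LANES):
        out[base + i] = a[base + i] + b[base + i]


def add_partitioned(a, b):
    n = len(a)
    out = [0] * n
    full = n - n % LANES  # end of the region covered by complete tiles
    # Main loop: every tile is complete, so the intrinsic applies without a guard.
    for base in range(0, full, LANES):
        vadd4(a, b, out, base)
    # Tail loop: the leftover elements fall back to scalar code.
    for i in range(full, n):
        out[i] = a[i] + b[i]
    return out
```

For an input of length 10, the intrinsic handles elements 0 to 7 and the scalar tail handles elements 8 and 9.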

There has been good discussion about this problem, and the solutions suggested were:

  1. Auto-tensorization
  2. Having a separate pass that runs much later, after ScheduleOps and all the necessary IR transformations.

You can find the discussion here :