Store only rule for tensorize intrinsic?

@egy,

Now we have: Body, Zero, Update

  • Body (mandatory): performs the computation, first initializing the accumulators to zero.
  • Zero (can be None): only initializes the accumulators to zero, no computation.
  • Update (can be None): performs the computation without any init (accumulate only).

Cases:

  • If Zero=None, a Body() followed by Update() calls will be issued.
  • If Update=None, only Body() is used everywhere.
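The two cases above can be modeled with a small Python sketch (a toy illustration of which intrinsic calls get issued per output tile, not actual TVM code; the behavior when all three rules are present is my assumption of Zero followed by Updates):

```python
def issue_intrinsics(has_zero, has_update, num_reduce_steps):
    """Toy model of which tensorize intrinsic is issued at each
    reduction step for one output tile."""
    calls = []
    for step in range(num_reduce_steps):
        if not has_update:
            # Update=None: only Body() is used everywhere.
            calls.append("Body")
        elif not has_zero:
            # Zero=None: Body() inits and computes the first step,
            # Update() accumulates the rest.
            calls.append("Body" if step == 0 else "Update")
        else:
            # All three present (assumed): Zero() once, then Update() each step.
            if step == 0:
                calls.append("Zero")
            calls.append("Update")
    return calls
```

For example, with Zero=None and three reduction steps this yields `["Body", "Update", "Update"]`.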

See also: Update rule for tensorize

Question:

May I implement a separate Store() (as an optional fourth rule) in a PR, plus corresponding test cases?

Imagine hardware that needs a separate Store step as the final stage, moving results from hidden accumulators to the final memory destination.

I think it would be useful for many hardware targets; at the moment I need it for tensorization in MARLANN.

I'm afraid I don't understand what Store() would be used for.

Here is the brief workflow for CUDA / Tensor Core GEMM or Conv2D:

  1. Load data from global memory to shared memory
  2. Load from shared memory to local memory (registers)
  3. Do computation and cache the result in register
  4. Write back from registers to global memory (or shared memory)

In this case, we need a special instruction for step 4 when using Tensor Cores. We can simply tensorize the copy step to do it. (Please see the tutorial for details: https://tvm.apache.org/docs/tutorials/optimize/opt_conv_tensorcore.html)
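The four steps above can be illustrated with a plain-Python tiled matmul (a toy model: nested lists stand in for global memory, the locals `a`, `b`, and `acc` stand in for shared-memory and register tiles; no TVM or CUDA involved):

```python
def tiled_matmul(A, B, tile=2):
    """Toy model of the 4-step Tensor Core pipeline on square matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Step 3 analogue: accumulate into a "register" tile.
            acc = [[0] * tile for _ in range(tile)]
            for k0 in range(0, n, tile):
                # Steps 1-2 analogue: stage tiles of A and B into locals.
                a = [[A[i0 + i][k0 + k] for k in range(tile)] for i in range(tile)]
                b = [[B[k0 + k][j0 + j] for j in range(tile)] for k in range(tile)]
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            acc[i][j] += a[i][k] * b[k][j]
            # Step 4 analogue: write back from the accumulator tile to "global" C.
            # On Tensor Cores this copy is itself a special (tensorized) instruction.
            for i in range(tile):
                for j in range(tile):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```

The point of the sketch is that step 4 is an ordinary copy stage in the loop nest, which is why it can be tensorized like any other stage rather than needing a dedicated Store rule.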

I’m not sure about the use case in MARLANN, and I cannot imagine what the Store step looks like. It would be great if you could provide more information or an example. Thank you!


@Hzfengsy,

  • Attaching cache_write() to schedule does the trick.
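For readers landing here later, the pattern from the linked Tensor Core tutorial looks roughly like this (pseudocode; the scope string and intrinsic name are illustrative, not verbatim):

```
# pseudocode: TVM schedule sketch, not runnable as-is
CF = s.cache_write(C, "wmma.accumulator")      # keep the GEMM result in accumulator scope
# ... schedule and tensorize the compute stage CF ...
# the write-back (accumulator -> global) is now its own copy stage C,
# so it can be tensorized with a store intrinsic:
s[C].tensorize(inner_axis, intrin_store_matrix())
```

In other words, cache_write() exposes the write-back as a separate copy stage, so no dedicated Store rule is needed in the intrinsic declaration.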

Thank you for pointing me to the right place !

@Hzfengsy If we use tensorized store copy instructions in step 4, does that mean we can’t fuse GEMM with other injective ops?

By the way, I don’t want to use shared memory as intermediate storage for the result, since that limits the amount of shared memory available elsewhere (and thus limits compute intensity).