[RFC] Implement "add_to" semantic in TVM


#1

Recently we implemented infrastructure for implementing MXNet operators with TVM ( https://github.com/apache/incubator-mxnet/pull/15550 ). One missing feature is the “add_to” semantics for backward operations: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/op_attr_types.h#L57

We have two possible approaches:

  1. Bind output and input to the same buffer. For example, for
    C = compute(A, B)
    we introduce another input placeholder C_temp, define
    C = C_temp + compute(A, B)
    and bind C and C_temp to the same buffer (see the sketch after this list)
    Pros: no need to change TVM
    Cons:
  • Hard to understand
  • Open question: can it describe all cases?
  2. C = tvm.compute(…, mode=“add_to”)
    In codegen, we replace C = ... with C += ...
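
For concreteness, here is a rough, untested sketch of what approach 1 could look like in the TVM Python API (the names C_temp and C_buf are illustrative, and compute(A, B) is taken to be an elementwise A + B):

    import tvm

    n = tvm.var("n")
    A = tvm.placeholder((n,), name="A")
    B = tvm.placeholder((n,), name="B")
    C_temp = tvm.placeholder((n,), name="C_temp")  # extra input aliasing the output

    # C = C_temp + compute(A, B), with compute(A, B) being elementwise A + B here
    C = tvm.compute((n,), lambda i: C_temp[i] + (A[i] + B[i]), name="C")
    s = tvm.create_schedule(C.op)

    # Bind C and C_temp to the same buffer so the kernel accumulates into the
    # existing output instead of overwriting it.
    buf = tvm.decl_buffer((n,), C.dtype, name="C_buf")
    print(tvm.lower(s, [A, B, C], binds={C: buf, C_temp: buf}, simple_mode=True))

Note that TVM normally marks kernel buffer arguments as non-aliasing (__restrict__), so that assumption would have to be relaxed for this binding to be safe; this is the concern about __restrict__ raised in #2 below.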

What do you think? @reminisce @junrushao1994 @tqchen


#2

While admittedly add_to is used in frameworks of the previous generation, let’s discuss this: do we really need it? Relay’s gradient pass just produces gradients without mutation and it looks perfectly fine, so in which cases do we have to rely on mutation?

Another thing we could think about: is approach 1 always correct? If so, let’s simply do approach 1 instead (assuming we have turned off the __restrict__ pointer-aliasing annotation).


#3

Do we need this kind of operator for the optimizer, or do we have better alternatives?


#4

Yep, let’s find some alternatives :slight_smile:


#5

I agree that compiler techniques can be used to optimize “add”, and in the long term MXNet can adopt such optimizations.

But let’s focus on how to support the current use case. It totally makes sense that, for the reasons above, we’d like to use option 1, though I’m wondering whether it has any problems such as what @junrushao1994 mentioned.


#6

If I understand correctly, add_to(a, b) increments a with b. During this process, the value of a will be changed.
The second approach is, imho, a RED FLAG idea that I don't think we should do.
If add_to is implemented as above, it will greatly complicate Operator Fusion, Gradient, Partial Evaluation, FoldScaleAxis, etc. Basically every existing pass will be in great danger.
The problem is that mutation introduces two issues: in optimization, you don't know whether the tensor has been changed somewhere else, and you don't know whether changing the tensor will affect the computation in other places (because of aliasing).
In particular, see the section on Invalidation and Aliasing in "Automatic differentiation in PyTorch": they implement a delicate data structure (which is hard to do as a source code transformation in Relay, which does not pollute the runtime) and reject some cases. Right now, we have no such problem because we don't have mutation for tensors.
As another example, the effectiveness of ConstantFolding will tank and the pass will be basically useless, because it is hard to know whether some constant is really constant or whether it got changed, so it will either do nothing or risk being wrong. The same story goes for Fusion.
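
To make the hazard concrete, here is a tiny illustration in plain NumPy (not TVM; add_to, x, and y are just illustrative names): an in-place add_to silently changes every binding that aliases its first argument, which is exactly what makes it hard for a pass like ConstantFolding to trust what it sees.

    import numpy as np

    def add_to(a, b):
        a += b        # in-place update: the value of `a` is changed
        return a

    x = np.zeros(3)
    y = x             # alias: y and x share the same storage
    add_to(x, np.ones(3))

    # A pass that folded `y` to the "constant" zeros it saw at definition time
    # would now be wrong: y has become [1. 1. 1.]
    print(y)
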
Tensors should continue to have no mutation, and approach 1 looks way better.
If approach 1 cannot describe all cases, we can still wrap every tensor as a reference to a tensor and unwrap/wrap accordingly in every operator (see the sketch below). This will tank the effectiveness of most passes, and some will probably downright fail, but that is what would happen to every single program if we allowed mutation in tensors.
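
For what that fallback could look like, here is a rough, untested sketch using Relay's reference expressions (RefCreate/RefRead/RefWrite); the variable names and the choice to express add_to(a, b) this way are illustrative, not a settled design:

    import tvm
    from tvm import relay

    a = relay.var("a", shape=(3,), dtype="float32")
    b = relay.var("b", shape=(3,), dtype="float32")

    r = relay.RefCreate(a)                                     # wrap the tensor in a mutable cell
    write = relay.RefWrite(r, relay.add(relay.RefRead(r), b))  # r := r + b, i.e. add_to(a, b)
    body = relay.Let(relay.var("unused"), write, relay.RefRead(r))
    f = relay.Function([a, b], body)
    print(f)
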


#7

I’m wondering whether it has any problems such as what @junrushao1994 mentioned.

If there are unknown aliases or unknown add_to calls in other places of the code, it cannot be modeled as option 1. Let's hope that doesn't happen.