[RFC] Implement "add_to" semantic in TVM

Recently we implemented infrastructure for implementing MXNet operators with TVM ( https://github.com/apache/incubator-mxnet/pull/15550 ). One missing feature is the "add_to" semantic for backward operations: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/op_attr_types.h#L57

We have two possible approaches:

  1. Bind the output and an extra input to the same buffer, e.g., for
    C = compute(A, B)
    we introduce another input placeholder C_temp,
    C = C_temp + compute(A, B)
    and bind C and C_temp to the same buffer (see the sketch after this list).
    Pros: no need to change TVM
    Cons:
  • Hard to understand
  • Question: can it describe all cases?
  2. C = tvm.compute(…, mode=“add_to”)
    In codegen, we replace C = ... with C += ...
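For concreteness, here is a rough sketch of approach 1 using the TE API. The elementwise op (A * B), all names, and the trick of passing the same array for C_temp and C at call time are illustrative assumptions, not a final design; as discussed below, it also relies on the generated code not assuming the pointers don't alias (__restrict__).

```python
import numpy as np
import tvm
from tvm import te

# Sketch of approach 1 for a made-up elementwise op: "C += A * B" is written
# as C = C_temp + A * B, where C_temp carries the old value of C.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C_temp = te.placeholder((n,), name="C_temp")  # old value of C
C = te.compute((n,), lambda i: C_temp[i] + A[i] * B[i], name="C")

s = te.create_schedule(C.op)
f = tvm.build(s, [A, B, C_temp, C], target="llvm")

a = tvm.nd.array(np.ones(4, dtype="float32"))
b = tvm.nd.array(np.full(4, 2.0, dtype="float32"))
c = tvm.nd.array(np.full(4, 10.0, dtype="float32"))
# "Bind C and C_temp to the same buffer": here simply by passing the same
# array for both arguments, so the kernel accumulates into c in place.
f(a, b, c, c)
print(c.asnumpy())  # expected [12, 12, 12, 12]
```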

What do you think? @reminisce @junrushao @tqchen

While admittedly add_to is used in frameworks of the previous generation, let's discuss this: do we really need it? Relay's gradient pass just produces gradients without mutation and it looks perfectly fine, so in which cases do we have to rely on mutation?

Another thing we could think about: is approach 1 always correct? If so, let's simply do approach 1 instead (assuming we have turned off the __restrict__ pointer-alias annotation).

Do we need such an operator for the optimizer, or do we have better alternatives?

Yep, let's find some alternatives :slight_smile:

I agree compiler techniques can be used to optimize “add”, and in the long term MXNet can adopt such optimizations.

But let's focus on how to support the current use case. It totally makes sense that, for the reasons above, we'd like to use option 1, but I'm wondering whether it has any problems such as the ones @junrushao mentioned.

If I understand correctly, add_to(a, b) increments a by b. During this process, the value of a is changed.
The second approach is, imho, a RED FLAG idea that I don't think we should pursue.
If add_to is implemented as above, it will greatly complicate Operator Fusion, Gradient, Partial Evaluation, FoldScaleAxis, etc. Basically all existing passes will be in great danger.
The problem is that mutation introduces two problems: in optimization, you don't know whether the tensor has been changed somewhere else, and you don't know whether changing the tensor will affect code in other places (because of aliasing).
In particular, see the section Invalidation and Aliasing in Automatic Differentiation in PyTorch - they implement a delicate data structure (which is hard to do as a source code transformation in Relay, which does not pollute the runtime), and reject some cases. Right now, we have no such problem because we don't have mutation for tensors.
As another example, the performance of ConstantFolding will tank and it will basically be useless, because it is hard to know whether some constant is really constant - or whether it got changed. Thus it will either do nothing or risk being wrong. The same story goes for Fusion.
Tensors should continue to have no mutation, and approach 1 looks way better.
If approach 1 cannot describe all cases, we can still wrap every tensor in a reference to a tensor, and unwrap/wrap accordingly in every operator. This will tank the performance of most passes, and some will probably downright fail, but this is what would happen to every single program if we allowed mutation in tensors.
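To make the Reference idea concrete, here is a rough sketch (purely illustrative, not an official design) using Relay's existing Ref constructs, where the accumulator lives in a Ref cell and add_to becomes an explicit read-modify-write:

```python
from tvm import relay

# acc is a Ref cell holding the accumulator tensor; "add_to(acc, grad)" is
# modelled as: acc := (!acc + grad), then the final value is read back out.
init = relay.var("init", shape=(4,), dtype="float32")
grad = relay.var("grad", shape=(4,), dtype="float32")

acc = relay.var("acc")   # Ref[Tensor[(4,), float32]]
tmp = relay.var("tmp")   # unit result of the RefWrite

body = relay.Let(
    acc, relay.RefCreate(init),
    relay.Let(
        tmp, relay.RefWrite(acc, relay.add(relay.RefRead(acc), grad)),
        relay.RefRead(acc)))

func = relay.Function([init, grad], body)
print(func)
```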

I'm wondering whether it has any problems such as the ones @junrushao mentioned.

If there are unknown aliases / unknown add_to calls in other places in the code, it cannot be modeled as option 1. Let's hope that doesn't happen.

After some offline discussion with Yizhi, I do believe that add_to is somewhat useful in deep learning workloads. At the same time, I feel bad if we directly introduce such a source of mutation. We probably need to find a way out.

Should we implement it at the Relay level?

Directly introducing mutation doesn't seem like a perfect idea. We should think about it.

Just so I get it right, this is what I understand the discussion to be here:

  1. MXNet has in-place mutation ops, like add_to(A, B) (i.e. A += B).
  2. Relay doesn’t allow in-place mutating ops (though I’d assume that backends can do that).
  3. Converting from MXNet to Relay is then difficult, as it requires converting mutating ops to a functional style (see the toy sketch after this list). This is formally undecidable (right?), though probably tractable in practice.
  4. Conversion would be more straightforward if Relay did support mutating ops, but then that moves the complexity from the conversions into all optimizations and backends in Relay, which defeats part of the purpose of an IR like Relay.
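
As a toy illustration of point 3 (with invented names A0/A1), the imperative sequence "A += B; C = A * 2" can be rewritten into pure Relay code where the mutation produces a fresh value and later uses are re-pointed at it:

```python
from tvm import relay

A0 = relay.var("A0", shape=(4,), dtype="float32")
B = relay.var("B", shape=(4,), dtype="float32")

A1 = relay.add(A0, B)                                  # was: A += B
C = relay.multiply(A1, relay.const(2.0, "float32"))    # later uses read A1, not A0
func = relay.Function([A0, B], C)
print(func)
```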

If I understood that correctly (otherwise let me know!): This is a recurring issue in these kinds of systems. It’s the same thing for TF -> HLO, TF -> Relay and other cases. It’s a big problem in a lot of places.

Part of the story here is that deep learning models are typically (always?) natural to express without mutation, so the question is where the mutation in MXNet is coming from? Is it users insisting on inserting that into their models because they like to do that, or is MXNet introducing it unnecessarily or causing users to have to do that unnecessarily? If MXNet is introducing it unnecessarily, could it stop doing that?

(If mutation is unavoidable in the input, I’d also think that Relay as-is still shouldn’t have it, though maybe there could be infrastructure around/above Relay to aid conversions and runtime in dealing with this kind of thing.)


Good discussion. I totally agree that it doesn't make much sense to introduce this kind of mutation.

Part of the story here is that deep learning models are typically (always?) natural to express without mutation

In inference, yes. But in training, people are really fond of modifying the gradient vector in place.

so the question is where the mutation in MXNet is coming from?

When doing backward computation, MXNet continuously adds gradients back to the buffer. Relay's AD doesn't require this.

Is it users insisting on inserting that into their models because they like to do that

Hmm, sometimes people really want to play with a gradient vector that is a contiguous buffer in memory.

If MXNet is introducing it unnecessarily, could it stop doing that?

I am not sure if it would hurt performance. If it doesn’t, I think it makes sense to stop doing the unnecessary thing.

I did a bit more digging and asking around over here.

PyTorch has an op add_(A, B) which is also in-place, so that’s very similar.
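
For reference, a minimal example of the in-place semantics:

```python
import torch

A = torch.ones(3)
B = torch.full((3,), 2.0)
A.add_(B)   # mutates A in place: A is now tensor([3., 3., 3.]); the result aliases A
print(A)
```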

For TF, Jong Wu works on TF -> TVM. He was OK with repeating this quote here: “Currently we just support tf image classification models, we do not see this issue in classification models, and I don’t see it in object detection models (ssd, mask-rcnn, etc.) either. But we did see someone in tvm discuss forum reported unsupported ops like ‘Assign’ Implementing new operators for TensorFlow, usually people reported ‘Assign’ with training ops like RestoreV2, I am not sure if Assign will be removed after freezing model in TF”

Where I’ve seen it come up is in training, matching Junru Shao’s response above. If it just doesn’t happen much in inference, then that’s a good thing.

If the main place this comes up in training is the addition inside the weight update rule (the TF “optimizer”), applying the gradients to the weights, which Junru Shao pointed out above, then that seems like a very structured situation that should be fairly straightforward to pattern match away inside the converter?
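
As a sketch of what "pattern match away inside the converter" could mean here (the learning rate, shapes, and the plain SGD rule are placeholder assumptions), the in-place update "W += -lr * grad" can be emitted as a pure Relay function that returns the new weight, which the runtime then feeds back in as the next step's W:

```python
from tvm import relay

lr = relay.const(0.1, "float32")                      # placeholder learning rate
W = relay.var("W", shape=(10,), dtype="float32")
grad = relay.var("grad", shape=(10,), dtype="float32")

W_new = relay.subtract(W, relay.multiply(lr, grad))   # was: W += -lr * grad
update = relay.Function([W, grad], W_new)
print(update)
```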

(Though I worry whether this is just the starting step in an ever longer set of such patterns. The TF2XLA project had/has a lot of trouble converting TF to HLO, but I think graphdef is substantially harder to deal with than MXNet, so perhaps that won’t be the same thing there.)


Alex Wong and I looked at the PyTorch part. Seems like add_ is supported, but not quite clear how it works. As part of this, we found this other PyTorch file dealing with add_ in a different context, though see the comment there. Not clear if that code is in use.

In the first entry in this thread, it says as option 1 (if I’m understanding it correctly), to have Relay/TVM be able to see an op like A = Add(B, C), where this is the last use of e.g. B, and then decide to place B and A into the same memory buffer and have the implementation of Add be such that it still works when B and A point to the same memory. My question on that is whether this is a new feature that would have to be implemented or whether that is already something that TVM can and will do? (this is how it works in XLA, and it has been working fine there, so this is definitely a viable approach, though pushing this kind of optimization to its limits can get complicated in the face of nested loops, reuse of input and output parameter memory, other control flow etc.)

Alex Wong and I looked at the PyTorch part. Seems like add_ is supported, but not quite clear how it works. As part of this, we found this other PyTorch file dealing with add_ in a different context, though see the comment there. Not clear if that code is in use.

As its comment says, their pass is actually incorrect:

NOTE: this is NOT SAFE, since it assumes that the LHS is not aliased by another value. This is only to avoid breaking ONNX export; when alias analysis is done we can emit a warning if someone tries to export.

PyTorch has lots of problems introduced because it supports add_. In particular, its AD algorithm becomes way more complex because it has to deal with in-place mutation. Right now they only implement a check that bails if an in-place update is used (which discards the old value needed for AD). It will also complicate other passes such as checkpointing. PyTorch is in a weird state where it sort of assumes tensors are immutable in some places, and assumes otherwise in other places. When the two assumptions collide, bad things happen. That is why I think the right way to do this is to use References.


Approach 1 does not mean no changes to TVM at all. If we fuse a reduction into the addition, at the very least we have to add an option to turn off the initialize-reduction-buffer-to-zero procedure.
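
To illustrate the point (a made-up sum reduction; names are arbitrary): TVM initializes a reduction's output buffer to the reduction identity before accumulating, which would clobber the old value of C that the aliasing trick needs to preserve.

```python
import tvm
from tvm import te

n, k = te.var("n"), te.var("k")
A = te.placeholder((n, k), name="A")
r = te.reduce_axis((0, k), name="r")
C = te.compute((n,), lambda i: te.sum(A[i, r], axis=r), name="C")

s = te.create_schedule(C.op)
# The lowered IR contains an init statement like "C[i] = 0f" before the
# accumulation loop; with add_to semantics that init would have to be skipped.
print(tvm.lower(s, [A, C], simple_mode=True))
```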