Automatic differentiation on the level of TVM IR


#1

Hello,

Currently in TVM/NNVM automatic differentiation is implemented only at the NNVM level and it requires manually defining gradients for every operation. If there was automatic differentiation on the TVM level then the gradients could be derived automatically from the FTVMCompute function.

Some competitors already have this feature, e.g. PlaidML, and recently Halide.

We have implemented a proof-of-concept automatic differentiation for TVM. It performs reverse mode differentiation of a Tensor with respect to some other given Tensor. Currently our implementation is very slow because it lacks certain important optimizations consisting in transforming the domain of iteration to avoid summing over lots of zeros. We are considering to implement these optimizations as a separate IR pass.

The proof-of-concept implementation is on GitHub. The main source file is here. Some usage examples can be found in this test.

So, what do you think?


#2

I’m interested. I’ve been also looking at recent development on Halide around autodiff system and I think it is very cool. If Halide can do it, so can TVM since both share the same IR.

As for TVM, they are developing a new IR for NNVM called Relay. Apparently a new autodiff system will be built on top of it (see this comment). I’m curious to know what are the pros and cons of implementing autodiff on HalideIR or Relay.


#3

The automatic differentiation of Relay is very similar to NNVM’s. It is defined over the Relay language and requires primitive operators (i.e those not defined in Relay and exposed from TOPI) to have gradients registered for them.

We could compose automatic differentiation over the TVM expression language with Relay’s AD. One extension we discussed is embedding TVM programs inside of Relay programs. We could imagine then having a unified AD framework which does AD over the embedded programs and then uses the gradients when performing AD over Relay programs to obtain a gradient over the entire function.

I think it would be very convenient if we can obtain a closure property over the TVM expression language. It would make writing new operators much more flexible across the stack. There are a couple people who have taken different stabs at this before, we should try to compare and contrast with existing approaches.


#4

Hi,
I like the idea of computing gradient directly from compute.

I had looked at the source code, and it look like that your code might provide quadratic blowup - suppose we want to calculate d(x * (x * (x * x)))/dx, it will expand into x * d(x * (x * x))/dx + dx/dx * (x * (x * x)), and left hand side will continue to expand.

I suggest implementing forward mode automatic differentiation for cases other then reduce, or use a wengert list and do pure reverse mode automatic differentiation.

For the efficiency concern, I had ran into the same problem when I was work on differentiating of operators for relay. Initially I want to reduce everything to compute and do them one and for all, but we give up and instead define gradient for each operators manually, as optimizations turn out to be hard - even if we can get the zeros away, I have no idea how to recover a good scheduling.

However, I do see great values of your works in rapid prototyping. I would love to follow up on the subsequent optimizations. I recommend reading Efficient Differentiable Programming in a Functional Array-Processing Language for more idea. (although they cannot compile to GPU from what I see)

EDIT: it seems like IRMutator share identical value so there are no quadratic blowup, can you guys write a test case of a giant x * x * x… to make sure?


#5

@MarisaKirisame Have you looked at the Halide autodiff paper? They have an automatic gpu scheduler for auto-generated backward ops.


#6

I just finished reading it. The number seems really good and I am impressed.


#7

Good discussions in here. At a high level, it is great to have tensor expression level AD. On the other hand, I think we will need the tensor level AD as well and we will primarily use tensor level AD while having expression level AD for certain custom ops.

The main reason is that there are high level information in the tensor graph level that is not necessarily easily exposed in the expression level. For example, we can use winograd to perform any of the conv2d computations, while the tensor expression may only corresponds to the direct algorithm.

Things might change as expression level AD get more mature, in the meanwhile, using this as complementary to tensor graph AD is a good level


#8

@MarisaKirisame This example indeed leads to quadratic blowup, thanks for pointing out. It definitely affects the build phase (compile time grows quadratically), but it may get optimized in some later passes (may be even in llvm), because the execution time seems to grow linearly. I think that currently tvm::ir::IRMutator doesn’t preserve implicit sharing of subexpressions in general case, and thus we observe quadratic behavior. I believe that making IRMutator and other passes respect implicit sharing should fix the problem. Another solution is to modify the automatic differentiation pass to make sharing explicit using let-expressions.