Operator fusion for training


#1

Hi, everyone.
Suppose operators A and B can be fused into one operator in the forward pass of a neural network, e.g. CONV+BN, a pair of element-wise operations, etc.

Now, when considering the backward pass of network training, it is possible that dA and dB (dA and dB denote the gradients of A and B, respectively) can no longer be fused. The computation of dB usually relies on the output of A; therefore, once A and B are fused, the output of A is not available.
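To make the dependency concrete, here is a toy plain-Python sketch (not framework code, all names are illustrative): A is a simple affine op and B is ReLU; the fused kernel never materializes A's output, yet B's backward needs exactly that value.

```python
# Toy illustration: A is an affine op, B is ReLU. A fused kernel computes
# B(A(x)) in one pass and never materializes a = A(x).
def fused_forward(x, w):
    return [max(xi * w, 0.0) for xi in x]      # relu(x * w); a is not stored

# The backward of B (ReLU) needs A's output: gradient flows only where a > 0.
def relu_backward(grad_out, a):
    return [g if ai > 0.0 else 0.0 for g, ai in zip(grad_out, a)]

x = [-1.0, 2.0, 3.0]
w = 2.0
y = fused_forward(x, w)                  # [0.0, 4.0, 6.0]
# To run relu_backward we must recompute a = A(x), which is exactly
# the problem described above:
a = [xi * w for xi in x]
da = relu_backward([1.0, 1.0, 1.0], a)   # [0.0, 1.0, 1.0]
```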

Is there any solution for operator fusion during training?
Any comment is welcome.


#2

The strategy depends on the compiler’s algorithm. As far as I know, the TVM fuser should notice that A is used by both B and dB and decide not to fuse. One could imagine a more sophisticated (or scheduler-driven) algorithm which duplicates A into A1 and A2 nodes, then fuses A1 with B for inference and re-computes A2 for backprop.
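For illustration, the duplication idea could look like this on a toy dataflow graph (a hypothetical dict-based node representation, not TVM’s actual IR):

```python
# Each node maps to the list of nodes it consumes. A has two users,
# B and dB, so fusing A into B would hide A's output from dB.
graph = {
    "B":  ["A"],        # B consumes A's output
    "dB": ["A", "dy"],  # dB also consumes A's output
}

def duplicate_for_fusion(graph, node, fuse_user, other_user):
    """Split `node` into two copies so that `fuse_user` gets an
    exclusive producer (safe to fuse) and `other_user` keeps its own
    copy, to be re-computed in the backward pass."""
    g = {k: list(v) for k, v in graph.items()}
    g[fuse_user] = [node + "1" if d == node else d for d in g[fuse_user]]
    g[other_user] = [node + "2" if d == node else d for d in g[other_user]]
    return g

g2 = duplicate_for_fusion(graph, "A", "B", "dB")
# Now "A1" has a single consumer (B) and can be fused into it,
# while "A2" is re-computed for backprop.
```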


#3

TVM doesn’t support fusion during training, so you’ll have to set opt_level=-1. There is some opportunity for fusing the gradient (i.e. backward) computations, but that’s not yet implemented. Out of curiosity, how are you dealing with actually getting gradients? It’s still the case that many interesting operators (e.g., pooling, conv2d) don’t have gradients registered.
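For reference, disabling fusion at build time would look roughly like this with the NNVM compiler of that era (a non-runnable sketch; `net`, `target`, and `shape_dict` are placeholders for a real model):

```python
import nnvm.compiler

# Sketch: build with opt_level=-1 so the fusion pass (registered at a
# higher optimization level) never runs. `net`, `target`, and
# `shape_dict` are placeholders, not defined here.
with nnvm.compiler.build_config(opt_level=-1):
    graph, lib, params = nnvm.compiler.build(net, target, shape=shape_dict)
```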


#4

How about auto-generating gradient for fused ops? Is that going to solve this problem?

Somebody is already working on auto-generating gradient ops from TVM compute definitions (see here). Seems like a good use case to me.
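To illustrate the idea, here is a toy symbolic differentiator over a miniature tuple-encoded expression IR (purely hypothetical, not TVM’s API): the gradient is generated mechanically from the compute definition itself.

```python
# Expressions are tuples: ("var", name), ("const", v),
# ("add", a, b), ("mul", a, b).
def grad(expr, wrt):
    """Differentiate an expression w.r.t. the variable `wrt`."""
    kind = expr[0]
    if kind == "var":
        return ("const", 1.0) if expr[1] == wrt else ("const", 0.0)
    if kind == "const":
        return ("const", 0.0)
    if kind == "add":
        return ("add", grad(expr[1], wrt), grad(expr[2], wrt))
    if kind == "mul":                      # product rule
        return ("add",
                ("mul", grad(expr[1], wrt), expr[2]),
                ("mul", expr[1], grad(expr[2], wrt)))
    raise ValueError(kind)

def evaluate(expr, env):
    kind = expr[0]
    if kind == "var":
        return env[expr[1]]
    if kind == "const":
        return expr[1]
    a, b = evaluate(expr[1], env), evaluate(expr[2], env)
    return a + b if kind == "add" else a * b

# d(x*x + x)/dx at x=3 is 2*3 + 1 = 7
e = ("add", ("mul", ("var", "x"), ("var", "x")), ("var", "x"))
evaluate(grad(e, "x"), {"x": 3.0})   # 7.0
```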


#5

Yes, I have considered the re-computation strategy for backprop, but it is expensive because most backward ops have connections to their forward ops.
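A toy cost model makes the expense visible (uniform per-op cost is an assumption, purely for illustration): with stored activations the backward pass reads each forward output once, while full re-computation re-runs every prefix of the chain.

```python
# Toy cost model: a chain of n forward ops where each backward op needs
# the output of its corresponding forward op. Uniform per-op cost is an
# assumption for illustration only.
def backward_extra_cost(n, recompute):
    if not recompute:
        return n                        # each stored output is read once
    # with re-computation, the backward of op i must re-run ops 1..i
    return sum(i for i in range(1, n + 1))

backward_extra_cost(4, recompute=False)   # 4  (linear)
backward_extra_cost(4, recompute=True)    # 10 (quadratic growth)
```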


#6

TVM, and DL frameworks such as MXNet and TF, focus on op fusion for the inference process. When considering training on mobile devices, we should care about the computation cost of the backward pass, and training would benefit greatly from fusing gradient computations. I think this is a good opportunity.

We compute the gradients of pooling and conv2d explicitly.
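For example, an explicit max-pooling gradient can be written by routing the upstream gradient to each window’s argmax; below is a minimal 2x2 plain-Python sketch of the idea (illustrative only, not the actual kernels):

```python
# Explicit 2x2 max-pooling gradient: the upstream gradient is routed to
# the argmax position of each pooling window.
def maxpool2x2_forward(x):
    out = []
    for i in range(0, len(x), 2):
        row = []
        for j in range(0, len(x[0]), 2):
            row.append(max(x[i][j], x[i][j+1], x[i+1][j], x[i+1][j+1]))
        out.append(row)
    return out

def maxpool2x2_backward(x, grad_out):
    grad = [[0.0] * len(x[0]) for _ in x]
    for i in range(0, len(x), 2):
        for j in range(0, len(x[0]), 2):
            # find the argmax within the 2x2 window (ties break by index)
            window = [(x[i+di][j+dj], i+di, j+dj)
                      for di in (0, 1) for dj in (0, 1)]
            _, mi, mj = max(window)
            grad[mi][mj] = grad_out[i // 2][j // 2]
    return grad

x = [[1.0, 2.0, 0.0, 1.0],
     [3.0, 0.0, 1.0, 2.0]]
y = maxpool2x2_forward(x)                   # [[3.0, 2.0]]
dx = maxpool2x2_backward(x, [[1.0, 1.0]])
# dx routes the gradient to the positions of 3.0 and 2.0:
# [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
```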


#7

Out of curiosity, how are you dealing with actually getting gradients?

Somebody is already working on auto-generating gradient ops from TVM compute definition

@masahi is correct, we are working on computing gradients automatically. We expect it will have an interface to override default gradients with hand-optimized versions. Right now, we do not plan to alter the default fuser behavior and will likely see sub-optimal performance. So we are worried about it too.

May I suggest an idea: implement a scheduler-like interface for the fuser, so we can ask it to always fuse specific nodes? I assume that the fuser could always fuse some operations at the cost of duplicating other operations. Please correct me if I am wrong.

Below is an illustration of the possible fuser interface:

  x = tvm.placeholder((batch_size, img_h, img_w, img_c), name='x')
  y = tvm.placeholder((batch_size, num_classes), name='y')
  w1 = tvm.placeholder((3, 3, img_c, f1_c), name='w1')
  b1 = tvm.placeholder((f1_c,), name='b1')
  c1 = topi.nn.conv2d(x, w1, 1, 0, layout='NHWC', out_dtype=tvm.float32)
  c2 = c1 + topi.broadcast_to(b1, (batch_size, 1, 1, f1_c))

  # ... rest of the model (defining w2..b4 and the logits t) and finally a loss ...

  l = - topi.sum(y * topi.nn.log_softmax(t)) / batch_size
  ones = topi.full_like(l, 1.0)
  params = [w1, b1, w2, b2, w3, b3, w4, b4]

  # Computing gradients with AD looks like this, see RFC #1996
  dl = list(tvm.ir_pass.JacobianRecursive(l, params, ones))

  s = tvm.create_schedule([p.op for p in [x, y, l] + params + dl])

  # Ask the fuser to force-fuse nodes c1 (a convolution) and c2
  s[c1, c2].force_fuse()

#8

Operator fusion in NNVM and Relay happens before we call anything in topi. So by the time registered topi compute or schedule definitions are invoked, operator fusion has already been done. I’m not sure how your interface is going to work.


#9

Auto-generating gradients for fused ops is more attractive, but I think it may not work for all fused ops. Does Relay support this method now?


#10

No, Relay and tensor-level autodiff are orthogonal developments. Tensor-level autodiff is being developed outside of the Relay team, at UW. It looks very promising and I am looking forward to seeing its progress.