Out of curiosity, how are you dealing with actually getting gradients?
Somebody is already working on auto-generating gradient ops from TVM compute definitions.
@masahi is correct: we are working on computing gradients automatically. We expect there will be an interface for overriding the default gradients with hand-optimized versions. Right now we do not plan to alter the default fuser behavior, so performance will likely be sub-optimal. We are concerned about that too.
May I suggest implementing a scheduler-like interface to the fuser, so we can ask it to always fuse specific nodes? I assume the fuser could always fuse some operations at the cost of duplicating other operations. Please correct me if I am wrong.
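The duplication trade-off can be seen without any TVM machinery. In this plain-Python sketch, a producer feeding two consumers is either materialized once (unfused) or recomputed inside each consumer (fused/inlined); the results are identical, but the fused form evaluates the producer twice as often.

```python
# Count how often the shared intermediate is computed.
calls = {"producer": 0}

def producer(i):
    calls["producer"] += 1
    return i * i

n = 4

# Unfused: compute the intermediate once, store it, reuse it in both consumers.
calls["producer"] = 0
inter = [producer(i) for i in range(n)]
unfused = ([v + 1 for v in inter], [v * 2 for v in inter])
unfused_calls = calls["producer"]  # n calls

# Fused/inlined: each consumer recomputes the producer, no intermediate stored.
calls["producer"] = 0
fused = ([producer(i) + 1 for i in range(n)],
         [producer(i) * 2 for i in range(n)])
fused_calls = calls["producer"]  # 2 * n calls
```

This is the same trade a schedule makes with `compute_inline`: less memory traffic for the intermediate buffer, in exchange for redundant computation when the producer has multiple consumers.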
Below is an illustration of a possible fuser interface:
import tvm
import topi

# Illustrative shapes (values assumed for the example)
batch_size, img_h, img_w, img_c = 32, 28, 28, 3
f1_c, num_classes = 16, 10

x = tvm.placeholder((batch_size, img_h, img_w, img_c), name='x')
y = tvm.placeholder((batch_size, num_classes), name='y')
w1 = tvm.placeholder((3, 3, img_c, f1_c), name='w1')
b1 = tvm.placeholder((f1_c,), name='b1')
c1 = topi.nn.conv2d(x, w1, 1, 0, layout='NHWC', out_dtype=tvm.float32)
c2 = c1 + topi.broadcast_to(b1, (batch_size, 1, 1, f1_c))
# ... rest of the model (producing logits t) and finally a loss ...
l = -topi.sum(y * topi.nn.log_softmax(t)) / batch_size
ones = topi.full_like(l, 1.0)
params = [w1, b1, w2, b2, w3, b3, w4, b4]
# Computing gradients with AD looks like this, see RFC #1996
dl = list(tvm.ir_pass.JacobianRecursive(l, params, ones))
s = tvm.create_schedule([p.op for p in [x,y,l] + params + dl])
# Ask the fuser to always fuse c1 (the convolution) and c2 (the bias add);
# force_fuse() is the proposed interface, it does not exist today
s[c1, c2].force_fuse()