How to schedule fused ops?


#1

For example, I have a conv followed by a relu, and they get fused in the graph, so only the conv’s schedule func is run, and in that schedule func I can find the conv op by name. But what if I want to schedule the relu as well? How do I know it isn’t some other op that follows? Do I have to copy the relu schedule here?


#2

You can still use op.name to find the relu: https://github.com/dmlc/tvm/blob/6a4d71ff40915611bd42b62994992b879e6be610/topi/python/topi/cuda/conv2d.py#L151
outs[0].op is the last op after fusion.
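For illustration, a minimal sketch of that traversal pattern in the style of the linked topi code (old-style tvm/topi APIs; `schedule_my_conv` and `_schedule_conv` are placeholder names):

```python
import tvm
from topi import tag

def schedule_my_conv(outs):
    s = tvm.create_schedule([x.op for x in outs])

    def traverse(op):
        # Fused elemwise ops (relu etc.) get inlined into their consumer.
        if tag.is_broadcast(op.tag):
            if op not in s.outputs:
                s[op].compute_inline()
            for tensor in op.input_tensors:
                if tensor.op.input_tensors:
                    traverse(tensor.op)
        # Find the conv stage by its tag instead of relying on outs[0].
        elif op.tag == 'conv2d_nchw':
            _schedule_conv(s, op.output(0))

    traverse(outs[0].op)  # outs[0].op is the last op after fusion
    return s
```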


#3

Thank you for the reply!
I know that method, but does it mean I need to write a schedule branch for every op that could possibly be fused, copying each of their schedules here? Is there another solution?


#4

What do you mean by scheduling relu? Relu is an elemwise op, which will be fused into the loop body of conv2d.
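As a toy illustration of what “fused into the loop body” means (1-D stand-ins for conv and relu, old-style API):

```python
import tvm

n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] + 1, name="B")  # stand-in for conv
R = tvm.compute((n,), lambda i: tvm.max(B[i], tvm.const(0, B.dtype)),
                name="relu")

s = tvm.create_schedule(R.op)
s[B].compute_at(s[R], R.op.axis[0])  # B and relu share one loop body
print(tvm.lower(s, [A, R], simple_mode=True))
```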


#5

In fact, I need to do tensorize in my schedule, so the intrin func will be different for relu, prelu, and so on. Relu is just a simple example; what if a complex op that needs a complex schedule is fused?


#6

We currently inline all fused elemwise ops. If you need to tensorize a schedule that contains elemwise ops, I’m afraid you have to handle each of them manually (perform the computation of the elemwise ops inside your intrin func).
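For example, a hedged sketch of a tensor intrinsic that bakes the relu into the microkernel (old-style API; `vadd_relu` is a hypothetical external function you would provide):

```python
import tvm

def intrin_vadd_relu(n):
    # Declares a tensor intrinsic computing relu(a + b) over n elements.
    a = tvm.placeholder((n,), name="a")
    b = tvm.placeholder((n,), name="b")
    c = tvm.compute((n,), lambda i: tvm.max(a[i] + b[i], tvm.const(0, a.dtype)),
                    name="c")

    def intrin_func(ins, outs):
        ib = tvm.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        # The external microkernel performs the add *and* the relu, so the
        # fused elemwise op is handled inside the intrinsic.
        ib.emit(tvm.call_extern("int32", "vadd_relu",
                                cc.access_ptr("w"),
                                aa.access_ptr("r"),
                                bb.access_ptr("r"), n))
        return ib.get()

    with tvm.build_config(offset_factor=1):
        return tvm.decl_tensor_intrin(c.op, intrin_func)
```

You would then call s[stage].tensorize(axis, intrin_vadd_relu(n)) on the matching stage; prelu etc. would each need their own variant, which is the manual handling mentioned above.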


#7

Got it, thank you!
May I ask why TVM only fuses the compute and not the schedule? If a complex op is fused, does this mean its schedule has to be abandoned, and we have to rewrite a new schedule (or traverse_inline) there?


#8

I think it is difficult to define the expected result of fusing schedules. Take conv2d on CUDA for example: after scheduling conv2d, you can fuse any elemwise ops into it, because conv2d has a stage that copies from registers to global memory, where the elemwise computation can be performed. However, we can’t directly schedule conv2d+relu unless you have tensor intrinsics backed by an external implementation.
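A toy sketch of that structure (1-D stand-ins, old-style API): the “conv” results stay in registers, and the relu is applied in the stage that writes them out to global memory.

```python
import tvm

n = 4096
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] * 2, name="B")  # stand-in for conv2d
C = tvm.compute((n,), lambda i: tvm.max(B[i], tvm.const(0, B.dtype)),
                name="relu")

s = tvm.create_schedule(C.op)
s[B].set_scope("local")  # keep "conv" results in registers
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
s[C].bind(tx, tvm.thread_axis("threadIdx.x"))
s[B].compute_at(s[C], tx)  # relu happens during the register->global copy
print(tvm.lower(s, [A, C], simple_mode=True))
```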