How to merge consecutive ops/schedules?

For example, if one were to create a conv2d op followed by a max pool op, what is the best way to optimize and combine the schedules for these two ops?

Optimizing each schedule is easy, as you can just call the relevant scheduler from topi. However, after each op is scheduled, what is the canonical way to merge these ops? I assume the process is similar to how a graph is imported and scheduled, but I’m not sure where in the source code that happens.

In general, this process can be arbitrarily difficult, since fusing operators must not violate the data dependencies between intermediate results. For well-defined cases, like fusing batchnorm and relu (elementwise operators) into a convolution, this is done automatically at the graph level. The TVM IR provides primitives like compute_inline, which roll elementwise computation at the output into the current operator.
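As a minimal sketch of what compute_inline does (the shapes, names, and op choices here are just illustrative, using the low-level tvm.placeholder/tvm.compute API), a bias-add stage is folded into the loop nest of the relu that consumes it, so no intermediate buffer is ever materialized:

    import tvm
    import topi

    n = 1024
    A = tvm.placeholder((n,), name='A')
    bias = tvm.placeholder((1,), name='bias')
    B = tvm.compute((n,), lambda i: A[i] + bias[0], name='B')  # elementwise bias add
    C = topi.nn.relu(B)                                        # elementwise relu

    s = tvm.create_schedule(C.op)
    s[B].compute_inline()  # fold B into its consumer C's loop body
    # B no longer appears as a separate stage in the lowered IR
    print(tvm.lower(s, [A, bias, C], simple_mode=True))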

Fusing injective/elementwise ops makes sense to me. It seems my question wasn’t very clear, so I’ve attached some code below. I’m trying to schedule a conv and a pool op consecutively. At the moment I have a separate schedule for each; how do I combine the two into one so that I can pass it to tvm.lower?

    # (assumes tvm and topi are imported, and that data, kernel, bias,
    #  stride, padding, layout and dtype are defined earlier)

    # conv + bias + relu
    output = topi.nn.conv2d(data, kernel, strides=stride, padding=padding, dilation=1, layout=layout)
    output = topi.add(output, bias)
    output = topi.nn.relu(output)
    s_c = topi.generic.nn.schedule_conv2d_nhwc(output)

    # max pooling, computed from a fresh placeholder standing in for the conv output
    conv_out = tvm.placeholder(output.shape, dtype=dtype, name='conv_out')
    pool_out = topi.nn.pool(conv_out, (2, 2), (1, 1), (0, 0, 0, 0), 'max')
    s_p = topi.generic.nn.schedule_pool(pool_out, layout='NHWC')

I’m currently prototyping for a new HW backend, so I want to be able to schedule and lower an arbitrary set of ops before moving on to running an entire graph. I’m operating under the assumption that importing a graph from ONNX/MXNet/etc. follows a similar process to the one outlined here: (1) optimize and schedule individual ops like conv and pool + elementwise, and (2) combine those optimized ops into a unified schedule for runtime.

This assumption requires some method of traversing the individually optimized ops/schedules to combine them into a final, runtime schedule. How would I go about manually creating that final schedule? Or are these assumptions incorrect?

What is the difference between “an entire graph” and an “arbitrary set of ops”? A graph can contain as few operators as you like. Why is combining two schedules together a strict requirement for hardware prototyping?

Note that in general fusing operators with more complicated producer-consumer dependencies will come at the cost of increased synchronization and/or hardware support for synchronization.

Thanks for your quick responses, eqy. To clarify, when I say “running an entire graph”, I mean the full process of importing a trained model that may contain a variety of ops. This is as opposed to quickly instantiating those individual ops for testing purposes.

In essence, I’d like to be able to simulate parts of what would be present in an end-to-end model. For example, a conv+bias+relu then average pool. I have been successful at scheduling and optimizing these separately, but would like to know the canonical way in which these separate ops are combined into one schedule.

Combining two arbitrary schedules allows for a more granular investigation of how operators interact. When crashes, poor performance, or incorrect results arise, fine-grained control over which combinations of ops/schedules are tested makes it easier to narrow down where and why these behaviors occur.

You can just write a small program in Relay and run the normal fuser, provided you have platform-specific schedules on hand. When you stick multiple compute stages together in TVM, you will need to pick a master schedule for the combined TVM-level sequence of operators. We do this for VTA, for example (VTA has its own schedules that allow for fusion, etc.).
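As a rough sketch (the shapes, the llvm target, and the exact build API may differ slightly across TVM versions), a small Relay program expressing conv2d + bias_add + relu + max_pool2d can simply be built; relay.build runs the standard fusion pass and dispatches to whatever platform-specific TOPI schedules are registered for the target:

    import tvm.relay as relay

    data = relay.var('data', shape=(1, 64, 56, 56))
    weight = relay.var('weight', shape=(64, 64, 3, 3))
    bias = relay.var('bias', shape=(64,))

    y = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
    y = relay.nn.bias_add(y, bias)
    y = relay.nn.relu(y)
    y = relay.nn.max_pool2d(y, pool_size=(2, 2), strides=(2, 2))
    func = relay.Function([data, weight, bias], y)

    mod = relay.Module.from_expr(func)  # tvm.IRModule.from_expr in newer releases
    with relay.build_config(opt_level=3):  # opt_level >= 1 enables the FuseOps pass
        graph, lib, params = relay.build(mod, target='llvm')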

Oh I see, so instead of using TOPI directly for ops and schedules, I would let Relay do the heavy lifting behind the scenes. Is the tutorial at this link using the normal fuser that you’re referring to?