Relay op strategy

haichen · December 28, 2019, 1:07am

Problem

The ultimate goal of this RFC is to enable TVM to generate multiple kernels for operators with symbolic shapes. Currently, ops like conv2d and dense don’t support codegen for symbolic shapes. But it’s hard for a single kernel to achieve good performance universally under different shapes. Therefore, we need a mechanism in TVM to codegen and dispatch an op to different kernels based on runtime shapes.

Second, the current design of TOPI and autoTVM is overcomplicated which requires much effort to figure out and add new TOPI and autoTVM templates. The figure below depicts a high-level diagram of how a relay op is lowered to the actual implementation. The complicated dispatch logic has two major drawbacks: (a) it’s hard to find out which implementation is actually used for a target; (b) the test cases may not cover all the implementation in the TOPI.

Third, template keys in autoTVM are used to represent different algorithms for op implementation, e.g., direct vs winograd for conv2d. But it has been abused to allow autoTVM to identify more templates. For example, “int8” is used as a template key, which seems inappropriate IMO.

Proposal

We propose to add FTVMStrategy function attr to Relay op that defines the kernel dispatch rules. Correspondingly we remove the dispatch logic from TOPI and remove the template key from autoTVM template to simplify the design. As a result, TOPI only contains a set of plain functions to define op compute and schedule for different targets, and some of which can be autoTVM templates. For instance, we can define compute/schedule function and autoTVM templates as follows:

def schedule_softmax(outs): ...

@autotvm.register_topi_compute("conv2d_nchw.x86") # autoTVM task name
def conv2d_nchw(cfg, data, kernel, strides, padding, dilation, layout, out_dtype): ...

@autotvm.register_topi_schedule("conv2d_nchw.x86") # autoTVM task name
def schedule_conv2d_nchw(cfg, outs): ...

Further, we don’t need to manually define autoTVM tasks for every template. What we need to do is to register compute and schedule template with the same unique autoTVM task name, then it can automatically match them together.

Now let’s look at the definition of FTVMStrategy. We define FTVMStrategy as a generic function, so that we can define different strategies for each target.

// Function signature:
// OpStrategy(const Attrs& attrs,
//            const Array<Tensor>& inputs,
//            const Type& out_type,
//            const Target& target)
using FTVMStrategy = GenericFunc;

The data structure is defined as follows. We realize that in many cases, compute and schedule are highly coupled (e.g., conv2d in x86). Therefore, we use OpImplement that includes a pair of compute and schedule function to represent an op kernel implementation. Under each specialized condition (e.g., batch < 16), we can add multiple implementations with different priorities. Finally, an OpStrategy contains a set of specialization.

class OpImplementNode : public relay::ExprNode {
  FTVMCompute fcompute;
  FTVMSchedule fschedule;
  Integer plevel;
};

class OpSpecialization : public relay::ExprNode {
  Array<OpImplement> implements;
  // could be undefined, indicating no condition required
  SpecializedCondition condition;
};

class OpStrategyNode : public relay::ExprNode {
  Array<OpSpecialization> specializations;
};

class OpStrategy : public relay::Expr {
  void AddImplement(FTVMCompute fcompute, 
                    FTVMSchedule fschedule,
                    int plevel);
};

The following figure shows the new codegen diagram. When compile engine encounters an op with concrete shapes, it collects all valid implementations, queries the autoTVM log if possible, and adopts the implementation with best performance or highest priority. If TVM compiles op with symbolic shape, it will generate multiple kernels, each corresponding to the implementation with highest priority under each specialized condition, and a dispatch kernel that invokes the corresponding kernel based on runtime shape. In the future, we should also enable auto-tuning for strategy and allow it to load config from file.

relay-strategy

At last, the following code snippet shows two examples of how to define strategy. For simple ops like softmax and concatenate, which share the same compute across targets, we only need to register a schedule function for each target. Otherwise, the dense strategy shows a general case where we can define a more complicated strategy that uses different computes and schedules.

@schedule_softmax.register(["cuda", "gpu"])
def schedule_softmax(_, outs):
    return topi.cuda.schedule_softmax(outs)

@dense_strategy.register("cpu")
def dense_strategy(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    m, k = inputs[0].shape
    strategy.add_implement(wrap_compute_dense(topi.x86.dense_nopack),
                           wrap_topi_schedule(topi.x86.schedule_dense_nopack),
                           10)
    if "cblas" in target.libs:
        strategy.add_implement(wrap_compute_dense(topi.x86.dense_cblas),
                               wrap_topi_schedule(topi.x86.schedule_dense_cblas),
                               5)
    with SpecializedCondition(k > 16):
        strategy.add_implement(wrap_compute_dense(topi.x86.dense_pack),
                               wrap_topi_schedule(topi.x86.schedule_dense_pack))
    return strategy

There are a few caveats for this change:

NNVM compiler can no longer use TOPI since we only register the OpStrategy to Relay ops. It should be okay since NNVM compiler is already deprecated.
Existing autoTVM logs cannot be used after this change as the workload name will be different.

I have a WIP implementation at https://github.com/icemelon9/tvm/tree/op-dispatch.

Since this is a significant change, let’s use this thread to discuss the design for some time.

@tqchen @kevinthesun @yzhliu @jroesch @comaniac @zhiics @ziheng @merrymercy @junrushao @masahi @eqy @ajtulloch

junrushao · December 28, 2019, 4:07pm

Thank you for bringing up such a nice proposal!

At first glance, I think I have a few very naive questions:

In your proposal, GenericFunc is the signature of FTVMStrategy. Does this GenericFunc refer to the GenericFunc in tvm/targets?
We saw two layers of abstraction: FTVMStrategy and OpStrategy. Could you elaborate their relationship? Does it mean that FTVMStrategy maps DeviceType (e.g. “cpu”, “gpu”) to OpStrategy?
I didn’t see too many details about SpecializedCondition. Is this a TVM expression?

Thanks!

comaniac · December 29, 2019, 7:28pm

I like this proposal a lot. Here are few points/questions.

Could we unified the strategy decorator for strategy definition? In the current example simple ops have to use schedule_XX.register([target list]) while others use XX_strategy.register([target list]). It looks a bit confusing. Maybe we can only provide the later one and let the simple case have only one strategy for simplification (e.g., softmax_strategy.register(["cuda", "gpu"]))?
The AutoTVM task definition is also a bit confusing to me. For example, assuming we have a conv2d compute function working for both CPU and GPU but different schedule functions. In this case how should we register AutoTVM tasks? The following code snippet illustrates this case, although this double decorators will not work as expected:


@autotvm.register_topi_compute("conv2d_nchw.cuda")
@autotvm.register_topi_compute("conv2d_nchw.x86")
def conv2d_nchw(...):

@autotvm.register_topi_schedule("conv2d_nchw.cuda")
def schedule_conv2d_nchw_cuda(...):

@autotvm.register_topi_schedule("conv2d_nchw.x86")
def schedule_conv2d_nchw_x86(...):

On the other hand, intuitively every registered OpImplement should be an AutoTVM task, so I am imagining we could register AutoTVM tasks when adding implement to strategy. Something like

strategy.add_implement(wrap_compute_dense(topi.x86.dense_nopack),
                       wrap_topi_schedule(topi.x86.schedule_dense_nopack),
                       10,
                       autotvm_task_name="dense_x86")

where autotvm_task_name indicates an AutoTVM task. Since the same compute function can appear multiple times in different implement pair, this approach seems more flexible. Please help clarify if I misunderstand the AutoTVM task registration mechanism.

jonso · December 29, 2019, 7:26pm

I think this is a great proposal! I agree with @comaniac in that we should have a unified way to define a strategy. I think it is important for code cleanliness and ease of development.

Can you please explain the changes to AutoTVM to allow it to tune over multiple implements per condition? And how that data will be saved in the AutoTVM logs? I personally think that this is an exciting change, as it can allow us to try multiple types of scheduling, rather than hard-coding and tuning over a few parameters - I am just curious what it will end up looking like.

haichen · December 30, 2019, 12:31am

Yes, you’re right.
FTVMStrategy is a function which takes arguments of op attrs, input tenors, output type, and target, and returns an OpStrategy. Since FTVMStrategy is a generic function, it can be overwritten for different device type.
SpecializedCondition is a new node I added. You can find its definition at https://github.com/icemelon9/tvm/blob/op-dispatch/include/tvm/schedule.h#L747-L780. It contains conditions in conjunctive normal form.

tqchen · December 30, 2019, 4:53pm

Thanks for the proposal, I like the overall organization that:

It favors logical dispatch over backend
It collects the valid strategies as a search space, rather than bake in heuristics too early

Here are some minor questions:

I noticed data structures are relay::Expr, what is the implication of that(e.g. are we treating them as special relay functions?)?
How can we mix and match generic implementations, e.g. a common template that can be shared across backend(mostly in terms of code organization)

kevinthesun · December 30, 2019, 6:36pm

Thank you for this proposal! I have a question about OpSpecialization. What does ‘priority’ mean in this context? How do we select if multiple implements are provided for a specialization?

haichen · December 30, 2019, 9:13pm

@comaniac

schedule_XX and XX_strategy are both generic function defined in op/strategy/generic.py. Then we can re-define them for different targets. To resolve the confusion, I therefore use two different APIs to register the schedule and strategy to relay op, shown below.
```
reg.register_schedule("nn.log_softmax", strategy.schedule_softmax)
reg.register_strategy("nn.dense", strategy.dense_strategy)
```
It’s a bit difficult to unify both of them since they are both generic functions.

We can potentially only use XX_strategy. But to define simple strategy, we need to write more code to so. For example,
```
@softmax_strategy.register(["cuda", "gpu"])
def softmax_strategy(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    compute = _op.get("softmax").get_attr("FTVMCompute")
    strategy.add_implement(compute, wrap_topi_schedule(topi.cuda.schedule_softmax))
    return strategy
```
Yes, you’re right. We cannot use double decorators. In that case, we can then define a conv2d_nchw in topi generic, and then call generic function in both x86 and cuda with the autotvm decorator. But in most cases, the autotvm compute template are different across targets.

Futher, register_topi_compute and register_topi_schedule not only registers them to a autotvm task, also wraps the compute and schedule function such that the first argument becomes autotvm config. So we cannot simply only register compute and schedule function in the add_implement.

haichen · December 30, 2019, 9:24pm

AutoTVM log format remains the same as previous. When we extract autotvm tasks from a relay program, the compile engine will invoke all implements registered in the OpStrategy. Therefore, AutoTVM can tune over multiple implements. When compiling the program, compile engine will also query the autoTVM logs for each implement and pick the one with lowest latency. Code can be found here.

For example, previously when cblas library is added in the target, we must use extern op with cblas library in the TOPI compute. Now we can register cblas extern op as an autotvm template, which allow autoTVM to log its latency, and add it as an implement in the strategy. During the compilation, we can then determine whether cblas or TVM implementation is faster from autotvm log.

haichen · December 30, 2019, 9:27pm

I just want to let the data structures inherit from node system. Probably I should just use tvm::Node then?
I have two common templates for broadcast/injective and reduction ops. And when registering strategy these categories, we will just use register_strategy_injective and register_strategy_reduction. Is this what you suggest?

comaniac · December 30, 2019, 9:55pm

Thanks for the helpful clarification!

For 1, I agree with you that unifying these two cases is not trivial, although I personally prefer the single XX_strategy solution as it looks more straightforward even with mode code. I am open to either way.

For 2, I agree that different schedule functions share the same compute function would rarely happen, so your clarification makes sense.

A miner informatino for people who are interested in the wrapper functions: Since the function name wrap_compute_dense seems not that straighforward to me, I traced the code and put a short explanation here: This wrapper function unpacks the info from a generic signature (attrs, inputs, out_type) and maps them to the compute function signature (cfg, data, weight, bias=None, out_dtype=None, for example).

tqchen · December 31, 2019, 3:02am

If we don’t have special consideration, perhaps it is worth to make it an object first. As long as we can mix and match implementation strategies I think we are good

junrushao · December 31, 2019, 8:57am

I agree with @tqchen. To make things easier, if there is no special consideration, we could inherit from Object and ObjectRef.

junrushao · December 31, 2019, 4:42pm

On a second read, I felt increasingly like this proposal - it addresses a lot of the long-standing fragmentation issues inside the tvm projects due to legacy.

My next concern is extensibility. This is understandable that SpecializedCondition is currently a CNF expressible in TVM IR, but is it possible to allow other forms of conditions? Like complicated hand-written ones in C++.

I think once those concerns are properly settled, we are ready to add more new operators to Relay and TOPI.

FrozenGene · January 2, 2020, 9:08am

Thanks for bringing in this proposal, I like this proposal overall, especially SpecializedCondition. Because in our schedule for arm cpu, we have tensorize / no tensorize according to whether H W could mod 8 or not. If we have this, we could implement it using SpecializedCondition elegantly!

However, I care performance impact a lot. I still don’t fully understand your answer for tunable priority. For example, we have winograd algorithm of autotvm, we could add direct and winograd auto tvm task in the whole tasks queue and let auto tvm decide what is better, but seems we can not complete it now, because we will set one priority for each op implement. If not correct, please correct me.

comaniac · January 2, 2020, 5:45pm

My understanding is that AutoTVM does not related to the strategy and condition, because we have to register AutoTVM tasks in TOPI separately. When collecting tasks, AutoTVM should collect and tune all registered tasks, including direct and winograd in your example (please correct me if I misunderstood). After the tuning, your log will have the records for both tasks.

When applying the history best with the tuning log, the original proposal mentions

When compile engine encounters an op with concrete shapes, it collects all valid implementations, queries the autoTVM log if possible, and adopts the implementation with best performance or highest priority

Accordingly, I think the priority works only when there’s no other judgements (i,e., no tuning logs) for different implements. If you have tuning records for multiple implements, the compile engine will pick up the best one that fits the condition.

haichen · January 2, 2020, 6:25pm

@FrozenGene @comaniac is correct about the priority. Priority is only used when there is no tuning logs or compiling for symbolic shape where there are multiple implements registered.

In your example of direct and winograd for conv2d, we can register direct with no SpecializedCondition and winograd with the SpecializedCondition of kernel size <= 5. So when compiling a conv2d with 3x3 kernel, compile engine will find both implements available for this op and query the AutoTVM log to determine which one is better.

@kevinthesun Hope this answer also help you understand priority.

haichen · January 2, 2020, 6:32pm

The reason I use CNF to express SpecializedCondition is because it’ll be easier to find out the bound of shapes. I think it can also serve as a hint to enable autoTVM to tune the symbolic shape, which we can probably replace the shape var by the lower bound.

Do you suggest we should define the SpecializedConditon using PackedFunc which we have more flexibility to write the conditions?

haichen · January 29, 2020, 8:29pm

The PR can be reviewed now.

wda · June 24, 2020, 5:27am

Now，I woulde like to use te.extern to add an extern op level library ，like cuDNN. For complex ops, like conv2d or dense , we can add new dispatch rules in op strategy,like @softmax_strategy.register([“cuda”, “gpu”]), and then redefine the compute with te.extern. But for some injective or broadcast ops, TVM define the same compute function across all targets. As follows,

RELAY_REGISTER_OP("nn.relu")

    .describe(R"code(Returns the relu input array, computed element-wise.

.. math::

   max(x, 0)

)code" TVM_ADD_FILELINE)

    .set_num_inputs(1)

    .add_argument("data", "Tensor", "The input tensor.")

    .set_support_level(1)

    .add_type_rel("Identity", IdentityRel)

    .set_attr<FInferCorrectLayout>("FInferCorrectLayout", ElemwiseArbitraryLayout)

    .set_attr<FTVMCompute>("FTVMCompute", [](const Attrs& attrs, const Array<te::Tensor>& inputs,

                                             const Type& out_type) {

      return Array<te::Tensor>{topi::relu(inputs[0], 0.0f)};

    });

Now, It seems that dispatch logic has been removed from TOPI. So how can I add new compute for relu? @haichen