Softmax - Sequence of Relay ops

@jonso @kevinthesun @yzhliu

Currently softmax has its own compute and schedule. I am wondering why we do not represent softmax as a sequence of Relay operators - exp, max, sum, divide (sketched below). There are two key benefits:

  • We can reuse all the work for reduce schedules across targets. This will clean up the schedules that we have written for softmax separately.
  • We are working on fast_exp. By representing softmax in this manner, we can directly apply fast_exp via the FastMath pass (in development).
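
For concreteness, here is a minimal sketch (not the actual pass) of what the decomposition could look like with existing Relay ops. The `decompose_softmax` helper name is hypothetical; the subtraction of the row max is included for numerical stability.

```python
import tvm
from tvm import relay

# Hypothetical helper: express softmax along `axis` as existing Relay ops,
# max -> subtract -> exp -> sum -> divide (subtracting the max keeps it stable).
def decompose_softmax(x, axis=-1):
    m = relay.max(x, axis=axis, keepdims=True)
    e = relay.exp(relay.subtract(x, m))
    s = relay.sum(e, axis=axis, keepdims=True)
    return relay.divide(e, s)

data = relay.var("data", shape=(1, 1000), dtype="float32")
func = relay.Function([data], decompose_softmax(data, axis=-1))
mod = tvm.IRModule.from_expr(func)
print(relay.transform.InferType()(mod))
```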

If we can get performance parity, then I’m all for it. However, the benefit of keeping it all in one schedule is that the schedule can be specialized. For example, if the axis is -1, the CPU schedule puts all of the computations under the same loop and parallelizes the outer dimensions. This provides a big performance improvement over doing each loop separately. The details are here.
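
To illustrate the kind of specialization described above (this is only a hedged TE sketch, not the actual TOPI x86 schedule): for axis -1, keep the max/exp/sum/divide stages under the same outer row loop and parallelize over rows.

```python
import tvm
from tvm import te

# Sketch: 2-D softmax over the last axis, with every intermediate stage
# computed inside the consumer's row loop and the row loop parallelized.
n, m = te.var("n"), te.var("m")
x = te.placeholder((n, m), name="x")

k = te.reduce_axis((0, m), name="k")
mx = te.compute((n,), lambda i: te.max(x[i, k], axis=k), name="mx")
e = te.compute((n, m), lambda i, j: te.exp(x[i, j] - mx[i]), name="e")
k2 = te.reduce_axis((0, m), name="k2")
s = te.compute((n,), lambda i: te.sum(e[i, k2], axis=k2), name="s")
y = te.compute((n, m), lambda i, j: e[i, j] / s[i], name="softmax")

sch = te.create_schedule(y.op)
row, _ = y.op.axis
# keep all producer stages under the same row loop, then parallelize over rows
sch[mx].compute_at(sch[y], row)
sch[e].compute_at(sch[y], row)
sch[s].compute_at(sch[y], row)
sch[y].parallel(row)
print(tvm.lower(sch, [x, y], simple_mode=True))
```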

Let’s do some experiments and see whether the benefits of fast_exp outweigh the benefits of having the computations under the same loop.

I put up some numbers here - https://github.com/apache/incubator-tvm/pull/4878

Happy to run more experiments if you think there is a special case worth covering.

@jonso @masahi

I have updated the numbers. From just looking at the speedup numbers, the situation might look mixed.

But I also think we should look at this differently. In both earlier PRs (one for CPU by @jonso, the other for GPU by @masahi), we started from very bad performance, and your PRs led to significant improvements.

Now, some of that performance improvement is gone with the new PR, but it still seems to be much better than the original situation (before your PRs). So, are these new perf numbers fast enough for end-to-end networks?

At the same time, the new PR leads to significant improvements for other shapes and axes that were not considered in those earlier PRs (for example, the VGG SSD softmax shape).

So, I would suggest getting this PR in, and then thinking about how to get the performance back up by looking at fusion. For example, if a reduce over axis -1 is followed by an injective op, can we fuse them in Relay? These optimizations would be more generic.

I see, my test case hits a major slowdown on cuda.

Nevertheless, I think sub-millisecond is fast enough, and softmax is never a bottleneck anyway, so until someone else notices and complains, I’m fine with moving forward with your PR :slight_smile:

I think it might not be a good idea to always decompose softmax. On cuda, for example, we can map softmax to the cuDNN implementation, but after softmax is broken down into a sequence of ops, that is no longer possible. The optimal way to implement softmax is to fuse it into a single kernel, especially on cuda.
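
For context, keeping softmax as a single Relay op is what makes an offload like the following possible. This is only a sketch; it assumes TVM was built with cuDNN and that the cuda strategy dispatches the whole-op softmax to cuDNN when `-libs=cudnn` is given.

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(32, 1000), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data], relay.nn.softmax(data, axis=-1)))

# With the op kept whole, the cuda backend can pick the cuDNN softmax kernel.
lib = relay.build(mod, target="cuda -libs=cudnn")
```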

An alternative way to improve softmax performance is to reuse the reduction schedules inside the softmax TOPI schedule, instead of breaking it down into a sequence of ops.

Regarding fast_exp, we could probably add an approximate softmax op, or add an attribute to softmax that marks it as using fast_exp.

These are all good points. @haichen, we could always “re-fuse” softmax using the new MergeComposite pass and have that composite call a cuDNN or MKL-DNN external runtime. However, that is less elegant than what we have right now.
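
To make the “re-fuse” idea concrete, here is a hedged sketch using the dataflow pattern language available in newer TVM (the original MergeComposite API took Relay expression patterns instead). The composite name `cudnn.softmax` is just a placeholder for whatever the external codegen would key off.

```python
import tvm
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

# Pattern for the decomposed softmax:
# divide(exp(x - max(x)), sum(exp(x - max(x)))).
def softmax_pattern():
    x = wildcard()
    mx = is_op("max")(x)
    e = is_op("exp")(is_op("subtract")(x, mx))
    s = is_op("sum")(e)
    return is_op("divide")(e, s)

# Build a decomposed softmax to match against.
x = relay.var("x", shape=(32, 1000), dtype="float32")
mx = relay.max(x, axis=-1, keepdims=True)
e = relay.exp(relay.subtract(x, mx))
body = relay.divide(e, relay.sum(e, axis=-1, keepdims=True))
mod = tvm.IRModule.from_expr(relay.Function([x], body))

# "cudnn.softmax" is a placeholder composite name for an external codegen.
mod = relay.transform.MergeComposite([("cudnn.softmax", softmax_pattern())])(mod)
print(mod)
```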

Thanks all for the comments. Let me give this some more thought.

How about Relay providing an interface to define these ops while still expressing them with an op boundary? That way the compiler can optimize them, since the ops are a white box to the compiler, and at the same time we can still lower them as single ops when we want to.

@tqchen