I have updated the numbers. Judging by the speedup numbers alone, the picture looks mixed.
But I think we should also look at this from a different angle. In both earlier PRs (one for CPU by jonso@, the other for GPU by masahi@), we started from very bad performance, and your PRs led to significant improvements.
Some of that improvement is lost with the new PR, but performance still seems to be much better than the original baseline (before your PRs). So, are these new perf numbers fast enough for e2e networks?
At the same time, the new PR leads to significant improvements for other shapes and axes that were not considered in those earlier PRs (for example, the VGG SSD softmax shape).
So, I suggest we get this PR in. We should then think about how we can get the performance up by looking at fusion. For example, if the reduce axis is -1 and the op is followed by an injective op, can we fuse them in Relay? Such optimizations would be more generic.
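
To make the fusion idea a bit more concrete, here is a rough pure-Python sketch (not TVM or Relay code). It assumes a 2D input with softmax over the last axis, and the function names and the `scale` op are hypothetical, chosen only for illustration. The unfused version materializes the softmax output and then runs a second elementwise pass; the fused version applies the injective op inside the normalization loop.

```python
import math


def softmax_then_op_unfused(rows, op):
    """Two passes: softmax over the last axis, then a separate elementwise pass."""
    softmax_out = []
    for row in rows:
        row_max = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        # The intermediate softmax result is fully materialized here.
        softmax_out.append([e / total for e in exps])
    # Second pass over the intermediate result applies the injective op.
    return [[op(v) for v in row] for row in softmax_out]


def softmax_then_op_fused(rows, op):
    """One pass: the elementwise op is applied inside the normalization loop."""
    out = []
    for row in rows:
        row_max = max(row)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        # No intermediate softmax tensor; the op is applied as each value is produced.
        out.append([op(e / total) for e in exps])
    return out


def scale(v):
    """Hypothetical injective (elementwise) op used for illustration."""
    return 2.0 * v


data = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
unfused = softmax_then_op_unfused(data, scale)
fused = softmax_then_op_fused(data, scale)
```

Both versions compute the same values; the difference is only that the fused one avoids writing and re-reading the intermediate softmax output, which is the saving a Relay-level fusion would be after.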