Softmax is really slow

Hi there,

My network’s inference, compiled by TVM with the CUDA backend, is much slower than its MXNet counterpart (~120 ms vs. ~20 ms).

I used nvprof to profile the run, and the final softmax layer takes far too long (~100 ms). I think it’s the bottleneck.

The softmax layer’s input is a 15x336x448 (CxHxW) tensor, and the softmax is computed along the C axis.

The following is a snippet of my nvprof result:

Start Duration Grid Size Block Size Regs SSMem DSMem Size Throughput SrcMemType DstMemType Device Context Stream Name
2.04882s 2.2981ms (256 1 1) (512 1 1) 32 0B 0B - - - - GeForce GTX 106 1 7 fuse_resize_kernel0 [3570]
2.05112s 3.9094ms (8 168 1) (8 2 5) 127 3.9688KB 0B - - - - GeForce GTX 106 1 7 fuse_conv2d_kernel0 [3573]
2.05503s 19.074ms (1 1 1) (1 1 1) 31 0B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel0 [3576]
2.07410s 59.040ms (1 1 1) (64 1 1) 18 256B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel1 [3578]
2.13314s 21.612ms (1 1 1) (64 1 1) 40 0B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel2 [3580]
2.15476s 2.9749ms - - - - - 8.6133MB 2.8274GB/s Device Pageable GeForce GTX 106 1 7 [CUDA memcpy DtoH]

Can anyone advise me on how to speed up the softmax layer?

Yes, I’ve seen this before too. I think TVM’s CUDA softmax is implemented assuming a 2D input following a dense layer. It shouldn’t be hard to generalize it to spatial inputs.

Hi @masahi,

Do you mean I only need to change the current softmax schedule into a spatial one to get the speedup?

I’m not familiar with GPU programming. Can I just split more blocks and threads across the input’s width and height to achieve better speed?

Yes, only the CUDA softmax schedule needs to be changed.

But you should also consider whether you really need softmax at all. If you don’t need probability-like output, you can simply remove it.

After some digging, I found that the current softmax implementation only parallelizes along the reduction axis.

I changed the parallelization to run across the feature map’s width and height instead, and now the speed meets my requirement (~120 ms -> ~20 ms). Roughly, the change amounts to binding the spatial axes to CUDA blocks and threads, as in the sketch below.
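For reference, here is a minimal standalone sketch of the idea using the low-level tvm API. It is not the actual TOPI schedule that was changed; the shapes, the `num_thread = 64` split factor, and the `bind_spatial` helper are all just illustrative.

```python
import tvm

# Illustrative shapes from this thread: softmax over C for a CxHxW map.
C, H, W = 15, 336, 448
x = tvm.placeholder((C, H, W), name="x")

# Numerically stable softmax along the channel axis: one independent
# reduction per spatial location (h, w).
rc = tvm.reduce_axis((0, C), name="rc")
max_elem = tvm.compute(
    (H, W), lambda h, w: tvm.max(x[rc, h, w], axis=rc), name="max_elem")
rc2 = tvm.reduce_axis((0, C), name="rc2")
expsum = tvm.compute(
    (H, W),
    lambda h, w: tvm.sum(tvm.exp(x[rc2, h, w] - max_elem[h, w]), axis=rc2),
    name="expsum")
softmax = tvm.compute(
    (C, H, W),
    lambda c, h, w: tvm.exp(x[c, h, w] - max_elem[h, w]) / expsum[h, w],
    name="softmax")

s = tvm.create_schedule(softmax.op)
num_thread = 64  # illustrative choice; tune for your GPU

def bind_spatial(stage):
    # Fuse the data-parallel (spatial) axes and spread them over CUDA
    # blocks/threads, instead of launching a (1 1 1) grid as in the
    # nvprof trace above. The length-C reduction stays serial per thread.
    fused = stage.fuse(*stage.op.axis)
    bx, tx = stage.split(fused, factor=num_thread)
    stage.bind(bx, tvm.thread_axis("blockIdx.x"))
    stage.bind(tx, tvm.thread_axis("threadIdx.x"))

for t in (max_elem, expsum, softmax):
    bind_spatial(s[t])

f = tvm.build(s, [x, softmax], "cuda")  # requires a CUDA-enabled TVM build
```

With this, each thread performs the full length-15 channel reduction for one (h, w) position, so the work is spread over H*W = 150,528 independent positions rather than a single block.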

Thanks for the help.


Nice, I’ve also sent a PR to fix this issue. It should be merged soon.