Softmax is really slow


#1

Hi there,

My network’s inference, compiled by TVM with the CUDA target, is much slower than its MXNet counterpart (~120 ms vs. ~20 ms).

I used nvprof to profile the run, and the final softmax layer takes far too long (~100 ms in total); I think it’s the bottleneck.

The softmax layer’s input is a 15x336x448 (CxHxW) tensor, and the softmax is computed along the C axis.

The following is a snippet of my nvprof output:

2.04882s 2.2981ms (256 1 1) (512 1 1) 32 0B 0B - - - - GeForce GTX 106 1 7 fuse_resize_kernel0 [3570]
2.05112s 3.9094ms (8 168 1) (8 2 5) 127 3.9688KB 0B - - - - GeForce GTX 106 1 7 fuse_conv2d_kernel0 [3573]
2.05503s 19.074ms (1 1 1) (1 1 1) 31 0B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel0 [3576]
2.07410s 59.040ms (1 1 1) (64 1 1) 18 256B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel1 [3578]
2.13314s 21.612ms (1 1 1) (64 1 1) 40 0B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel2 [3580]
2.15476s 2.9749ms - - - - - 8.6133MB 2.8274GB/s Device Pageable GeForce GTX 106 1 7 [CUDA memcpy DtoH]

Can anyone advise on how to speed up the softmax layer?


#2

Yes, I’ve seen this before. I think TVM’s CUDA softmax is implemented assuming a 2D input coming after a dense layer. You can see the symptom in your nvprof output: the softmax kernels launch with a (1 1 1) grid, so almost the whole GPU sits idle. It shouldn’t be hard to generalize it for spatial input.


#3

Hi @masahi,

Do you mean I only need to change the current softmax schedule into a spatial one to get the speedup?

I’m not familiar with GPU programming. Can I just split the work into more blocks and threads across the input’s width and height to achieve better speed?


#4

Yes, only the CUDA softmax schedule needs to be changed.

But you should also consider whether you really need softmax at all. If you don’t need probability-like output, you can just remove it.
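
For example, here is a quick NumPy illustration, assuming the only thing you do downstream is take the per-pixel class decision: softmax is monotonic along the channel axis, so dropping it leaves the argmax unchanged.

    import numpy as np

    # Toy check with the shape from the original post: softmax is
    # monotonic along the channel axis, so the per-pixel argmax is
    # identical with or without it.
    x = np.random.randn(15, 336, 448).astype("float32")
    e = np.exp(x - x.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    assert (probs.argmax(axis=0) == x.argmax(axis=0)).all()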


#5

After some digging, I found that the current softmax implementation only uses multiple threads along the reduction axis.

I changed the threading to run along the feature map’s width and height instead, and now the speed meets my requirement (~120 ms -> ~20 ms).
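
Roughly what I did, as a minimal sketch: a hand-written channel-axis softmax scheduled with the te API of older TVM releases. All names here are illustrative, not TVM’s actual topi code.

    import tvm
    from tvm import te

    C, H, W = 15, 336, 448  # shape from the original post
    x = te.placeholder((C, H, W), name="x")

    # Channel-axis softmax by hand: max, exp-sum, then normalize.
    rc = te.reduce_axis((0, C), name="rc")
    max_elem = te.compute(
        (H, W), lambda h, w: te.max(x[rc, h, w], axis=rc), name="max_elem"
    )
    rs = te.reduce_axis((0, C), name="rs")
    expsum = te.compute(
        (H, W),
        lambda h, w: te.sum(te.exp(x[rs, h, w] - max_elem[h, w]), axis=rs),
        name="expsum",
    )
    softmax = te.compute(
        (C, H, W),
        lambda c, h, w: te.exp(x[c, h, w] - max_elem[h, w]) / expsum[h, w],
        name="softmax",
    )

    s = te.create_schedule(softmax.op)
    num_thread = 64

    # Key change: fuse the spatial axes and bind them to CUDA
    # blocks/threads, so each thread handles the short (C=15)
    # channel reduction for one pixel, instead of all threads
    # crowding onto the reduction axis.
    for t in (max_elem, expsum, softmax):
        fused = s[t].fuse(*t.op.axis)  # for softmax this fuses c, h, w
        bx, tx = s[t].split(fused, factor=num_thread)
        s[t].bind(bx, te.thread_axis("blockIdx.x"))
        s[t].bind(tx, te.thread_axis("threadIdx.x"))

    func = tvm.build(s, [x, softmax], target="cuda")

Since C is only 15, keeping the reduction serial inside each thread while parallelizing over the 336x448 = 150,528 pixels exposes far more parallelism than threading over the 15-element reduction.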

Thanks for the help.


#6

Nice. I’ve also sent a PR to fix this issue; it should be merged soon.