Hi there,
My network, compiled by TVM with the CUDA backend, runs much slower at inference than its MXNet counterpart (~120 ms vs. ~20 ms).
I used nvprof to profile the run; the final softmax layer takes about 100 ms, which I think is the bottleneck.
The softmax layer's input is a 15x336x448 (CxHxW) tensor, and the softmax is computed along the C axis.
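For reference, this is what the layer computes: a softmax over the 15 channels at each of the 336x448 spatial positions. A plain NumPy sketch of the math (not the TVM schedule; the function name is just for illustration):

```python
import numpy as np

def channel_softmax(x):
    """Numerically stable softmax over the channel (C) axis of a CxHxW tensor."""
    # Subtract the per-pixel max over C so exp() cannot overflow.
    m = x.max(axis=0, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=0, keepdims=True)

x = np.random.rand(15, 336, 448).astype(np.float32)
y = channel_softmax(x)
# y keeps the input shape; at every (h, w) position the 15 channel values sum to 1
```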
The following is a snippet of my nvprof output:

```
2.04882s 2.2981ms (256 1 1) (512 1 1) 32 0B 0B - - - - GeForce GTX 106 1 7 fuse_resize_kernel0 [3570]
2.05112s 3.9094ms (8 168 1) (8 2 5) 127 3.9688KB 0B - - - - GeForce GTX 106 1 7 fuse_conv2d_kernel0 [3573]
2.05503s 19.074ms (1 1 1) (1 1 1) 31 0B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel0 [3576]
2.07410s 59.040ms (1 1 1) (64 1 1) 18 256B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel1 [3578]
2.13314s 21.612ms (1 1 1) (64 1 1) 40 0B 0B - - - - GeForce GTX 106 1 7 fuse_softmax_kernel2 [3580]
2.15476s 2.9749ms - - - - - 8.6133MB 2.8274GB/s Device Pageable GeForce GTX 106 1 7 [CUDA memcpy DtoH]
```

Note that the three softmax kernels are launched with grid size (1 1 1), i.e. a single thread block, while the conv2d kernel gets a (8 168 1) grid.
Can anyone advise on how to speed up the softmax layer?