mxnet
[12:59:02] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[12:59:03] src/nnvm/legacy_json_util.cc:204: Warning: loading symbol saved by MXNet version 10500 with lower version of MXNet v10400. May cause undefined behavior. Please update MXNet if you encounter any issue
==6435== NVPROF is profiling process 6435, command: python detect_tvm_ssd_det.py
==6435== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
[12:59:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[13:00:05] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Time : 0.03228187561035156
Time : 0.03231477737426758
Time : 0.03212261199951172
Time : 0.03220796585083008
Time : 0.03215146064758301
Time : 0.03214383125305176
Time : 0.03216266632080078
Time : 0.0321197509765625
Time : 0.0322270393371582
Time : 0.032259225845336914
==6435== Profiling application: python detect_tvm_ssd_det.py
[13:00:37] src/engine/naive_engine.cc:69: Engine shutdown
==6435== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 48.35% 21.1777s 6894 3.0719ms 231.84us 432.84ms maxwell_cgemm_64x64_tn
24.13% 10.5691s 11893 888.68us 839.67us 93.485ms maxwell_gcgemm_64x64_nt
4.77% 2.08878s 2340 892.64us 158.40us 14.699ms maxwell_sgemm_128x128_nt
4.65% 2.03762s 2294 888.24us 162.62us 15.071ms maxwell_sgemm_128x128_nn
2.55% 1.11747s 40 27.937ms 877.59us 106.65ms maxwell_cgemm_32x64_tn
1.68% 737.19ms 6840 107.78us 8.9600us 60.059ms void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *)
0.84% 369.00ms 5402 68.308us 6.1440us 126.68ms void fft2d_r2c_32x32<float, unsigned int=1, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool)
0.81% 355.01ms 5882 60.355us 3.5200us 49.113ms void fft2d_r2c_16x16<float>(float2*, float const *, int, int, int, int, int, int, int, int)
0.73% 318.53ms 210 1.5168ms 25.280us 158.52ms void DSE::regular_fft_pad<int=0, int=1, int=256, int=16, int=16, int=1, float, float, float2>(float2*, float*, int, int3, float*, int, float*, float*, int, int, int, int, int, bool)
0.57% 249.74ms 2300 108.58us 6.0800us 123.73ms void fft2d_r2c_32x32<float, unsigned int=1, bool=1>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool)
0.56% 246.82ms 2260 109.21us 49.503us 579.67us maxwell_sgemm_128x64_nt
0.49% 214.37ms 6934 30.916us 3.0400us 4.3442ms compute_gemm_pointers(float2**, float2 const *, int, float2 const *, int, float2 const *, int, int)
0.47% 203.77ms 210 970.34us 18.656us 92.703ms void DSE::vector_fft<int=0, int=1, int=256, int=16, int=16, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.46% 199.98ms 69 2.8983ms 80.222us 47.964ms maxwell_gcgemm_32x32_nt
0.42% 184.82ms 2941 62.843us 4.3200us 42.412ms void fft2d_c2r_16x16<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.40% 174.22ms 3492 49.891us 9.2800us 23.737ms void fft2d_r2c_64x64<float>(float2*, float const *, int, int, int, int, int, int, int, int)
0.39% 169.90ms 4576 37.128us 7.4230us 574.14us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148t_nt
0.38% 164.55ms 105 1.5671ms 50.463us 70.989ms void DSE::regular_fft_clip<int=1, int=2, int=256, int=16, int=16, int=1, float, float, float2>(float*, float2*, int, int3, float2*, int, float2*, float2*, int, int, int, int, int, float, float, bool, int, float, float)
0.37% 162.21ms 576 281.61us 149.37us 508.25us void fermiCgemm_v3_kernel<bool=1, bool=0, bool=0, bool=0, int=5, int=5, int=3, int=8, int=8>(int, int, int, float2 const *, int, float2 const *, int, float2*, int, int, int, float2 const *, float2 const *, float2, float2, int)
0.35% 151.73ms 12 12.644ms 4.2643ms 27.578ms void pointwise_mult_and_sum_complex<float2, int=8, int=4>(float2*, float2*, float2*, int, int, int, int, int, float2)
0.25% 110.52ms 882 125.30us 10.304us 16.130ms void DSE::regular_fft_pad<int=0, int=1, int=128, int=16, int=32, int=1, float, float, float2>(float2*, float*, int, int3, float*, int, float*, float*, int, int, int, int, int, bool)
0.25% 108.32ms 1060 102.19us 7.3600us 1.5049ms void cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>(float, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>, cudnnTensorStruct, float const *, float, cudnnTensorStruct*, float, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const *, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>)
0.24% 103.58ms 441 234.87us 30.015us 31.348ms void DSE::regular_fft_clip<int=1, int=2, int=128, int=16, int=32, int=1, float, float, float2>(float*, float2*, int, int3, float2*, int, float2*, float2*, int, int, int, int, int, float, float, bool, int, float, float)
0.23% 100.17ms 12268 8.1640us 6.0800us 249.60us void fft2d_r2c_32x32<float, unsigned int=0, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool)
0.22% 98.205ms 4833 20.319us 6.4640us 22.589ms void fft2d_c2r_32x32<float, bool=0, unsigned int=1, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*)
0.21% 92.713ms 1746 53.100us 10.239us 21.309ms void fft2d_c2r_64x64<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.21% 92.413ms 880 105.02us 3.2640us 1.2833ms void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
0.21% 92.105ms 6 15.351ms 2.6429ms 35.988ms cudnn_maxwell_cgemm_64x64_tn_batched
0.20% 89.209ms 882 101.14us 5.4400us 14.765ms void DSE::vector_fft<int=0, int=1, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.20% 86.793ms 4600 18.868us 4.4800us 1.6882ms void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::WinogradDataParams<float, float>)
0.18% 79.550ms 210 378.81us 199.84us 1.1408ms void conv2d_c1_k1_nchw_hw_packed_kernel<float, float, int=3>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, int)
0.18% 77.497ms 274 282.83us 183.04us 5.6100ms maxwell_scudnn_128x32_stridedB_splitK_large_nn
0.17% 73.959ms 244 303.11us 40.320us 1.4893ms maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
0.17% 73.075ms 4600 15.885us 4.0960us 1.5974ms void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::WinogradOutputParams<float, float>)
0.17% 72.596ms 4282 16.953us 1.2160us 15.880ms void flip_filter<float, float>(float*, float const *, int, int, int, int)
0.16% 69.863ms 164 426.00us 123.61us 4.0056ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.15% 67.758ms 8800 7.6990us 6.0800us 25.664us void fft2d_c2r_32x32<float, bool=0, unsigned int=0, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*)
0.15% 66.270ms 3514 18.858us 6.4640us 2.3958ms maxwell_scudnn_128x32_relu_small_nn
0.15% 65.649ms 1 65.649ms 65.649ms 65.649ms void cudnn::detail::dgrad2d_alg1_1<float, int=0, int=6, int=7, int=5, int=4, int=5, bool=1, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad2d_alg1_1<float, int=0, int=6, int=7, int=5, int=4, int=5, bool=1, bool=1>*, kernel_grad_params, int, int, float, int, int)
0.13% 56.292ms 105 536.11us 20.319us 22.904ms void DSE::vector_fft<int=1, int=2, int=256, int=16, int=16, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.12% 54.625ms 4820 11.332us 3.3600us 673.59us void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
0.12% 54.607ms 248 220.19us 69.919us 418.94us maxwell_scudnn_128x128_relu_interior_nn
0.12% 52.703ms 1032 51.068us 11.456us 4.9740ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.12% 52.081ms 120 434.01us 34.079us 7.9281ms void fft1d_r2c_32<float, float, float2, bool=0, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
0.11% 50.127ms 207 242.16us 111.58us 495.35us maxwell_scudnn_128x64_relu_interior_nn
0.11% 48.450ms 2506 19.333us 9.2800us 3.3782ms maxwell_scudnn_128x32_stridedB_splitK_small_nn
0.10% 45.381ms 169 268.52us 89.598us 1.9236ms maxwell_scudnn_128x32_relu_interior_nn
0.10% 45.364ms 441 102.87us 9.5990us 14.754ms void DSE::vector_fft<int=1, int=2, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.10% 44.431ms 720 61.709us 59.679us 71.903us maxwell_scudnn_128x32_stridedB_splitK_medium_nn
0.10% 42.637ms 3609 11.813us 4.3200us 1.6868ms void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
0.10% 42.563ms 150 283.75us 99.199us 1.2750ms void conv2d_grouped_direct_kernel<float, float, float, float, float, bool=1, bool=0, int=0, int=1, int=3>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, int, float const *, float const *, cudnnActivationStruct)
0.09% 40.021ms 2300 17.400us 4.6390us 205.85us void cudnn::winograd_nonfused::winogradWgradData4x4<float, float>(cudnn::winograd_nonfused::WinogradDataParams<float, float>)
0.09% 39.725ms 30 1.3242ms 220.25us 7.7380ms maxwell_gcgemm_64x32_nt
0.09% 38.967ms 24 1.6236ms 37.439us 11.390ms void fft1d_r2c_32<float, float, float2, bool=1, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
0.09% 37.783ms 119 317.50us 47.743us 3.0068ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.08% 36.332ms 30 1.2111ms 178.49us 4.0763ms void dgrad2d_grouped_direct_kernel<float, float, bool=1, int=1, int=3>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor)
0.08% 36.230ms 4 9.0576ms 5.4853ms 12.727ms cudnn_maxwell_cgemm_32x64_tn_batched
0.08% 33.743ms 12 2.8119ms 349.60us 8.9944ms maxwell_scudnn_128x128_stridedB_splitK_small_nn
0.07% 32.838ms 4600 7.1380us 1.6640us 1.1086ms void cudnn::winograd_nonfused::winogradForwardFilter4x4<float, float>(cudnn::winograd_nonfused::WinogradFilterParams<float, float>)
0.07% 29.666ms 2576 11.516us 5.5030us 1.9406ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.07% 29.033ms 30 967.75us 379.26us 1.8717ms void wgrad2d_grouped_direct_kernel<float, float>(cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, cudnnConvolutionStruct, cudnnFilterStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, int)
0.06% 28.074ms 2300 12.206us 7.4560us 2.6572ms void cudnn::winograd_nonfused::winogradWgradOutput4x4<float, float>(cudnn::winograd_nonfused::WinogradWgradOutputParams<float, float>)
0.06% 26.482ms 26 1.0185ms 225.28us 5.0909ms void cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
0.06% 25.497ms 306 83.324us 7.6160us 3.8221ms void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
0.06% 24.909ms 2 12.454ms 9.7207ms 15.188ms cudnn_maxwell_cgemm_64x32_tn_batched
0.05% 23.487ms 2300 10.211us 3.1990us 109.06us void cudnn::winograd_nonfused::winogradWgradDelta4x4<float, float>(cudnn::winograd_nonfused::WinogradDeltaParams<float, float>)
0.04% 18.692ms 32 584.11us 16.864us 3.2814ms void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
0.04% 17.364ms 16 1.0852ms 56.415us 3.5814ms maxwell_scudnn_128x128_stridedB_splitK_interior_nn
0.03% 13.862ms 332 41.752us 576ns 2.1816ms [CUDA memcpy HtoD]
0.03% 13.822ms 35 394.90us 23.840us 1.6082ms void cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
0.03% 13.664ms 18 759.13us 227.90us 2.1301ms maxwell_scudnn_128x32_stridedB_splitK_interior_nn
0.03% 13.467ms 26 517.98us 127.58us 1.8329ms void cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
0.02% 10.577ms 14 755.50us 335.90us 1.7558ms maxwell_scudnn_128x64_stridedB_splitK_interior_nn
0.02% 10.408ms 3560 2.9230us 1.2160us 49.183us cudnn::maxwell::gemm::computeWgradOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.02% 10.105ms 1152 8.7710us 2.7200us 27.840us void gemmk1_kernel<float2, int=256, int=5, bool=1, bool=0, bool=0, bool=0>(cublasGemmk1Params<float2>, float2 const *, float2 const *, float2*)
0.02% 9.9490ms 4176 2.3820us 1.2800us 85.854us cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.02% 9.9077ms 3651 2.7130us 799ns 354.11us void scalePackedTensor_kernel<float, float>(cudnnTensor4dStruct, float*, float)
0.02% 8.2953ms 20 414.77us 161.85us 1.2357ms void cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
0.02% 8.2477ms 96 85.913us 19.776us 277.28us void fft1d_c2r_32<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float*, float2 const *, int, int3, int3, int2, int, float, float, float*, float*)
0.01% 5.8701ms 6 978.34us 278.72us 2.3644ms maxwell_scudnn_128x128_relu_small_nn
0.01% 5.4440ms 30 181.47us 44.800us 444.79us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::TransposeExExp<mshadow::Tensor<mshadow::gpu, int=3, float>, float, int=3>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
0.01% 4.8118ms 3592 1.3390us 1.0560us 11.712us cudnn::maxwell::gemm::computeBOffsetsKernel(cudnn::maxwell::gemm::ComputeBOffsetsParams)
0.01% 4.4570ms 200 22.284us 5.7600us 58.079us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
0.01% 3.8114ms 340 11.209us 3.9360us 44.800us void add_tensor_kernel_v3<int=2, float, float, int=128, int=1, int=1, int=4, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.01% 3.5417ms 6 590.28us 261.18us 1.0680ms maxwell_sgemm_128x64_nn
0.01% 3.4490ms 30 114.97us 15.680us 531.19us void cudnn::detail::dgrad_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
0.01% 3.3751ms 240 14.062us 3.5200us 43.359us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=4, float>, float>, mshadow::expr::Plan<mshadow::expr::TransposeExExp<mshadow::Tensor<mshadow::gpu, int=4, float>, float, int=4>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=4)
0.01% 3.3494ms 8 418.68us 65.023us 1.2452ms maxwell_scudnn_128x128_stridedB_small_nn
0.01% 2.7486ms 4 687.15us 315.74us 1.6725ms void cudnn::detail::dgrad2d_alg1_1<float, int=0, int=4, int=6, int=3, int=2, int=4, bool=1, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad2d_alg1_1<float, int=0, int=4, int=6, int=3, int=2, int=4, bool=1, bool=1>*, kernel_grad_params, int, int, float, int, int)
0.01% 2.3732ms 7 339.02us 120.38us 660.63us maxwell_scudnn_128x32_stridedB_interior_nn
0.00% 2.0287ms 360 5.6350us 1.9190us 44.959us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
0.00% 1.9651ms 10 196.51us 55.679us 386.81us maxwell_scudnn_128x128_stridedB_interior_nn
0.00% 1.9393ms 7 277.04us 118.82us 541.11us maxwell_scudnn_128x64_stridedB_interior_nn
0.00% 1.6591ms 4 414.76us 313.92us 517.11us void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.00% 1.6013ms 10 160.13us 159.61us 161.12us _ZN5mxnet2op8mxnet_op23mxnet_generic_kernel_exINS1_23binary_broadcast_kernelILi2EfNS0_10mshadow_op5minusEEEJNS_9OpReqTypeEN7mshadow5ShapeILi2EEESA_SA_PfSB_SB_EEEviDpT0_
0.00% 1.5004ms 20 75.018us 71.839us 77.759us [CUDA memcpy DtoD]
0.00% 1.0625ms 600 1.7700us 1.2800us 6.0800us void mshadow::cuda::AssignPriors<float>(float*, float, float, int, int, float, float, float, float, int, int)
0.00% 647.51us 1 647.51us 647.51us 647.51us void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.00% 495.90us 2 247.95us 82.623us 413.27us void cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, bool=0>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, bool=0>*, kernel_grad_params, int, int, float, int)
0.00% 477.50us 20 23.875us 21.760us 39.360us void cudnn::detail::softmax_fw_channel_4d_kernel<float, float, int=256, int=1>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_channel_4d_kernel<float, float, int=256, int=1>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
0.00% 436.95us 3 145.65us 101.21us 178.46us void cudnn::detail::wgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>(int, int, int, float const *, int, cudnn::detail::wgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>*, float const , kernel_grad_params, int, float, float, int, int, int*, kernel_grad_params, int, int)
0.00% 413.85us 60 6.8970us 6.0800us 10.816us void add_tensor_kernel_v3<int=2, float, float, int=16, int=16, int=1, int=16, int=4>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.00% 366.46us 2 183.23us 55.263us 311.20us void cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>*, kernel_grad_params, int, int, float, int)
0.00% 126.02us 132 954ns 256ns 5.8240us [CUDA memset]
0.00% 58.207us 2 29.103us 23.423us 34.784us void cudnn::detail::wgrad_alg1_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, bool=0>(int, int, int, float const *, int, cudnn::detail::wgrad_alg1_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, bool=0>*, float const , kernel_grad_params, int, float, float, int, int, int*, kernel_grad_params, int, int)
API calls: 42.89% 26.1098s 859 30.396ms 6.9430us 1.95466s cudaEventSynchronize
26.07% 15.8717s 162083 97.923us 20.864us 13.894ms cudaLaunch
7.95% 4.84120s 486 9.9613ms 5.7280us 4.83418s cudaMemGetInfo
6.36% 3.86955s 837 4.6231ms 1.5360us 1.00160s cudaFree
5.35% 3.25567s 11 295.97ms 76.031us 3.25470s cudaStreamCreateWithFlags
4.51% 2.74332s 1180 2.3249ms 16.480us 216.81ms cudaMalloc
3.50% 2.13095s 71875 29.648us 1.6310us 41.668ms cudaEventRecord
1.26% 767.62ms 1629948 470ns 287ns 1.3909ms cudaSetupArgument
0.98% 597.08ms 110330 5.4110us 1.6320us 1.7959ms cudaStreamWaitEvent
0.42% 258.02ms 692 372.86us 6.0800us 13.949ms cudaStreamSynchronize
0.18% 109.38ms 167318 653ns 256ns 1.5854ms cudaGetLastError
0.18% 108.66ms 162083 670ns 383ns 618.42us cudaConfigureCall
0.09% 57.635ms 8931 6.4530us 3.8400us 145.66us cudaBindTexture
0.07% 40.900ms 346 118.21us 25.856us 2.4513ms cudaMemcpy2DAsync
0.05% 31.400ms 4651 6.7510us 3.0720us 78.110us cudaEventCreate
0.04% 24.281ms 8931 2.7180us 1.4080us 30.176us cudaUnbindTexture
0.03% 18.744ms 4954 3.7830us 1.8240us 120.54us cudaEventDestroy
0.02% 10.857ms 859 12.639us 3.5840us 114.94us cudaEventElapsedTime
0.01% 8.7803ms 132 66.517us 13.056us 179.84us cudaMemsetAsync
0.01% 8.3440ms 4024 2.0730us 832ns 26.752us cudaSetDevice
0.01% 5.9750ms 1714 3.4850us 736ns 50.815us cudaDeviceGetAttribute
0.01% 3.3216ms 414 8.0230us 1.5680us 70.975us cudaEventCreateWithFlags
0.00% 2.8665ms 1204 2.3800us 896ns 58.815us cudaGetDevice
0.00% 972.91us 1386 701ns 384ns 2.8480us cudaPeekAtLastError
0.00% 602.04us 6 100.34us 28.639us 268.99us cudaMemcpy
0.00% 378.65us 367 1.0310us 320ns 41.727us cuDeviceGetAttribute
0.00% 307.32us 4 76.830us 76.511us 77.119us cudaStreamCreateWithPriority
0.00% 220.45us 1 220.45us 220.45us 220.45us cudaHostAlloc
0.00% 134.88us 1 134.88us 134.88us 134.88us cudaStreamCreate
0.00% 49.086us 4 12.271us 6.1760us 16.223us cuDeviceTotalMem
0.00% 36.479us 1 36.479us 36.479us 36.479us cudaGetDeviceProperties
0.00% 17.184us 6 2.8640us 992ns 8.1280us cuDeviceGetCount
0.00% 12.800us 2 6.4000us 4.8960us 7.9040us cudaGetDeviceCount
0.00% 7.2000us 4 1.8000us 1.2800us 3.2320us cuDeviceGetName
0.00% 7.0080us 5 1.4010us 480ns 3.1360us cuDeviceGet
0.00% 5.9200us 3 1.9730us 1.8240us 2.1760us cuInit
0.00% 4.8320us 1 4.8320us 4.8320us 4.8320us cudaHostGetDevicePointer
0.00% 3.9680us 3 1.3220us 1.0240us 1.6000us cuDriverGetVersion
0.00% 2.4000us 1 2.4000us 2.4000us 2.4000us cudaDeviceGetStreamPriorityRange
I found they use different GPU kernels:
TVM uses fused kernels with names like fuse_conv2d_broadcast_add_relu_11_kernel0,
but MXNet does not use that kind of fused kernel.
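The name fuse_conv2d_broadcast_add_relu comes from TVM's operator fusion: a convolution, a broadcast add, and a ReLU are compiled into one kernel launch, whereas MXNet dispatches them as separate cuDNN/cuBLAS kernels (the maxwell_* and cudnn::detail::* entries in the profile above). A toy NumPy sketch of the idea, using a 1-D "convolution" for brevity (all function names here are hypothetical, just to illustrate fused vs. unfused execution):

```python
import numpy as np

def conv1d(x, w):
    # naive valid 1-D correlation, stand-in for conv2d
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def unfused(x, w, b):
    # three separate passes over memory, like three kernel launches
    y = conv1d(x, w)          # conv2d
    y = y + b                 # broadcast_add
    return np.maximum(y, 0.0) # relu

def fused(x, w, b):
    # one pass: each output element is produced and post-processed once,
    # analogous to a single fuse_conv2d_broadcast_add_relu kernel
    n = len(x) - len(w) + 1
    return np.array([max(np.dot(x[i:i + len(w)], w) + b, 0.0)
                     for i in range(n)])

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
w = np.array([1.0, 0.5])
b = -1.0
assert np.allclose(unfused(x, w, b), fused(x, w, b))
```

Both paths compute the same values; the fused version simply avoids writing the intermediate results back to memory between stages, which is where TVM gains over separate kernel launches.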
By the way, do I need to worry about the following warning, or can I ignore it?
WARNING:autotvm:Cannot find config for target=cuda -libs=cudnn,cublas, workload=('conv2d', (1, 128, 1, 1, 'float32'), (16, 128, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
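The warning means autotvm has no tuned schedule for that exact conv2d workload in its config library, so it falls back to a generic default schedule, which can be noticeably slower than a tuned one. The workload tuple in the warning encodes the op and its shapes; a small sketch decoding it (the field order shown is my assumption about autotvm's conv2d workload layout, not something confirmed by the log):

```python
# Workload tuple copied verbatim from the warning above.
workload = ('conv2d', (1, 128, 1, 1, 'float32'), (16, 128, 3, 3, 'float32'),
            (1, 1), (1, 1), (1, 1), 'NCHW', 'float32')

# Assumed field order: op, data, kernel, strides, padding, dilation,
# layout, out_dtype.
op, data, kernel, strides, padding, dilation, layout, out_dtype = workload
n, c, h, w, _ = data       # NCHW input: batch 1, 128 channels, 1x1 spatial
o, i, kh, kw, _ = kernel   # 16 output channels, 3x3 kernel

print(f"{op}: {c}-ch {h}x{w} input -> {o} ch, {kh}x{kw} kernel, "
      f"stride {strides}, pad {padding}, layout {layout}")
```

Reading it this way, the untuned layer is a small 1x1-spatial conv2d, so the fallback likely costs little here, but tuning the model with autotvm (or loading a pretuned log) would make the warning go away and may improve performance.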