mxnet
[12:59:02] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[12:59:03] src/nnvm/legacy_json_util.cc:204: Warning: loading symbol saved by MXNet version 10500 with lower version of MXNet v10400. May cause undefined behavior. Please update MXNet if you encounter any issue
==6435== NVPROF is profiling process 6435, command: python detect_tvm_ssd_det.py
==6435== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
[12:59:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[13:00:05] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Time : 0.03228187561035156
Time : 0.03231477737426758
Time : 0.03212261199951172
Time : 0.03220796585083008
Time : 0.03215146064758301
Time : 0.03214383125305176
Time : 0.03216266632080078
Time : 0.0321197509765625
Time : 0.0322270393371582
Time : 0.032259225845336914
==6435== Profiling application: python detect_tvm_ssd_det.py
[13:00:37] src/engine/naive_engine.cc:69: Engine shutdown
==6435== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 48.35% 21.1777s 6894 3.0719ms 231.84us 432.84ms maxwell_cgemm_64x64_tn
24.13% 10.5691s 11893 888.68us 839.67us 93.485ms maxwell_gcgemm_64x64_nt
4.77% 2.08878s 2340 892.64us 158.40us 14.699ms maxwell_sgemm_128x128_nt
4.65% 2.03762s 2294 888.24us 162.62us 15.071ms maxwell_sgemm_128x128_nn
2.55% 1.11747s 40 27.937ms 877.59us 106.65ms maxwell_cgemm_32x64_tn
1.68% 737.19ms 6840 107.78us 8.9600us 60.059ms void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *)
0.84% 369.00ms 5402 68.308us 6.1440us 126.68ms void fft2d_r2c_32x32<float, unsigned int=1, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool)
0.81% 355.01ms 5882 60.355us 3.5200us 49.113ms void fft2d_r2c_16x16<float>(float2*, float const *, int, int, int, int, int, int, int, int)
0.73% 318.53ms 210 1.5168ms 25.280us 158.52ms void DSE::regular_fft_pad<int=0, int=1, int=256, int=16, int=16, int=1, float, float, float2>(float2*, float*, int, int3, float*, int, float*, float*, int, int, int, int, int, bool)
0.57% 249.74ms 2300 108.58us 6.0800us 123.73ms void fft2d_r2c_32x32<float, unsigned int=1, bool=1>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool)
0.56% 246.82ms 2260 109.21us 49.503us 579.67us maxwell_sgemm_128x64_nt
0.49% 214.37ms 6934 30.916us 3.0400us 4.3442ms compute_gemm_pointers(float2**, float2 const *, int, float2 const *, int, float2 const *, int, int)
0.47% 203.77ms 210 970.34us 18.656us 92.703ms void DSE::vector_fft<int=0, int=1, int=256, int=16, int=16, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.46% 199.98ms 69 2.8983ms 80.222us 47.964ms maxwell_gcgemm_32x32_nt
0.42% 184.82ms 2941 62.843us 4.3200us 42.412ms void fft2d_c2r_16x16<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.40% 174.22ms 3492 49.891us 9.2800us 23.737ms void fft2d_r2c_64x64<float>(float2*, float const *, int, int, int, int, int, int, int, int)
0.39% 169.90ms 4576 37.128us 7.4230us 574.14us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148t_nt
0.38% 164.55ms 105 1.5671ms 50.463us 70.989ms void DSE::regular_fft_clip<int=1, int=2, int=256, int=16, int=16, int=1, float, float, float2>(float*, float2*, int, int3, float2*, int, float2*, float2*, int, int, int, int, int, float, float, bool, int, float, float)
0.37% 162.21ms 576 281.61us 149.37us 508.25us void fermiCgemm_v3_kernel<bool=1, bool=0, bool=0, bool=0, int=5, int=5, int=3, int=8, int=8>(int, int, int, float2 const *, int, float2 const *, int, float2*, int, int, int, float2 const *, float2 const *, float2, float2, int)
0.35% 151.73ms 12 12.644ms 4.2643ms 27.578ms void pointwise_mult_and_sum_complex<float2, int=8, int=4>(float2*, float2*, float2*, int, int, int, int, int, float2)
0.25% 110.52ms 882 125.30us 10.304us 16.130ms void DSE::regular_fft_pad<int=0, int=1, int=128, int=16, int=32, int=1, float, float, float2>(float2*, float*, int, int3, float*, int, float*, float*, int, int, int, int, int, bool)
0.25% 108.32ms 1060 102.19us 7.3600us 1.5049ms void cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>(float, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>, cudnnTensorStruct, float const *, float, cudnnTensorStruct*, float, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const *, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>)
0.24% 103.58ms 441 234.87us 30.015us 31.348ms void DSE::regular_fft_clip<int=1, int=2, int=128, int=16, int=32, int=1, float, float, float2>(float*, float2*, int, int3, float2*, int, float2*, float2*, int, int, int, int, int, float, float, bool, int, float, float)
0.23% 100.17ms 12268 8.1640us 6.0800us 249.60us void fft2d_r2c_32x32<float, unsigned int=0, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool)
0.22% 98.205ms 4833 20.319us 6.4640us 22.589ms void fft2d_c2r_32x32<float, bool=0, unsigned int=1, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*)
0.21% 92.713ms 1746 53.100us 10.239us 21.309ms void fft2d_c2r_64x64<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.21% 92.413ms 880 105.02us 3.2640us 1.2833ms void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
0.21% 92.105ms 6 15.351ms 2.6429ms 35.988ms cudnn_maxwell_cgemm_64x64_tn_batched
0.20% 89.209ms 882 101.14us 5.4400us 14.765ms void DSE::vector_fft<int=0, int=1, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.20% 86.793ms 4600 18.868us 4.4800us 1.6882ms void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::WinogradDataParams<float, float>)
0.18% 79.550ms 210 378.81us 199.84us 1.1408ms void conv2d_c1_k1_nchw_hw_packed_kernel<float, float, int=3>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, int)
0.18% 77.497ms 274 282.83us 183.04us 5.6100ms maxwell_scudnn_128x32_stridedB_splitK_large_nn
0.17% 73.959ms 244 303.11us 40.320us 1.4893ms maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
0.17% 73.075ms 4600 15.885us 4.0960us 1.5974ms void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::WinogradOutputParams<float, float>)
0.17% 72.596ms 4282 16.953us 1.2160us 15.880ms void flip_filter<float, float>(float*, float const *, int, int, int, int)
0.16% 69.863ms 164 426.00us 123.61us 4.0056ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.15% 67.758ms 8800 7.6990us 6.0800us 25.664us void fft2d_c2r_32x32<float, bool=0, unsigned int=0, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*)
0.15% 66.270ms 3514 18.858us 6.4640us 2.3958ms maxwell_scudnn_128x32_relu_small_nn
0.15% 65.649ms 1 65.649ms 65.649ms 65.649ms void cudnn::detail::dgrad2d_alg1_1<float, int=0, int=6, int=7, int=5, int=4, int=5, bool=1, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad2d_alg1_1<float, int=0, int=6, int=7, int=5, int=4, int=5, bool=1, bool=1>*, kernel_grad_params, int, int, float, int, int)
0.13% 56.292ms 105 536.11us 20.319us 22.904ms void DSE::vector_fft<int=1, int=2, int=256, int=16, int=16, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.12% 54.625ms 4820 11.332us 3.3600us 673.59us void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
0.12% 54.607ms 248 220.19us 69.919us 418.94us maxwell_scudnn_128x128_relu_interior_nn
0.12% 52.703ms 1032 51.068us 11.456us 4.9740ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.12% 52.081ms 120 434.01us 34.079us 7.9281ms void fft1d_r2c_32<float, float, float2, bool=0, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
0.11% 50.127ms 207 242.16us 111.58us 495.35us maxwell_scudnn_128x64_relu_interior_nn
0.11% 48.450ms 2506 19.333us 9.2800us 3.3782ms maxwell_scudnn_128x32_stridedB_splitK_small_nn
0.10% 45.381ms 169 268.52us 89.598us 1.9236ms maxwell_scudnn_128x32_relu_interior_nn
0.10% 45.364ms 441 102.87us 9.5990us 14.754ms void DSE::vector_fft<int=1, int=2, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.10% 44.431ms 720 61.709us 59.679us 71.903us maxwell_scudnn_128x32_stridedB_splitK_medium_nn
0.10% 42.637ms 3609 11.813us 4.3200us 1.6868ms void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
0.10% 42.563ms 150 283.75us 99.199us 1.2750ms void conv2d_grouped_direct_kernel<float, float, float, float, float, bool=1, bool=0, int=0, int=1, int=3>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, int, float const *, float const *, cudnnActivationStruct)
0.09% 40.021ms 2300 17.400us 4.6390us 205.85us void cudnn::winograd_nonfused::winogradWgradData4x4<float, float>(cudnn::winograd_nonfused::WinogradDataParams<float, float>)
0.09% 39.725ms 30 1.3242ms 220.25us 7.7380ms maxwell_gcgemm_64x32_nt
0.09% 38.967ms 24 1.6236ms 37.439us 11.390ms void fft1d_r2c_32<float, float, float2, bool=1, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
0.09% 37.783ms 119 317.50us 47.743us 3.0068ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.08% 36.332ms 30 1.2111ms 178.49us 4.0763ms void dgrad2d_grouped_direct_kernel<float, float, bool=1, int=1, int=3>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnnTensorStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor)
0.08% 36.230ms 4 9.0576ms 5.4853ms 12.727ms cudnn_maxwell_cgemm_32x64_tn_batched
0.08% 33.743ms 12 2.8119ms 349.60us 8.9944ms maxwell_scudnn_128x128_stridedB_splitK_small_nn
0.07% 32.838ms 4600 7.1380us 1.6640us 1.1086ms void cudnn::winograd_nonfused::winogradForwardFilter4x4<float, float>(cudnn::winograd_nonfused::WinogradFilterParams<float, float>)
0.07% 29.666ms 2576 11.516us 5.5030us 1.9406ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.07% 29.033ms 30 967.75us 379.26us 1.8717ms void wgrad2d_grouped_direct_kernel<float, float>(cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, cudnnConvolutionStruct, cudnnFilterStruct, float*, float, float, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, cudnn::reduced_divisor, int)
0.06% 28.074ms 2300 12.206us 7.4560us 2.6572ms void cudnn::winograd_nonfused::winogradWgradOutput4x4<float, float>(cudnn::winograd_nonfused::WinogradWgradOutputParams<float, float>)
0.06% 26.482ms 26 1.0185ms 225.28us 5.0909ms void cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
0.06% 25.497ms 306 83.324us 7.6160us 3.8221ms void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
0.06% 24.909ms 2 12.454ms 9.7207ms 15.188ms cudnn_maxwell_cgemm_64x32_tn_batched
0.05% 23.487ms 2300 10.211us 3.1990us 109.06us void cudnn::winograd_nonfused::winogradWgradDelta4x4<float, float>(cudnn::winograd_nonfused::WinogradDeltaParams<float, float>)
0.04% 18.692ms 32 584.11us 16.864us 3.2814ms void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
0.04% 17.364ms 16 1.0852ms 56.415us 3.5814ms maxwell_scudnn_128x128_stridedB_splitK_interior_nn
0.03% 13.862ms 332 41.752us 576ns 2.1816ms [CUDA memcpy HtoD]
0.03% 13.822ms 35 394.90us 23.840us 1.6082ms void cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
0.03% 13.664ms 18 759.13us 227.90us 2.1301ms maxwell_scudnn_128x32_stridedB_splitK_interior_nn
0.03% 13.467ms 26 517.98us 127.58us 1.8329ms void cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
0.02% 10.577ms 14 755.50us 335.90us 1.7558ms maxwell_scudnn_128x64_stridedB_splitK_interior_nn
0.02% 10.408ms 3560 2.9230us 1.2160us 49.183us cudnn::maxwell::gemm::computeWgradOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.02% 10.105ms 1152 8.7710us 2.7200us 27.840us void gemmk1_kernel<float2, int=256, int=5, bool=1, bool=0, bool=0, bool=0>(cublasGemmk1Params<float2>, float2 const *, float2 const *, float2*)
0.02% 9.9490ms 4176 2.3820us 1.2800us 85.854us cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.02% 9.9077ms 3651 2.7130us 799ns 354.11us void scalePackedTensor_kernel<float, float>(cudnnTensor4dStruct, float*, float)
0.02% 8.2953ms 20 414.77us 161.85us 1.2357ms void cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
0.02% 8.2477ms 96 85.913us 19.776us 277.28us void fft1d_c2r_32<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float*, float2 const *, int, int3, int3, int2, int, float, float, float*, float*)
0.01% 5.8701ms 6 978.34us 278.72us 2.3644ms maxwell_scudnn_128x128_relu_small_nn
0.01% 5.4440ms 30 181.47us 44.800us 444.79us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::TransposeExExp<mshadow::Tensor<mshadow::gpu, int=3, float>, float, int=3>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
0.01% 4.8118ms 3592 1.3390us 1.0560us 11.712us cudnn::maxwell::gemm::computeBOffsetsKernel(cudnn::maxwell::gemm::ComputeBOffsetsParams)
0.01% 4.4570ms 200 22.284us 5.7600us 58.079us _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
0.01% 3.8114ms 340 11.209us 3.9360us 44.800us void add_tensor_kernel_v3<int=2, float, float, int=128, int=1, int=1, int=4, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.01% 3.5417ms 6 590.28us 261.18us 1.0680ms maxwell_sgemm_128x64_nn
0.01% 3.4490ms 30 114.97us 15.680us 531.19us void cudnn::detail::dgrad_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
0.01% 3.3751ms 240 14.062us 3.5200us 43.359us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=4, float>, float>, mshadow::expr::Plan<mshadow::expr::TransposeExExp<mshadow::Tensor<mshadow::gpu, int=4, float>, float, int=4>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=4)
0.01% 3.3494ms 8 418.68us 65.023us 1.2452ms maxwell_scudnn_128x128_stridedB_small_nn
0.01% 2.7486ms 4 687.15us 315.74us 1.6725ms void cudnn::detail::dgrad2d_alg1_1<float, int=0, int=4, int=6, int=3, int=2, int=4, bool=1, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad2d_alg1_1<float, int=0, int=4, int=6, int=3, int=2, int=4, bool=1, bool=1>*, kernel_grad_params, int, int, float, int, int)
0.01% 2.3732ms 7 339.02us 120.38us 660.63us maxwell_scudnn_128x32_stridedB_interior_nn
0.00% 2.0287ms 360 5.6350us 1.9190us 44.959us void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
0.00% 1.9651ms 10 196.51us 55.679us 386.81us maxwell_scudnn_128x128_stridedB_interior_nn
0.00% 1.9393ms 7 277.04us 118.82us 541.11us maxwell_scudnn_128x64_stridedB_interior_nn
0.00% 1.6591ms 4 414.76us 313.92us 517.11us void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.00% 1.6013ms 10 160.13us 159.61us 161.12us _ZN5mxnet2op8mxnet_op23mxnet_generic_kernel_exINS1_23binary_broadcast_kernelILi2EfNS0_10mshadow_op5minusEEEJNS_9OpReqTypeEN7mshadow5ShapeILi2EEESA_SA_PfSB_SB_EEEviDpT0_
0.00% 1.5004ms 20 75.018us 71.839us 77.759us [CUDA memcpy DtoD]
0.00% 1.0625ms 600 1.7700us 1.2800us 6.0800us void mshadow::cuda::AssignPriors<float>(float*, float, float, int, int, float, float, float, float, int, int)
0.00% 647.51us 1 647.51us 647.51us 647.51us void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.00% 495.90us 2 247.95us 82.623us 413.27us void cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, bool=0>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1, bool=0>*, kernel_grad_params, int, int, float, int)
0.00% 477.50us 20 23.875us 21.760us 39.360us void cudnn::detail::softmax_fw_channel_4d_kernel<float, float, int=256, int=1>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_channel_4d_kernel<float, float, int=256, int=1>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
0.00% 436.95us 3 145.65us 101.21us 178.46us void cudnn::detail::wgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>(int, int, int, float const *, int, cudnn::detail::wgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>*, float const , kernel_grad_params, int, float, float, int, int, int*, kernel_grad_params, int, int)
0.00% 413.85us 60 6.8970us 6.0800us 10.816us void add_tensor_kernel_v3<int=2, float, float, int=16, int=16, int=1, int=16, int=4>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.00% 366.46us 2 183.23us 55.263us 311.20us void cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_alg1_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1, bool=0>*, kernel_grad_params, int, int, float, int)
0.00% 126.02us 132 954ns 256ns 5.8240us [CUDA memset]
0.00% 58.207us 2 29.103us 23.423us 34.784us void cudnn::detail::wgrad_alg1_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, bool=0>(int, int, int, float const *, int, cudnn::detail::wgrad_alg1_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, bool=0>*, float const , kernel_grad_params, int, float, float, int, int, int*, kernel_grad_params, int, int)
API calls: 42.89% 26.1098s 859 30.396ms 6.9430us 1.95466s cudaEventSynchronize
26.07% 15.8717s 162083 97.923us 20.864us 13.894ms cudaLaunch
7.95% 4.84120s 486 9.9613ms 5.7280us 4.83418s cudaMemGetInfo
6.36% 3.86955s 837 4.6231ms 1.5360us 1.00160s cudaFree
5.35% 3.25567s 11 295.97ms 76.031us 3.25470s cudaStreamCreateWithFlags
4.51% 2.74332s 1180 2.3249ms 16.480us 216.81ms cudaMalloc
3.50% 2.13095s 71875 29.648us 1.6310us 41.668ms cudaEventRecord
1.26% 767.62ms 1629948 470ns 287ns 1.3909ms cudaSetupArgument
0.98% 597.08ms 110330 5.4110us 1.6320us 1.7959ms cudaStreamWaitEvent
0.42% 258.02ms 692 372.86us 6.0800us 13.949ms cudaStreamSynchronize
0.18% 109.38ms 167318 653ns 256ns 1.5854ms cudaGetLastError
0.18% 108.66ms 162083 670ns 383ns 618.42us cudaConfigureCall
0.09% 57.635ms 8931 6.4530us 3.8400us 145.66us cudaBindTexture
0.07% 40.900ms 346 118.21us 25.856us 2.4513ms cudaMemcpy2DAsync
0.05% 31.400ms 4651 6.7510us 3.0720us 78.110us cudaEventCreate
0.04% 24.281ms 8931 2.7180us 1.4080us 30.176us cudaUnbindTexture
0.03% 18.744ms 4954 3.7830us 1.8240us 120.54us cudaEventDestroy
0.02% 10.857ms 859 12.639us 3.5840us 114.94us cudaEventElapsedTime
0.01% 8.7803ms 132 66.517us 13.056us 179.84us cudaMemsetAsync
0.01% 8.3440ms 4024 2.0730us 832ns 26.752us cudaSetDevice
0.01% 5.9750ms 1714 3.4850us 736ns 50.815us cudaDeviceGetAttribute
0.01% 3.3216ms 414 8.0230us 1.5680us 70.975us cudaEventCreateWithFlags
0.00% 2.8665ms 1204 2.3800us 896ns 58.815us cudaGetDevice
0.00% 972.91us 1386 701ns 384ns 2.8480us cudaPeekAtLastError
0.00% 602.04us 6 100.34us 28.639us 268.99us cudaMemcpy
0.00% 378.65us 367 1.0310us 320ns 41.727us cuDeviceGetAttribute
0.00% 307.32us 4 76.830us 76.511us 77.119us cudaStreamCreateWithPriority
0.00% 220.45us 1 220.45us 220.45us 220.45us cudaHostAlloc
0.00% 134.88us 1 134.88us 134.88us 134.88us cudaStreamCreate
0.00% 49.086us 4 12.271us 6.1760us 16.223us cuDeviceTotalMem
0.00% 36.479us 1 36.479us 36.479us 36.479us cudaGetDeviceProperties
0.00% 17.184us 6 2.8640us 992ns 8.1280us cuDeviceGetCount
0.00% 12.800us 2 6.4000us 4.8960us 7.9040us cudaGetDeviceCount
0.00% 7.2000us 4 1.8000us 1.2800us 3.2320us cuDeviceGetName
0.00% 7.0080us 5 1.4010us 480ns 3.1360us cuDeviceGet
0.00% 5.9200us 3 1.9730us 1.8240us 2.1760us cuInit
0.00% 4.8320us 1 4.8320us 4.8320us 4.8320us cudaHostGetDevicePointer
0.00% 3.9680us 3 1.3220us 1.0240us 1.6000us cuDriverGetVersion
0.00% 2.4000us 1 2.4000us 2.4000us 2.4000us cudaDeviceGetStreamPriorityRange
I found they use different GPU kernels:
TVM uses fused kernels with names like fuse_conv2d_broadcast_add_relu_11_kernel0,
but MXNet does not use that kind of fused kernel.
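The name fuse_conv2d_broadcast_add_relu comes from TVM's operator fusion: a convolution, a broadcast add, and a ReLU are compiled into one kernel launch, whereas MXNet dispatches them as separate cuDNN/cuBLAS kernels (the maxwell_* and cudnn::detail::* entries in the profile above). A toy NumPy sketch of the idea, using a 1-D "convolution" for brevity (all function names here are hypothetical, just to illustrate fused vs. unfused execution):

```python
import numpy as np

def conv1d(x, w):
    # naive valid 1-D correlation, stand-in for conv2d
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def unfused(x, w, b):
    # three separate passes over memory, like three kernel launches
    y = conv1d(x, w)          # conv2d
    y = y + b                 # broadcast_add
    return np.maximum(y, 0.0) # relu

def fused(x, w, b):
    # one pass: each output element is produced and post-processed once,
    # analogous to a single fuse_conv2d_broadcast_add_relu kernel
    n = len(x) - len(w) + 1
    return np.array([max(np.dot(x[i:i + len(w)], w) + b, 0.0)
                     for i in range(n)])

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
w = np.array([1.0, 0.5])
b = -1.0
assert np.allclose(unfused(x, w, b), fused(x, w, b))
```

Both paths compute the same values; the fused version simply avoids writing the intermediate results back to memory between stages, which is where TVM gains over separate kernel launches.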
By the way, do I need to worry about the following warning, or can I ignore it?
WARNING:autotvm:Cannot find config for target=cuda -libs=cudnn,cublas, workload=('conv2d', (1, 128, 1, 1, 'float32'), (16, 128, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
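The warning means autotvm has no tuned schedule for that exact conv2d workload in its config library, so it falls back to a generic default schedule, which can be noticeably slower than a tuned one. The workload tuple in the warning encodes the op and its shapes; a small sketch decoding it (the field order shown is my assumption about autotvm's conv2d workload layout, not something confirmed by the log):

```python
# Workload tuple copied verbatim from the warning above.
workload = ('conv2d', (1, 128, 1, 1, 'float32'), (16, 128, 3, 3, 'float32'),
            (1, 1), (1, 1), (1, 1), 'NCHW', 'float32')

# Assumed field order: op, data, kernel, strides, padding, dilation,
# layout, out_dtype.
op, data, kernel, strides, padding, dilation, layout, out_dtype = workload
n, c, h, w, _ = data       # NCHW input: batch 1, 128 channels, 1x1 spatial
o, i, kh, kw, _ = kernel   # 16 output channels, 3x3 kernel

print(f"{op}: {c}-ch {h}x{w} input -> {o} ch, {kh}x{kw} kernel, "
      f"stride {strides}, pad {padding}, layout {layout}")
```

Reading it this way, the untuned layer is a small 1x1-spatial conv2d, so the fallback likely costs little here, but tuning the model with autotvm (or loading a pretuned log) would make the warning go away and may improve performance.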