[AutoTVM] Selective tuning of hotspots

Dear community,

I’m currently trying to reduce overall AutoTVM tuning time by selectively tuning only the kernels that are actual hotspots in the application.

Hotspot detection can be done fairly easily, e.g. with the debug runtime, which prints a detailed per-node profile when run() is executed.
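For concreteness, the profiling step looks roughly like this (a minimal sketch against the debug runtime API as of TVM 0.6/0.7; graph, lib and params are what relay.build returns):

import tvm
from tvm.contrib.debugger import debug_runtime

# Build the debug variant of the graph runtime and run one inference;
# run() prints a per-node time table like the one further down this post.
m = debug_runtime.create(graph, lib, tvm.cpu(0))
m.set_input("data", data)
m.set_input(**params)
m.run()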

My question is how to match these identified operations to the kernels selected by AutoTVM.

On the one hand, the profile information is a prioritized list of nodes, mostly identified by the name of their fused function (the full table is shown further down).

On the other hand, when selecting the tasks to be tuned, kernels = autotvm.task.extract_from_program(ir["main"], target=target, params=params, ops=None) gives you a list of Task objects, e.g.:

Task(func_name=dense_nopack.x86, args=(('TENSOR', (1, 16), 'float32'), ('TENSOR', (64, 16), 'float32'), None, 'float32'), kwargs={}, workload=('dense_nopack.x86', ('TENSOR', (1, 16), 'float32'), ('TENSOR', (64, 16), 'float32'), None, 'float32'))
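For reference, the fields of such a Task object that matter here (as exposed by the autotvm API):

task.name      # 'dense_nopack.x86' -- the schedule template being tuned
task.args      # (('TENSOR', ...), ...) -- the data and weight placeholders
task.workload  # hashable key; as far as I can tell, the same key ends up in the tuning log records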

My question is: how can such Tasks be matched to their IR counterparts?

Any help, ideas, or suggestions are much appreciated! Thank you & best regards

It’s a bit tricky. For now you can only match op type and shape.

I see! That’s a pity if convolutions with the same shapes coexist… Maybe they can still be associated somehow… Anyhow, thank you very much for your answer :slight_smile:

Well… if two or more convs have the same shape (both input and weight), then they will be the same tuning task. The tricky part is that it’s not straightforward to see the weight shape from the debug runtime log.

Damn, you are right! Hm… matching the debug runtime output to the IR is fairly easy. I don’t know whether the shapes are somehow encoded in the LLVM IR as well. My guess: no…

Alright! I performed some profiling. I’ve taken this tutorial as a basis, using only VGG-16.

This is the output of the debug runtime:

Node Name                                     Ops                                          Time(us)    Time(%)  Shape                 Inputs  Outputs  
---------                                     ---                                          --------    -------  -----                 ------  -------  
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5  107147.0    12.758   (1, 2, 112, 112, 64)  3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7  99490.1     11.847   (1, 2, 224, 224, 32)  3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1  96205.3     11.456   (1, 32, 28, 28, 16)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11  fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1  96064.8     11.439   (1, 32, 28, 28, 16)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_31  fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3  94304.2     11.229   (1, 8, 56, 56, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3  94182.4     11.215   (1, 8, 56, 56, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6  52424.5     6.242    (1, 2, 112, 112, 64)  3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4  48494.9     5.775    (1, 8, 56, 56, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2  45365.3     5.402    (1, 16, 28, 28, 32)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu1    fused_nn_contrib_conv2d_NCHWc_add_nn_relu    23580.5     2.808    (1, 32, 14, 14, 16)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu     fused_nn_contrib_conv2d_NCHWc_add_nn_relu    23557.3     2.805    (1, 32, 14, 14, 16)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu2    fused_nn_contrib_conv2d_NCHWc_add_nn_relu    23549.4     2.804    (1, 32, 14, 14, 16)   3       1        
fused_nn_dense_add_nn_relu_1                  fused_nn_dense_add_nn_relu_1                 19578.8     2.331    (1, 4096)             3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8   fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8  5776.86     0.688    (1, 2, 224, 224, 32)  3       1        
fused_nn_dense_add_nn_relu                    fused_nn_dense_add_nn_relu                   3206.76     0.382    (1, 4096)             3       1        
fused_layout_transform_19                     fused_layout_transform_19                    2011.35     0.24     (1, 4, 224, 224, 16)  1       1        
fused_nn_max_pool2d_4                         fused_nn_max_pool2d_4                        964.965     0.115    (1, 2, 112, 112, 32)  1       1        
fused_nn_dense_add                            fused_nn_dense_add                           784.874     0.093    (1, 1000)             3       1        
fused_layout_transform_171                    fused_layout_transform_17                    561.36      0.067    (1, 1, 56, 56, 256)   1       1        
fused_layout_transform_17                     fused_layout_transform_17                    559.474     0.067    (1, 1, 56, 56, 256)   1       1        
fused_nn_max_pool2d_3                         fused_nn_max_pool2d_3                        495.34      0.059    (1, 2, 56, 56, 64)    1       1        
fused_layout_transform_15                     fused_layout_transform_15                    289.293     0.034    (1, 1, 28, 28, 512)   1       1        
fused_layout_transform_14                     fused_layout_transform_14                    232.466     0.028    (1, 1, 28, 28, 512)   1       1        
fused_nn_max_pool2d_2                         fused_nn_max_pool2d_2                        221.099     0.026    (1, 8, 28, 28, 32)    1       1        
fused_layout_transform_nn_batch_flatten       fused_layout_transform_nn_batch_flatten      179.002     0.021    (1, 25088)            1       1        
fused_layout_transform_16                     fused_layout_transform_16                    135.138     0.016    (1, 1, 28, 28, 256)   1       1        
fused_nn_max_pool2d_1                         fused_nn_max_pool2d_1                        109.964     0.013    (1, 32, 14, 14, 16)   1       1        
fused_layout_transform_18                     fused_layout_transform_18                    101.463     0.012    (1, 4, 56, 56, 32)    1       1        
fused_layout_transform_20                     fused_layout_transform_20                    66.265      0.008    (1, 1, 224, 224, 3)   1       1        
fused_layout_transform_13                     fused_layout_transform_13                    46.61       0.006    (1, 1, 14, 14, 512)   1       1        
fused_layout_transform_131                    fused_layout_transform_13                    45.106      0.005    (1, 1, 14, 14, 512)   1       1        
fused_layout_transform_132                    fused_layout_transform_13                    38.617      0.005    (1, 1, 14, 14, 512)   1       1        
fused_nn_max_pool2d                           fused_nn_max_pool2d                          24.039      0.003    (1, 32, 7, 7, 16)     1       1        
fused_nn_softmax                              fused_nn_softmax                             16.067      0.002    (1, 1000)             1       1        
Total_time                                    -                                            839810.613  -        -                     -       -        

I also printed out the Relay IR for main:

fn (%data: Tensor[(1, 3, 224, 224), float32]) -> Tensor[(1, 1000), float32] {
  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 3, 3), float32] */ /* ty=Tensor[(64, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */ /* ty=Tensor[(64), float32] */) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %2 = nn.relu(%1) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %3 = nn.conv2d(%2, meta[relay.Constant][2] /* ty=Tensor[(64, 64, 3, 3), float32] */ /* ty=Tensor[(64, 64, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %4 = nn.bias_add(%3, meta[relay.Constant][3] /* ty=Tensor[(64), float32] */ /* ty=Tensor[(64), float32] */) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %5 = nn.relu(%4) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %6 = nn.max_pool2d(%5, pool_size=[2, 2], strides=[2, 2], layout="NCHW32c") /* ty=Tensor[(1, 64, 112, 112), float32] */;
  %7 = nn.conv2d(%6, meta[relay.Constant][4] /* ty=Tensor[(128, 64, 3, 3), float32] */ /* ty=Tensor[(128, 64, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 112, 112), float32] */;
  %8 = nn.bias_add(%7, meta[relay.Constant][5] /* ty=Tensor[(128), float32] */ /* ty=Tensor[(128), float32] */) /* ty=Tensor[(1, 128, 112, 112), float32] */;
  %9 = nn.relu(%8) /* ty=Tensor[(1, 128, 112, 112), float32] */;
  %10 = nn.conv2d(%9, meta[relay.Constant][6] /* ty=Tensor[(128, 128, 3, 3), float32] */ /* ty=Tensor[(128, 128, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 112, 112), float32] */;
  %11 = nn.bias_add(%10, meta[relay.Constant][7] /* ty=Tensor[(128), float32] */ /* ty=Tensor[(128), float32] */) /* ty=Tensor[(1, 128, 112, 112), float32] */;
  %12 = nn.relu(%11) /* ty=Tensor[(1, 128, 112, 112), float32] */;
  %13 = nn.max_pool2d(%12, pool_size=[2, 2], strides=[2, 2], layout="NCHW64c") /* ty=Tensor[(1, 128, 56, 56), float32] */;
  %14 = nn.conv2d(%13, meta[relay.Constant][8] /* ty=Tensor[(256, 128, 3, 3), float32] */ /* ty=Tensor[(256, 128, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %15 = nn.bias_add(%14, meta[relay.Constant][9] /* ty=Tensor[(256), float32] */ /* ty=Tensor[(256), float32] */) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %16 = nn.relu(%15) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %17 = nn.conv2d(%16, meta[relay.Constant][10] /* ty=Tensor[(256, 256, 3, 3), float32] */ /* ty=Tensor[(256, 256, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %18 = nn.bias_add(%17, meta[relay.Constant][11] /* ty=Tensor[(256), float32] */ /* ty=Tensor[(256), float32] */) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %19 = nn.relu(%18) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %20 = nn.conv2d(%19, meta[relay.Constant][12] /* ty=Tensor[(256, 256, 3, 3), float32] */ /* ty=Tensor[(256, 256, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %21 = nn.bias_add(%20, meta[relay.Constant][13] /* ty=Tensor[(256), float32] */ /* ty=Tensor[(256), float32] */) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %22 = nn.relu(%21) /* ty=Tensor[(1, 256, 56, 56), float32] */;
  %23 = nn.max_pool2d(%22, pool_size=[2, 2], strides=[2, 2], layout="NCHW32c") /* ty=Tensor[(1, 256, 28, 28), float32] */;
  %24 = nn.conv2d(%23, meta[relay.Constant][14] /* ty=Tensor[(512, 256, 3, 3), float32] */ /* ty=Tensor[(512, 256, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %25 = nn.bias_add(%24, meta[relay.Constant][15] /* ty=Tensor[(512), float32] */ /* ty=Tensor[(512), float32] */) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %26 = nn.relu(%25) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %27 = nn.conv2d(%26, meta[relay.Constant][16] /* ty=Tensor[(512, 512, 3, 3), float32] */ /* ty=Tensor[(512, 512, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %28 = nn.bias_add(%27, meta[relay.Constant][17] /* ty=Tensor[(512), float32] */ /* ty=Tensor[(512), float32] */) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %29 = nn.relu(%28) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %30 = nn.conv2d(%29, meta[relay.Constant][18] /* ty=Tensor[(512, 512, 3, 3), float32] */ /* ty=Tensor[(512, 512, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %31 = nn.bias_add(%30, meta[relay.Constant][19] /* ty=Tensor[(512), float32] */ /* ty=Tensor[(512), float32] */) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %32 = nn.relu(%31) /* ty=Tensor[(1, 512, 28, 28), float32] */;
  %33 = nn.max_pool2d(%32, pool_size=[2, 2], strides=[2, 2], layout="NCHW16c") /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %34 = nn.conv2d(%33, meta[relay.Constant][20] /* ty=Tensor[(512, 512, 3, 3), float32] */ /* ty=Tensor[(512, 512, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %35 = nn.bias_add(%34, meta[relay.Constant][21] /* ty=Tensor[(512), float32] */ /* ty=Tensor[(512), float32] */) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %36 = nn.relu(%35) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %37 = nn.conv2d(%36, meta[relay.Constant][22] /* ty=Tensor[(512, 512, 3, 3), float32] */ /* ty=Tensor[(512, 512, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %38 = nn.bias_add(%37, meta[relay.Constant][23] /* ty=Tensor[(512), float32] */ /* ty=Tensor[(512), float32] */) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %39 = nn.relu(%38) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %40 = nn.conv2d(%39, meta[relay.Constant][24] /* ty=Tensor[(512, 512, 3, 3), float32] */ /* ty=Tensor[(512, 512, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %41 = nn.bias_add(%40, meta[relay.Constant][25] /* ty=Tensor[(512), float32] */ /* ty=Tensor[(512), float32] */) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %42 = nn.relu(%41) /* ty=Tensor[(1, 512, 14, 14), float32] */;
  %43 = nn.max_pool2d(%42, pool_size=[2, 2], strides=[2, 2], layout="NCHW16c") /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %44 = nn.batch_flatten(%43) /* ty=Tensor[(1, 25088), float32] */;
  %45 = nn.dense(%44, meta[relay.Constant][26] /* ty=Tensor[(4096, 25088), float32] */ /* ty=Tensor[(4096, 25088), float32] */, units=4096) /* ty=Tensor[(1, 4096), float32] */;
  %46 = nn.bias_add(%45, meta[relay.Constant][27] /* ty=Tensor[(4096), float32] */ /* ty=Tensor[(4096), float32] */, axis=-1) /* ty=Tensor[(1, 4096), float32] */;
  %47 = nn.relu(%46) /* ty=Tensor[(1, 4096), float32] */;
  %48 = nn.dropout(%47) /* ty=(Tensor[(1, 4096), float32], Tensor[(1, 4096), float32]) */;
  %49 = %48.0;
  %50 = nn.dense(%49, meta[relay.Constant][28] /* ty=Tensor[(4096, 4096), float32] */ /* ty=Tensor[(4096, 4096), float32] */, units=4096) /* ty=Tensor[(1, 4096), float32] */;
  %51 = nn.bias_add(%50, meta[relay.Constant][29] /* ty=Tensor[(4096), float32] */ /* ty=Tensor[(4096), float32] */, axis=-1) /* ty=Tensor[(1, 4096), float32] */;
  %52 = nn.relu(%51) /* ty=Tensor[(1, 4096), float32] */;
  %53 = nn.dropout(%52) /* ty=(Tensor[(1, 4096), float32], Tensor[(1, 4096), float32]) */;
  %54 = %53.0;
  %55 = nn.dense(%54, meta[relay.Constant][30] /* ty=Tensor[(1000, 4096), float32] */ /* ty=Tensor[(1000, 4096), float32] */, units=1000) /* ty=Tensor[(1, 1000), float32] */;
  %56 = nn.bias_add(%55, meta[relay.Constant][31] /* ty=Tensor[(1000), float32] */ /* ty=Tensor[(1000), float32] */, axis=-1) /* ty=Tensor[(1, 1000), float32] */;
  nn.softmax(%56) /* ty=Tensor[(1, 1000), float32] */
}

And lastly, I iterate through the tasks gathered by AutoTVM and print their properties. The loop is roughly the following (a sketch, assuming kernels is the list returned by extract_from_program above):
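for i, tsk in enumerate(kernels):
    print("[Kernel %d/%d] %s - Search space size: %d"
          % (i, len(kernels) - 1, tsk.name, len(tsk.config_space)))
    print("     Output previous layer : ", tsk.args[0])
    print("     Input current layer   : ", tsk.args[1])

It prints: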

[Kernel 0/11] dense_nopack.x86 - Search space size: 208
     Output previous layer :  ('TENSOR', (1, 4096), 'float32')
     Input current layer   :  ('TENSOR', (1000, 4096), 'float32')
[Kernel 1/11] dense_nopack.x86 - Search space size: 169
     Output previous layer :  ('TENSOR', (1, 4096), 'float32')
     Input current layer   :  ('TENSOR', (4096, 4096), 'float32')
[Kernel 2/11] dense_nopack.x86 - Search space size: 390
     Output previous layer :  ('TENSOR', (1, 25088), 'float32')
     Input current layer   :  ('TENSOR', (4096, 25088), 'float32')
[Kernel 3/11] conv2d_NCHWc.x86 - Search space size: 800
     Output previous layer :  ('TENSOR', (1, 512, 14, 14), 'float32')
     Input current layer   :  ('TENSOR', (512, 512, 3, 3), 'float32')
[Kernel 4/11] conv2d_NCHWc.x86 - Search space size: 1200
     Output previous layer :  ('TENSOR', (1, 512, 28, 28), 'float32')
     Input current layer   :  ('TENSOR', (512, 512, 3, 3), 'float32')
[Kernel 5/11] conv2d_NCHWc.x86 - Search space size: 1080
     Output previous layer :  ('TENSOR', (1, 256, 28, 28), 'float32')
     Input current layer   :  ('TENSOR', (512, 256, 3, 3), 'float32')
[Kernel 6/11] conv2d_NCHWc.x86 - Search space size: 1296
     Output previous layer :  ('TENSOR', (1, 256, 56, 56), 'float32')
     Input current layer   :  ('TENSOR', (256, 256, 3, 3), 'float32')
[Kernel 7/11] conv2d_NCHWc.x86 - Search space size: 1152
     Output previous layer :  ('TENSOR', (1, 128, 56, 56), 'float32')
     Input current layer   :  ('TENSOR', (256, 128, 3, 3), 'float32')
[Kernel 8/11] conv2d_NCHWc.x86 - Search space size: 1152
     Output previous layer :  ('TENSOR', (1, 128, 112, 112), 'float32')
     Input current layer   :  ('TENSOR', (128, 128, 3, 3), 'float32')
[Kernel 9/11] conv2d_NCHWc.x86 - Search space size: 1008
     Output previous layer :  ('TENSOR', (1, 64, 112, 112), 'float32')
     Input current layer   :  ('TENSOR', (128, 64, 3, 3), 'float32')
[Kernel 10/11] conv2d_NCHWc.x86 - Search space size: 980
     Output previous layer :  ('TENSOR', (1, 64, 224, 224), 'float32')
     Input current layer   :  ('TENSOR', (64, 64, 3, 3), 'float32')
[Kernel 11/11] conv2d_NCHWc.x86 - Search space size: 280
     Output previous layer :  ('TENSOR', (1, 3, 224, 224), 'float32')
     Input current layer   :  ('TENSOR', (64, 3, 3, 3), 'float32')

Matching an AutoTVM task to the Relay IR is relatively straightforward. For example:

[Kernel 0/11] dense_nopack.x86 - Search space size: 208
     Output previous layer :  ('TENSOR', (1, 4096), 'float32')
     Input current layer   :  ('TENSOR', (1000, 4096), 'float32')

… matches …

  %55 = nn.dense(%54, meta[relay.Constant][30] /* ty=Tensor[(1000, 4096), float32] */ /* ty=Tensor[(1000, 4096), float32] */, units=1000) /* ty=Tensor[(1, 1000), float32] */;
  %56 = nn.bias_add(%55, meta[relay.Constant][31] /* ty=Tensor[(1000), float32] */ /* ty=Tensor[(1000), float32] */, axis=-1) /* ty=Tensor[(1, 1000), float32] */;
  nn.softmax(%56) /* ty=Tensor[(1, 1000), float32] */

But I can’t really associate these with the shapes of the nodes reported by the debug runtime. For dense layers it is eventually possible; for conv2ds I really don’t see it… Maybe @comaniac, do you see any way to match these? I’d really appreciate your help!
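One direction I am considering: reconstruct the conv2d output shape from the task arguments and compare it against the blocked NCHWc shape in the profile. A rough sketch (the helper names are mine; stride-1 / pad-1 3x3 convs are assumed, which holds for VGG-16):

def conv_out_nchw(task):
    # task.args[0] = ('TENSOR', (N, C, H, W), dtype)     -- data
    # task.args[1] = ('TENSOR', (CO, CI, KH, KW), dtype) -- weight
    _, (n, _, h, w), _ = task.args[0]
    _, (co, _, _, _), _ = task.args[1]
    # 'same' padding keeps H and W; in general, fold in strides/padding/dilation
    return (n, co, h, w)

def nchwc_to_nchw(shape):
    # (N, C_outer, H, W, c_inner) -> (N, C_outer * c_inner, H, W)
    n, c_outer, h, w, c_inner = shape
    return (n, c_outer * c_inner, h, w)

For instance, nchwc_to_nchw((1, 2, 224, 224, 32)) gives (1, 64, 224, 224), which equals conv_out_nchw of Kernel 11, so that task maps to fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7/_8. Kernels 10 and 11 still collide on output shape, so the preceding node in the graph would be needed to tell them apart. Conversely, one task can match several nodes (e.g. Kernel 3 covers the three 14x14 convs); those share the tuned config anyway, so the profile times of all matched nodes can simply be summed per task.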

Thank you & Best regards, Robert