Auto-tuned model speed-up issue

After 15 days of auto-tuning, I got an auto-tuned model, but the speed is not satisfactory. Any advice? Can I modify the tuning parameters while using the best history log as a checkpoint and resume from it?

You can try the XGB search approach and enable transfer learning to see if that helps. See this tutorial for an example. Specifically, you could use the following logic to enable transfer learning before launching the tuner.

    import os
    from tvm import autotvm

    # Seed the tuner with the records from the previous run's log before tuning.
    if use_transfer_learning:
        if os.path.isfile(tmp_log_file):
            tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

The model was tuned with the XGB tuner. What I mean is: what tuning options should I set to search a larger space and get a better result? Thanks, man.

Unfortunately, there are no user APIs to adjust the tuning space if your ops are using TOPI schedule templates. On the other hand, I personally don’t think a larger space would achieve better performance, because the performance might be limited by the schedule template rather than the configs, especially since you have already spent 15 days. I would suggest first finding the performance bottleneck (e.g., which layer/op) and then seeing whether an alternative op can speed it up. For example, if the performance bottleneck is conv2d in NCHW layout, we can try to speed it up with the HWCN layout or the Winograd algorithm.

Thanks, man. I just followed the tutorials you gave to auto-tune, and my ops contain only conv2d/deconv2d/pool/batchnorm/relu. Two questions:

  1. How can I know whether I am using the TOPI schedule templates?
  2. How can I find the performance bottleneck? Any advice? Also, the tuned model is already using Winograd; how can I speed it up with HWCN?

Thanks, man.

  1. Basically, if you followed the tutorial to extract tasks from the program, then you are using TOPI schedules.

  2. You can refer to Profiling a TVM run on CPU and use the graph runtime debugger, which will show you the time breakdown of each op (see the sketch below).
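For reference, here is a minimal sketch of running a model under the graph runtime debugger, following that tutorial. The input name "data" and its shape are placeholders you would replace with your own, and newer TVM versions expose the same functionality as tvm.contrib.debugger.debug_executor:

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib.debugger import debug_runtime

    # Assumes `mod`, `params`, and `target` are the same objects used for tuning.
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)

    ctx = tvm.cpu(0)
    m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
    m.set_input(**params)
    m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
    m.run()  # prints a per-op time breakdown and dumps it under dump_root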

If you already use Winograd, then the benefit from HWCN might be limited, because the HWCN layout cannot be applied to Winograd. Anyway, you could identify the bottleneck first and report its op/shape/attributes, so that other folks can also share their thoughts.

Thanks, man. I will give it a try.

Hi, I tried your profiling code and got the output below. Any advice for improving the speed? The log is here: tvm_profiling.log. Could you please give me some advice? Thank you very much.

Sorting the results by Time (%), the most time-consuming ops (>2%) are the following:

    Op                                            Time (us)   Time (%)   Shape
    fused_nn_conv2d_add_nn_relu_7                  5114.48     13.07     (1, 194, 40, 40)
    fused_nn_conv2d_transpose_add_add_nn_relu      3399.83      8.688    (1, 76, 160, 160)
    fused_nn_conv2d_transpose_add_add_nn_relu_2    2816.88      7.198    (1, 304, 40, 40)
    fused_nn_conv2d_transpose_add_add_nn_relu_1    2786.62      7.121    (1, 152, 80, 80)
    fused_nn_conv2d_add_nn_relu_3                  2099.46      5.365    (1, 509, 20, 20)

So you can further tune those 5 tasks.
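A rough sketch of what re-tuning only those tasks could look like, assuming `selected_tasks` holds the matching tasks (how to find them is discussed further down the thread) and `log_file` is your existing tuning log, following the standard autotvm tutorial flow:

    from tvm import autotvm
    from tvm.autotvm.tuner import XGBTuner

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=1000),
    )

    for task in selected_tasks:
        tuner = XGBTuner(task, loss_type="rank")
        # Seed the tuner with the records from the previous 15-day run.
        tuner.load_history(autotvm.record.load_from_file(log_file))
        tuner.tune(
            n_trial=min(2000, len(task.config_space)),
            early_stopping=600,
            measure_option=measure_option,
            callbacks=[autotvm.callback.log_to_file(log_file)],
        )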

Thank you very much, I will give it a try.

Hi, @comaniac. How can I extract the selected tasks by name? I ran the following code to extract all tasks:

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params, ops=(relay.op.nn.conv2d,))

But when I print the name, it is always ‘topi_nn_conv2d’. So I have two questions:

  1. How can I extract a certain task by name, say, ‘fused_nn_conv2d_add_nn_relu_7’?
  2. When extracting tasks with autotvm.task.extract_from_program, what does the ops argument mean? If my network contains deconvolution, should I add it to the ops argument?

Thanks, man.

  1. There won’t be an op named “fused…”. Fusion happens when building the model, while auto-tuning only considers one op at a time. Thus what you have seen is correct: their task names are all “topi_nn_conv2d”.

  2. The arguments are the attributes of that op. For example, the arguments of conv2d are:

  • Input data
  • Input weight
  • Strides
  • Padding
  • Dilation
  • Data layout
  • Data type

Different ops require different arguments when creating tasks. As for your question, deconvolution is another op, so it will have a different task name, and so will its arguments. In fact, you can print out that information after extracting tasks, e.g. print(task.args).
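A small sketch of that, reusing the same `mod`, `params`, and `target` as in the extract_from_program call above and adding conv2d_transpose so deconvolution layers also get tasks (whether deconvolution is covered by the ops list may depend on your TVM version):

    from tvm import autotvm, relay

    tasks = autotvm.task.extract_from_program(
        mod["main"],
        target=target,
        params=params,
        ops=(relay.op.nn.conv2d, relay.op.nn.conv2d_transpose),
    )
    for i, task in enumerate(tasks):
        print(i, task.name)    # conv2d and deconv tasks get different names
        print("  ", task.args)  # e.g. ('TENSOR', (1, 194, 40, 40), 'float32'), strides, padding, ...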

So how can I auto-tune only the “fused…” tasks? How do I pick out the most time-consuming ops shown in the profiling log?

What you need to focus on is the shape and attributes of the op. For example, fused_nn_conv2d_add_nn_relu_7 5114.48 13.07 (1, 194, 40, 40) means the input shape is (1, 194, 40, 40), so you go back to the topi_nn_conv2d tasks and see which one has this shape.
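For example, a hypothetical helper along these lines could filter the extracted tasks by the input shape reported in the profile; it assumes the first task argument is the input tensor in the form ('TENSOR', shape, dtype), which is how the TOPI conv2d tasks are typically laid out:

    def find_tasks_by_input_shape(tasks, shape):
        """Return the tasks whose input tensor shape matches `shape`."""
        matched = []
        for task in tasks:
            data = task.args[0]  # typically ('TENSOR', (N, C, H, W), dtype)
            if isinstance(data, tuple) and len(data) == 3 and tuple(data[1]) == tuple(shape):
                matched.append(task)
        return matched

    # e.g. the biggest entry in the profile above
    bottleneck_tasks = find_tasks_by_input_shape(tasks, (1, 194, 40, 40))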

OK, so I need to match the task by its shape. Thanks, I will give it a try.

But what if two ops have the same shape? How can I tell them apart? I think the ops should have a name attribute.

If ops have the same shape and parameters, then the tuned schedule is the same, which means there will be only one task with that specific shape and parameters.

Hi, please check the log I posted above. It is true that many ops have the same shape; I cannot tell which task an op belongs to just by its shape. Please help me. Thank you very much.

Hi @comaniac, I have two questions about this issue.

  1. I found some tasks with the same shape; how can I tell them apart?
  2. I found that there are 106 tasks when extracting tasks, but the graph_runtime_debug.cc:93 log reports 158 ops, and there are 177 ops in the time-statistics table. What do these three different numbers stand for?

Thank you very much.
Best,
Bin

Compare the workloads; every task’s workload is unique.
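In other words, even if two tasks share the same input shape, their workload tuples still differ in the remaining fields (weight shape, strides, padding, layout, dtype). A quick way to see this, assuming `tasks` comes from the extraction sketch above:

    for task in tasks:
        # workload = (task name, input tensor, weight tensor, strides, padding, ...)
        print(task.workload)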