How to choose the correct target for a CPU?

Here is my CPU information:

If I compile a model or do auto-tuning, how to choose the proper target?

Before tuning, I ran some compilation experiments. With the target set to “llvm -mcpu=broadwell”, inference was faster than with “llvm -mcpu=haswell”.

Replace “llvm” with the correct target for your CPU. For example, for an AWS EC2 c5 instance with an Intel Xeon Platinum 8000-series CPU, the target should be “llvm -mcpu=skylake-avx512”. For an AWS EC2 c4 instance with an Intel Xeon E5-2666 v3, it should be “llvm -mcpu=core-avx2”.
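If you are unsure which generation your CPU belongs to, one rough heuristic is to map its feature flags (from /proc/cpuinfo on Linux) to an LLVM -mcpu value. The helper below is a hypothetical sketch, not a TVM or LLVM API; flag detection only narrows the family, so verify the final choice against LLVM's supported CPU list (`llc -mcpu=help`).

```python
def suggest_mcpu(flags):
    """Rough guess of an LLVM -mcpu value from x86 feature flags.

    `flags` is a set of lowercase flag names as printed in /proc/cpuinfo.
    This only narrows the family; confirm with `llc -mcpu=help`.
    """
    if 'avx512f' in flags:
        return 'skylake-avx512'  # AVX-512 capable Xeons (e.g. c5 instances)
    if 'avx2' in flags:
        return 'core-avx2'       # Haswell/Broadwell class (e.g. c4 instances)
    if 'avx' in flags:
        return 'corei7-avx'      # Sandy Bridge / Ivy Bridge class
    return 'x86-64'              # generic baseline ('x86_64' is not a valid -mcpu)

def cpuinfo_flags(path='/proc/cpuinfo'):
    """Read the feature-flag set from /proc/cpuinfo (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith('flags'):
                return set(line.split(':', 1)[1].split())
    return set()
```

For example, `suggest_mcpu(cpuinfo_flags())` on a Broadwell machine should return “core-avx2”, which is consistent with choosing a Broadwell/Haswell-class target.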

I tried “llvm -mcpu=x86_64”, but it reported that ‘x86_64’ cannot be recognized.

Normally we look up the CPU model online to find its generation on the official product pages.

According to this page, yours is a Broadwell-series CPU, so I think your choice of “llvm -mcpu=broadwell” is the right way to go.


Thanks for your reply. Yesterday I ran tuning following the tune_relay_x86 tutorial, setting target = tvm.target.create("llvm -mcpu=broadwell"), but the inference time was longer than before tuning. It reported the warnings below:

……
Cannot find config for target=llvm -device=tracing, workload=('conv2d', (1, 128, 56, 56, 'float32'), (128, 128, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=tracing, workload=('conv2d', (1, 128, 28, 28, 'float32'), (256, 128, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=tracing, workload=('conv2d', (1, 256, 28, 28, 'float32'), (256, 256, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=tracing, workload=('conv2d', (1, 256, 14, 14, 'float32'), (512, 256, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=tracing, workload=('conv2d', (1, 512, 14, 14, 'float32'), (512, 512, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
2019-07-26 02:12:28,616 INFO Start to benchmark layout transformation...
2019-07-26 02:19:41,903 INFO Benchmarking layout transformation successful.
2019-07-26 02:19:41,926 INFO Start to run dynamic programming algorithm...
2019-07-26 02:19:41,926 INFO Start forward pass...
2019-07-26 02:19:42,538 INFO Finished forward pass.
2019-07-26 02:19:42,538 INFO Start backward pass...
2019-07-26 02:19:42,540 INFO Finished backward pass...
2019-07-26 02:19:42,540 INFO Finished DPExecutor run.
2019-07-26 02:19:42,542 INFO Writing optimal schedules to mxnet-r50_cpu_graph_opt.log successfully.
Compile...
Config for target=llvm -mcpu=broadwell, workload=('dense', (1, 25088, 'float32'), (512, 25088, 'float32'), 0, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.

How can I solve it?

What is the execution time difference between before/after tuning?

Before tuning, it’s about 100 ms on my CPU. After compiling, it’s about 50 ms. But after tuning, it’s longer than 100 ms.

So default schedules give 100 ms latency, but graph-tuned schedules give more than 100 ms. The fallback configuration is the same as using default schedules, so it should have the same performance as before tuning. You can use the debug runtime to see which operators cause the gap.

Did you follow the tuning options in the tutorial? Can you share your tuning_option?

Thanks for the reply. I used the default tuning options from the tutorial. Any suggestions?

tuning_option = {
    'log_filename': log_file,
    'tuner': 'random',
    'early_stopping': None,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1,
                                   min_repeat_ms=1000),
    ),
}

You can try setting min_repeat_ms=4000. You can also use the debug runtime to see which operator takes longer than expected compared to the default schedule.
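Concretely, that suggestion is a one-line change to the tuning_option shared above; a longer min_repeat_ms makes each measurement run longer, which tends to give more stable timings on CPU. A minimal sketch of the adjusted configuration, assuming the same tutorial setup (log_file and autotvm come from the tutorial's imports):

```python
# Same tuning_option as above, with min_repeat_ms raised from 1000 to 4000
# for more stable measurements; all other fields unchanged.
tuning_option = {
    'log_filename': log_file,
    'tuner': 'random',
    'early_stopping': None,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1,
                                   min_repeat_ms=4000),  # was 1000
    ),
}
```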