[Solved][Relay]X86 target performance regression

Recently I noticed two performance regressions for x86 targets.

  1. After the Analyzer CanonicalSimplifier commit 7afbca5691fdb599cd90b043d5a5036e55cae2d6, execution time for gluoncv ssd on c5.9x goes up from 30 ms to 40 ms.
    Per-op profiling shows that the following fused op is much slower:
    fused_strided_slice_greater_cast_concatenate_ones_like_multiply_where_strided_sl_1860798206344026416_: 1.6386 ms -> 11.2585 ms

  2. It looks like tvm master disables avx. The target "llvm -mcpu=skylake-avx512" now performs about the same as the plain "llvm" target. This problem only happens with relay.
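One quick way to check the second point is to look at which vector registers show up in the generated assembly. The sketch below is only a text scan, not a TVM API: `count_vector_registers` is a hypothetical helper, the two sample strings are made-up instructions, and it assumes you can dump the assembly of the compiled module as text yourself (for example with something like `lib.get_source("asm")`, if your build supports it).

```python
import re
from collections import Counter


def count_vector_registers(asm_text):
    """Count xmm/ymm/zmm register mentions in an assembly listing.

    zmm registers imply AVX-512 code, ymm implies AVX/AVX2,
    and xmm-only output suggests plain SSE code generation.
    """
    counts = Counter({"xmm": 0, "ymm": 0, "zmm": 0})
    for kind in re.findall(r"\b([xyz]mm)\d+\b", asm_text):
        counts[kind] += 1
    return dict(counts)


# Made-up instruction snippets, only to exercise the helper.
sample_avx512 = "vfmadd231ps %zmm1, %zmm2, %zmm0\n"
sample_sse = "mulps %xmm1, %xmm0\naddps %xmm2, %xmm0\n"

if __name__ == "__main__":
    print(count_vector_registers(sample_avx512))
    print(count_vector_registers(sample_sse))
```

If the listing for "llvm -mcpu=skylake-avx512" contains no zmm (or ymm) registers at all, that would be consistent with avx not being used.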

@tqchen @apivovarov

Does the layout transformation still occur? Recently we have seen many issues related to alter_op_layout.

avx issue details - TVM relay does not use avx for model compilation

For the first problem, layout transform looks fine. The only issue is the fused op.
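To narrow a regression down to a single op like this, it can help to rank the per-op times from the two builds by slowdown ratio. This is a minimal sketch: `rank_slowdowns` is a hypothetical helper, the profile dicts are assumed to be collected by hand from the profiler output, and the second op with its timings is a placeholder rather than measured data.

```python
def rank_slowdowns(before_ms, after_ms):
    """Return (op, before, after, ratio) tuples, largest slowdown first.

    Ops missing from either profile, or with a non-positive baseline,
    are skipped because no meaningful ratio can be computed.
    """
    rows = []
    for op, before in before_ms.items():
        after = after_ms.get(op)
        if after is None or before <= 0:
            continue
        rows.append((op, before, after, after / before))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


FUSED_OP = ("fused_strided_slice_greater_cast_concatenate_ones_like_"
            "multiply_where_strided_sl_1860798206344026416_")

# First entry uses the timings reported in this thread; the second
# entry is a hypothetical placeholder, not a measured value.
before_profile = {FUSED_OP: 1.6386, "hypothetical_other_op": 2.0}
after_profile = {FUSED_OP: 11.2585, "hypothetical_other_op": 2.0}

if __name__ == "__main__":
    for op, before, after, ratio in rank_slowdowns(before_profile, after_profile):
        print(f"{ratio:6.2f}x  {before:.4f} ms -> {after:.4f} ms  {op}")
```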

@tqchen Do you think the Analyzer CanonicalSimplifier commit has anything to do with the slowdown of the fused op fused_strided_slice_greater_cast_concatenate_ones_like_multiply_where_strided_sl?

Thanks for confirming this. Please take a detailed look at the generated code and do a diff comparison of the low-level IR, since canonical simplify will generate different variations of index expressions.
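For the diff comparison, one option is to dump the lowered IR of the fused op before and after the commit as text and run a unified diff over the two dumps. The sketch below uses only Python's standard `difflib`; `diff_ir` is a hypothetical helper, and the two IR strings are invented stand-ins, not real TVM output.

```python
import difflib


def diff_ir(ir_before, ir_after, context=2):
    """Return a unified diff between two IR text dumps (empty if identical)."""
    lines = difflib.unified_diff(
        ir_before.splitlines(),
        ir_after.splitlines(),
        fromfile="before_commit",
        tofile="after_commit",
        n=context,
        lineterm="",
    )
    return "\n".join(lines)


# Invented stand-ins for the dumps taken before and after the commit.
ir_before_sample = "for (i, 0, 16) {\n  B[i] = A[((i*4)/4)]\n}"
ir_after_sample = "for (i, 0, 16) {\n  B[i] = A[i]\n}"

if __name__ == "__main__":
    print(diff_ir(ir_before_sample, ir_after_sample))
```

Only the lines whose index expressions changed should show up, which makes it easier to spot the expression that got slower.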

Canonical simplify itself makes index expressions simpler by eliminating div and mul where possible. So it usually improves speed, but it does change the generated code (and thus can cause a regression). Further investigation would be helpful.
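As an illustration of the kind of rewrite meant here (a generic identity, not taken from the commit itself): when 0 <= y < 4 and x >= 0, the pair `(x*4 + y) / 4` and `(x*4 + y) % 4` can be replaced by `x` and `y`, which removes the mul, div and mod. The function names below are hypothetical, and the check is a brute-force comparison over a small domain rather than a proof.

```python
def original_index(x, y):
    """Index pair as written before simplification."""
    flat = x * 4 + y
    return flat // 4, flat % 4


def simplified_index(x, y):
    """Same index pair after eliminating the mul, div and mod."""
    return x, y


def equivalent_on_domain(max_x):
    """Brute-force check over 0 <= x < max_x and 0 <= y < 4."""
    return all(
        original_index(x, y) == simplified_index(x, y)
        for x in range(max_x)
        for y in range(4)
    )


if __name__ == "__main__":
    print(equivalent_on_domain(64))
    # Outside the assumed range of y the rewrite is no longer valid.
    print(original_index(0, 5), simplified_index(0, 5))
```

The rewrite is only valid under the stated bounds, so a simplifier has to know the variable ranges, and a different but equivalent expression can still lead to different downstream code.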

Let us open a thread like https://github.com/dmlc/tvm/issues/3088. @kevinthesun can you look deeper into what is going on?

Sure. Let me open a GitHub issue to track this.