Why are reorder and try_unroll_vec needed on Skylake? Any details?

Hello! I see in the following tutorial:

it says that reorder and try_unroll_vec are needed for architectures like Skylake. Is there a detailed explanation of this decision? In this example, reorder and try_unroll_vec are applied to the local accumulation of a small block of output. Is that the only place these functions should be applied, or are there other situations where we should apply them?

Thanks in advance!

@vinx13 Can you take a look at this question? Thanks!

We usually try to unroll the innermost loop, and reorder can be applied to loops to improve locality. That is not the only place to use them, though: you can apply them elsewhere if you think it may improve performance. Making the search space larger usually helps, although it may increase search time. Maybe @FrozenGene can provide more details.
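
To make the two transformations concrete, here is a minimal plain-Python sketch of what reordering and unrolling do to a small accumulation loop nest. It is not TVM code, and the function names are hypothetical; it only illustrates the idea, assuming the output block length is a multiple of 4. After the reorder, the reduction loop is outermost, so each b[k] is loaded once and reused across the whole output block, and unrolling the inner loop over that block removes loop overhead and exposes independent multiply-adds.

```python
# Plain-Python illustration only: not TVM code, all names are hypothetical.

def accumulate_baseline(a, b):
    """out[i] = sum over k of a[i][k] * b[k], reduction loop innermost."""
    n_out, n_red = len(a), len(b)
    out = [0] * n_out
    for i in range(n_out):          # output block
        for k in range(n_red):      # reduction axis
            out[i] += a[i][k] * b[k]
    return out


def accumulate_reordered_unrolled(a, b):
    """Same result with the loops reordered to (k, i) and i unrolled by 4."""
    n_out, n_red = len(a), len(b)
    assert n_out % 4 == 0, "sketch assumes the block length is a multiple of 4"
    out = [0] * n_out
    for k in range(n_red):          # reduction axis is now the outer loop
        bk = b[k]                   # read once, reused for the whole block
        for i in range(0, n_out, 4):
            # Body manually unrolled 4 times; the four updates are independent.
            out[i] += a[i][k] * bk
            out[i + 1] += a[i + 1][k] * bk
            out[i + 2] += a[i + 2][k] * bk
            out[i + 3] += a[i + 3][k] * bk
    return out


if __name__ == "__main__":
    a = [[i + k for k in range(3)] for i in range(4)]
    b = [1, 2, 3]
    print(accumulate_baseline(a, b))
    print(accumulate_reordered_unrolled(a, b))
```

Both functions compute the same values; only the loop order and the amount of unrolling differ. Which order and unroll factor is fastest depends on the hardware, which is why it can be worth exposing these choices to the tuner instead of fixing them by hand.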