Hello! I see in the following tutorial:
it says that
try_unroll_vec is needed for architectures like Skylake. Is there a detailed explanation of this decision anywhere? In this example,
try_unroll_vec is applied to the local accumulation of a small block of output. Is this the only place where it should be applied, or are there other situations where we should use it?
Thanks in advance!