Question in


In the How to optimize GEMM on CPU tutorial’s Array Packing section, there is a picture:


I think it is meant to transform matrix B so that it is stored by column instead of by row, but why does the tutorial reorder a [16][16] array into a [16/4][16][4] array? I don’t get it and would appreciate some help. Thank you!


This is done so that the memory layout better matches the access pattern. After vectorization, B is accessed four columns at a time, which does not match the original row-major layout: within a four-column block, consecutive rows sit a full row apart in memory. Note that this repacking is different from switching directly to column-major. Because the unit of vectorization is four elements, the access pattern touches four adjacent columns of a row (instead of just a single column) before moving to the next row. Packing B as [16/4][16][4] stores those elements contiguously, in exactly the order they are read.
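
To make the difference concrete, here is a minimal pure-Python sketch (not the tutorial's code) that repacks a flat row-major 16x16 B into the [16/4][16][4] order and compares the memory offsets touched by the vectorized access. The names `pack`, `access_order`, and `VEC` are made up for illustration, and the vector width of 4 is taken from the question.

```python
# Hypothetical sizes taken from the question: a 16x16 matrix B, vector width 4.
K, N, VEC = 16, 16, 4

# Row-major B as a flat list; element (k, j) lives at offset k * N + j.
# Each value equals its own offset, which makes the layouts easy to inspect.
B = [k * N + j for k in range(K) for j in range(N)]


def pack(B, K, N, vec):
    """Repack row-major B[K][N] into packed[N // vec][K][vec], flattened."""
    packed = []
    for jo in range(N // vec):      # block of `vec` adjacent columns
        for k in range(K):          # walk down the rows inside that block
            for ji in range(vec):   # the `vec` columns read as one vector
                packed.append(B[k * N + jo * vec + ji])
    return packed


def access_order(jo):
    """Row-major offsets read, in order, while processing column block `jo`."""
    return [k * N + jo * VEC + ji for k in range(K) for ji in range(VEC)]


packed = pack(B, K, N, VEC)

# Original layout: four contiguous elements, then a jump of a whole row.
original_offsets = access_order(0)

# Packed layout: the same elements sit at consecutive offsets.
packed_offsets = [packed.index(B[off]) for off in original_offsets]

# Plain column-major (offset j * K + k): the four columns of one row are
# K elements apart, so a single vector read would be strided.
col_major_offsets = [j * K + k for k in range(K) for j in range(VEC)]

print(original_offsets[:8])   # [0, 1, 2, 3, 16, 17, 18, 19]
print(packed_offsets[:8])     # [0, 1, 2, 3, 4, 5, 6, 7]
print(col_major_offsets[:4])  # [0, 16, 32, 48]
```

In the packed layout the whole walk over one column block is a single sequential sweep through memory, whereas plain column-major would scatter each four-element vector across four different columns.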