On CUDA, what’s the benefit of using cache_read to load something into local memory? I assume accessing local memory is as slow as accessing global memory, because both of them reside in off-chip DRAM.
(I copy-pasted the question I asked privately, and the answer I got, to the forum so others could benefit from it, because it is a clarification question.)