Where does cache_read with "local" scope copy data to?


On CUDA, what’s the benefit of using cache_read to load data into local memory? I assumed that accessing local memory is as slow as accessing global memory, because both reside in off-chip DRAM.

(This is a clarification question I asked privately. I am copying the question and the answer I got to the forum so that others can benefit from them.)


The answer I got:

“local roughly means register, because compiler will lift it.”


We know that if the size of the local memory is known at compile time and it is small enough to fit in registers, nvcc is able to lift it to registers. But if the local memory is large, or its size cannot be determined at compile time, it may spill to cache or off-chip DRAM.
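To make the point above concrete, here is a minimal hand-written CUDA sketch. It is not code generated by TVM. The names `TILE`, `sum_static`, `pick_dynamic`, and `run_demo` are made up for illustration. The first kernel uses a small thread-private array with fully unrolled, compile-time indices, which nvcc can usually keep in registers. The second reads the same kind of array with a runtime index, which is an additional, commonly cited reason for an array to stay in local memory. What the compiler actually does depends on the nvcc version and the target architecture.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 4;  // hypothetical tile size: small and known at compile time

// Thread-private buffer with compile-time indices after unrolling.
// nvcc can usually keep `buf` entirely in registers.
__global__ void sum_static(const float* in, float* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float buf[TILE];
    #pragma unroll
    for (int i = 0; i < TILE; ++i) buf[i] = in[t * TILE + i];
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < TILE; ++i) acc += buf[i];
    out[t] = acc;
}

// Same buffer, but read with an index known only at run time.
// Registers cannot be indexed dynamically, so `buf` may be placed in
// local memory (backed by L1/L2 and, ultimately, off-chip DRAM).
__global__ void pick_dynamic(const float* in, const int* idx, float* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float buf[TILE];
    for (int i = 0; i < TILE; ++i) buf[i] = in[t * TILE + i];
    out[t] = buf[idx[t] & (TILE - 1)];  // mask keeps the index in range
}

// Host helper: runs both kernels on one block and checks the results.
bool run_demo() {
    constexpr int threads = 32;
    constexpr int n = threads * TILE;
    float h_in[n], h_sum[threads], h_pick[threads];
    int h_idx[threads];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
    for (int t = 0; t < threads; ++t) h_idx[t] = t % TILE;

    float *d_in = nullptr, *d_sum = nullptr, *d_pick = nullptr;
    int* d_idx = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_sum, sizeof(h_sum));
    cudaMalloc(&d_pick, sizeof(h_pick));
    cudaMalloc(&d_idx, sizeof(h_idx));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx, sizeof(h_idx), cudaMemcpyHostToDevice);

    sum_static<<<1, threads>>>(d_in, d_sum);
    pick_dynamic<<<1, threads>>>(d_in, d_idx, d_pick);

    // Blocking copies; they also wait for the kernels to finish.
    cudaMemcpy(h_sum, d_sum, sizeof(h_sum), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_pick, d_pick, sizeof(h_pick), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);

    for (int t = 0; t < threads; ++t) {
        float expected = 0.0f;
        for (int i = 0; i < TILE; ++i) expected += h_in[t * TILE + i];
        ok = ok && (h_sum[t] == expected);
        ok = ok && (h_pick[t] == h_in[t * TILE + h_idx[t]]);
    }

    cudaFree(d_in);
    cudaFree(d_sum);
    cudaFree(d_pick);
    cudaFree(d_idx);
    return ok;
}
```

To try it, call `run_demo()` from a `main` function. Compiling with `nvcc -Xptxas -v` prints per-kernel register counts together with stack frame and spill sizes, which is one way to check whether a given local array ended up in registers or in local memory.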