TL;DR. The data could actually fit into shared memory if each block held only 1/k of it, where k is the number of blocks. However, a cache_read followed by split + bind to blockIdx.x results in k times more shared memory being allocated than needed.
Background. In CUDA, blocks often cooperatively load slices of data into their own shared memory; this is especially useful when the blocks' workloads are independent of each other. So, in TVM, we may want to load a shard (a slice) of an array into on-chip memory, instead of the whole array. This is (imho) a critical step when constructing a persistent RNN (link).
Why not load it all. Shared memory is scarce (96 KB per SM on V100). Even a square FP32 weight matrix of 256 * 256 takes 256 * 256 * 4 bytes = 256 KB, which does not fit and results in a runtime error.
Why not compute_at. This causes an unacceptable amount of redundant memory traffic. For a plain GEMM it is good practice to place the cache_read where the data is actually needed, using compute_at. In a persistent RNN, however, the GEMM is invoked on every time step; if we use compute_at to move the load into the GEMM computation, the same data is re-read from global memory on every step, wasting memory bandwidth. A rough count below makes this concrete.
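Here is a back-of-the-envelope count (plain Python, not TVM; the sizes m, n, k_blocks, and seq_len are made up for illustration) of how many elements each placement fetches from global memory:

# Toy cost model, not TVM code; all sizes are made up for illustration.
m, n, k_blocks, seq_len = 1024, 256, 8, 100
rows_per_block = m // k_blocks

# Persistent style: each block loads its 1/k shard of w into shared memory once
# and reuses it across all time steps.
loads_persistent = k_blocks * rows_per_block * n                 # 262,144 elements

# compute_at style: the load sits under the GEMM of every time step, so the
# same shard is fetched from global memory again on each step.
loads_compute_at = seq_len * k_blocks * rows_per_block * n       # 26,214,400 elements

print(loads_compute_at / loads_persistent)                       # 100.0 == seq_len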
Example. Assume m, n, k are known constants, so that TVM can allocate static memory. Let w be an (m * n) array, and let there be k blocks on the GPU. WLOG, say m is divisible by k. In block i, we want to load only w[m / k * i : m / k * (i + 1), :] into that block's on-chip memory.
### Python code
wS = s.cache_read(w, "shared", readers=readers)   # stage w into shared memory
ax0, ax1 = s[wS].op.axis                          # axes of the cached copy (renamed to avoid shadowing the constant m)
bx, _ = s[wS].split(ax0, nparts=num_sm)           # num_sm outer pieces, one per block
s[wS].bind(bx, tvm.thread_axis("blockIdx.x"))     # bind the outer piece to blockIdx.x
### However, it translates to:
allocate w.shared[float32 * m * n]
### Instead, what we actually need is:
allocate w.shared[float32 * (m / k) * n] # k is the number of SMs
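For completeness, here is a minimal self-contained sketch that reproduces the allocation above, in case the fragment is ambiguous. It assumes the older top-level tvm API (tvm.placeholder / tvm.create_schedule / tvm.thread_axis); the consumer out, the sizes, and num_sm are placeholders I made up for this repro, not my actual kernel:

import tvm

# Made-up sizes for the repro; in the real kernel these come from the RNN.
m, n, num_sm = 1024, 256, 8

w = tvm.placeholder((m, n), name="w", dtype="float32")
out = tvm.compute((m, n), lambda i, j: w[i, j] * 2.0, name="out")   # stand-in consumer

s = tvm.create_schedule(out.op)
wS = s.cache_read(w, "shared", [out])

# Bind the consumer so each block handles 1/num_sm of the rows.
io, ii = s[out].split(s[out].op.axis[0], nparts=num_sm)
s[out].bind(io, tvm.thread_axis("blockIdx.x"))
s[out].bind(s[out].op.axis[1], tvm.thread_axis("threadIdx.x"))

# Same split + bind on the cache stage, as in the fragment above.
bx, _ = s[wS].split(s[wS].op.axis[0], nparts=num_sm)
s[wS].bind(bx, tvm.thread_axis("blockIdx.x"))

# In my runs the lowered IR contains allocate w.shared[m * n],
# rather than the (m / num_sm) * n per-block slice I expected.
print(tvm.lower(s, [w, out], simple_mode=True))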
I am relatively new to the scheduling sub-language in TVM. Any suggestions? Thanks in advance!