In CUDA, how can I limit the shared memory and the total number of threads a kernel uses?
My requirement when tuning a kernel is that it use at most N bytes of shared memory per block and at most M threads in total. The latter means every valid tuning configuration must satisfy A <= M, where A = gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z.