Expressing SGPR Use for AMD GPU

(migrating issue #1021 here) @adityaatluri

To summarize, it looks like the MIOpen implementation of the first layer of ResNet (7x7 conv) uses SGPRs plus vector operations to batch efficiently, sharing filter weights in SGPRs across a workgroup.

In OpenCL this is done with the __constant qualifier, but that does not currently seem to be supported in TVM (for the ROCm backend). Most recently we were wondering whether this could be supported using an LLVM memory scope (which seems to cover the __constant case).
Any other ideas on how to support this case? Note that this technique is used for batched inference, not batch_size=1.

__constant does not make the compiler place values in scalar registers. After talking to the OpenCL compiler team, the correct qualifier is __uniform, but it is not part of the OpenCL spec. I recommend using inline asm, something like:

asm volatile(
    "s_mov_b32 %0, 0"
    : "=s"(var));
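For anyone who wants to try this, here is a minimal sketch of wrapping the inline asm above so that the same source still builds when it is not compiled for an AMD GPU. It assumes the compiler defines the `__AMDGCN__` macro when targeting AMD GCN (as Clang does for device code). The helper name `scalar_zero` is hypothetical, and the non-GPU branch is only a plain fallback, not an SGPR.

```cpp
#include <cstdint>

// Hypothetical helper (name is illustrative only).
// On AMD GCN targets it asks the compiler, via the "s" constraint,
// to materialize the value in a scalar register (SGPR).
// On any other target it falls back to an ordinary assignment.
static inline uint32_t scalar_zero() {
    uint32_t var;
#if defined(__AMDGCN__)
    // Assumption: compiled as AMD GPU device code by Clang.
    asm volatile("s_mov_b32 %0, 0" : "=s"(var));
#else
    // Host / non-AMD fallback: no scalar register involved.
    var = 0;
#endif
    return var;
}
```

This only covers writing a constant into an SGPR. Sharing filter weights across a workgroup would additionally need the loads themselves to be scalar, which this sketch does not attempt.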