@tqchen Following up from https://github.com/apache/incubator-tvm/pull/4270: the following gist is the example that motivated it. The error I get from it is “TVMError: Not all Vars are passed in api_args: ‘threadIdx.x’ does not appear in api_args”.
There seem to be situations where bound inference “passes back” a domain of [threadIdx.x, 1], which causes issues when the attr statement comes later in the IR. This can actually show up as two errors: the first is the one above, and if that is silenced, a use-before-define error pops up later.
The gist has the basic structure of a high-performing GEMM. We want to do vector loads from the input matrices as we reduce along K, reduce each vector, then do a thread all-reduce. The other thing I haven’t been able to get working is caching the block tile in shared memory before starting the computation. There is quite a bit of opportunity for data reuse, so it is a waste not to be able to cache the tile directly in shared memory. See the last few cache_read statements in the gist.
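To make the reduction structure concrete, here is a rough plain-Python emulation of one output element. It is not TVM code and not the gist itself. The names `NUM_THREADS`, `VEC`, `K`, and `dot_thread_allreduce` are made up for illustration, the sizes are arbitrary, and the tree reduction is just one possible way to combine the per-thread partial sums.

```python
# Hypothetical sizes, chosen only for illustration.
NUM_THREADS = 8   # threads cooperating on one output (the threadIdx.x extent)
VEC = 4           # elements fetched per vector load
K = 64            # reduction length, a multiple of NUM_THREADS * VEC


def dot_thread_allreduce(a_row, b_col):
    """Dot product of one row of A with one column of B, split across threads."""
    assert len(a_row) == len(b_col) == K
    assert K % (NUM_THREADS * VEC) == 0

    partial = [0.0] * NUM_THREADS  # one accumulator per thread
    for tx in range(NUM_THREADS):  # on a GPU these iterations run in parallel
        # Each thread walks along K in strides of NUM_THREADS * VEC.
        for k0 in range(tx * VEC, K, NUM_THREADS * VEC):
            a_vec = a_row[k0:k0 + VEC]  # stands in for a vector load
            b_vec = b_col[k0:k0 + VEC]
            # Reduce the loaded vector into the thread-local accumulator.
            partial[tx] += sum(x * y for x, y in zip(a_vec, b_vec))

    # Cross-thread tree reduction (NUM_THREADS is assumed to be a power of two).
    # In a real all-reduce every thread would then read the combined value.
    width = NUM_THREADS
    while width > 1:
        width //= 2
        for tx in range(width):
            partial[tx] += partial[tx + width]
    return partial[0]
```

In the real kernel the outer `tx` loop would be bound to threadIdx.x, and the slices of `a_row` and `b_col` would ideally come from a block tile cached in shared memory, which is the part I have not been able to express.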