In CUDA’s default injective schedule, the number of threads per block is set to `tvm.target.current_target(allow_none=False).max_num_threads`. That makes for a very large thread count per block (1024 on the M60 GPU I am testing on), which is not optimal for every op.
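For reference, my understanding is that the relevant part of the default schedule boils down to something like this (a sketch, not the exact topi source):

```python
# Sketch of the default CUDA injective schedule as I understand it;
# the split factor is always taken from the target, never from the op.
num_thread = tvm.target.current_target(allow_none=False).max_num_threads
fused = s[out].fuse(*s[out].op.axis)             # flatten all output axes
bx, tx = s[out].split(fused, factor=num_thread)  # factor fixed at max_num_threads
s[out].bind(bx, tvm.thread_axis("blockIdx.x"))
s[out].bind(tx, tvm.thread_axis("threadIdx.x"))
```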
For example, while testing a softmax with input shape (10, 12, 512, 512), I found that dropping to 64 threads per block saved over 10 ms.
On the flip side, I have found other ops that benefit from having this large thread count.
What do you think is the best way to resolve this? Can we auto-tune this value for each op that uses the injective schedule? Should we allow ops to pass in their ideal thread count?
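To make the second option concrete, here is a minimal sketch of what a parameterized injective schedule could look like. The `num_thread` argument is hypothetical (not part of the current topi signature); when it is omitted, everything keeps the current default behavior:

```python
import tvm

def schedule_injective(outs, num_thread=None):
    """Injective schedule with an overridable threads-per-block count.

    num_thread is a hypothetical parameter: ops with a known sweet spot
    (e.g. my softmax case above) could pass it explicitly, while all
    other ops keep the current max_num_threads default.
    """
    if num_thread is None:
        num_thread = tvm.target.current_target(allow_none=False).max_num_threads
    outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
    s = tvm.create_schedule([x.op for x in outs])
    for out in outs:
        fused = s[out].fuse(*s[out].op.axis)
        bx, tx = s[out].split(fused, factor=num_thread)
        s[out].bind(bx, tvm.thread_axis("blockIdx.x"))
        s[out].bind(tx, tvm.thread_axis("threadIdx.x"))
    return s
```

For the auto-tuning option, I imagine the same split factor could become an AutoTVM knob instead (something like `cfg.define_knob("num_thread", [32, 64, 128, 256, 512, 1024])` feeding the `factor=` above), at the cost of tuning time for what is otherwise a trivial schedule.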