CUDA Injective schedule thread count is not always optimal


In CUDA’s default injective schedule, the thread count is set to the target’s `max_num_threads`. This leads to a huge thread count per block (1024 in the case of the M60 GPU I am testing on), which is not optimal for all ops.
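For reference, here is a simplified sketch of what the default schedule does (paraphrased from `topi/cuda/injective.py` as I understand it; exact details may differ across TVM versions, and it assumes a CUDA target is in scope):

```python
import tvm
from tvm import te

def schedule_injective_sketch(out):
    """Simplified sketch of TVM's default CUDA injective schedule."""
    sch = te.create_schedule(out.op)
    # Fuse every axis of the output into one flat loop.
    fused = sch[out].fuse(*sch[out].op.axis)
    # The thread count comes straight from the hardware limit of the
    # current target (1024 on an M60); this is the value in question.
    num_thread = tvm.target.Target.current(allow_none=False).max_num_threads
    bx, tx = sch[out].split(fused, factor=num_thread)
    sch[out].bind(bx, te.thread_axis("blockIdx.x"))
    sch[out].bind(tx, te.thread_axis("threadIdx.x"))
    return sch
```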

For example, I am currently testing a softmax with input shape (10, 12, 512, 512), and found that using 64 threads per block saved over 10 ms relative to the default.
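For anyone who wants to reproduce this kind of comparison, here is roughly how I measure it, using a simple elementwise op as a stand-in (the function name, the stand-in op, and the `number=100` run count are just illustrative):

```python
import numpy as np
import tvm
from tvm import te

def time_injective(num_thread, shape=(10, 12, 512, 512)):
    """Time a simple elementwise op scheduled with a given thread count."""
    data = te.placeholder(shape, name="data", dtype="float32")
    out = te.compute(shape, lambda *i: te.exp(data(*i)), name="out")
    sch = te.create_schedule(out.op)

    # Same structure as the injective schedule, but with an explicit
    # threads-per-block instead of the target's max_num_threads.
    fused = sch[out].fuse(*sch[out].op.axis)
    bx, tx = sch[out].split(fused, factor=num_thread)
    sch[out].bind(bx, te.thread_axis("blockIdx.x"))
    sch[out].bind(tx, te.thread_axis("threadIdx.x"))

    func = tvm.build(sch, [data, out], target="cuda")
    dev = tvm.cuda(0)
    a = tvm.nd.array(np.random.uniform(size=shape).astype("float32"), dev)
    b = tvm.nd.empty(shape, "float32", dev)
    return func.time_evaluator(func.entry_name, dev, number=100)(a, b).mean

# Compare, e.g.: time_injective(64) vs. time_injective(1024)
```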

On the flip side, I have found other ops that benefit from having this large thread count.

What do you think is the best way to resolve this? Can we auto-tune this value per op that uses the injective schedule? Should we allow ops to pass in their ideal thread count?


I feel like the right approach would be to parameterize and auto-tune the thread count for ops that use the default injective schedule. Is there a good way to do this? Otherwise, we can just copy the injective schedule into each op that needs tuning and do the auto-tuning there, but that’s less than ideal.
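To make the idea concrete, here is a minimal sketch of what this could look like as an AutoTVM template (the `injective_tunable` name, the stand-in compute, and the candidate thread counts are all hypothetical):

```python
from tvm import autotvm, te

@autotvm.template("injective_tunable")  # hypothetical template name
def injective_tunable(shape, dtype):
    data = te.placeholder(shape, name="data", dtype=dtype)
    out = te.compute(shape, lambda *i: data(*i) * 2, name="out")
    sch = te.create_schedule(out.op)

    # Expose threads-per-block as a tunable knob instead of hard-coding
    # the target's max_num_threads; keeping 1024 in the list preserves
    # the current default as a candidate.
    cfg = autotvm.get_config()
    cfg.define_knob("num_thread", [32, 64, 128, 256, 512, 1024])

    fused = sch[out].fuse(*sch[out].op.axis)
    bx, tx = sch[out].split(fused, factor=cfg["num_thread"].val)
    sch[out].bind(bx, te.thread_axis("blockIdx.x"))
    sch[out].bind(tx, te.thread_axis("threadIdx.x"))
    return sch, [data, out]
```

Ideally the knob would live in the shared injective schedule itself so every op picks it up, rather than in a one-off template like this.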

@vinx13 @masahi do you have any thoughts?


There is no deep reason why we use 1024 threads per block on CUDA. I’m +1 for making this number tunable, as long as the default is 1024 (to avoid perf regressions).


If the default injective schedule is slow for some particular ops, it would be good to have a tunable schedule.