CUDA Injective schedule thread count is not always optimal


#1

In CUDA’s default injective schedule, the thread count is set to tvm.target.current_target(allow_none=False).max_num_threads. This leads to a very large thread count per block (1024 on the M60 GPU I am testing on), which is not optimal for all ops.
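
For context, this is roughly what the default schedule does today (a simplified sketch loosely based on topi’s CUDA injective schedule, not the exact code):

```python
import tvm

# Simplified sketch of the default CUDA injective schedule: fuse all output
# axes, split by the target's max_num_threads (1024 on the M60), and bind the
# outer/inner axes to blocks/threads.
def schedule_injective_sketch(outs):
    num_thread = tvm.target.current_target(allow_none=False).max_num_threads
    s = tvm.create_schedule([x.op for x in outs])
    for out in outs:
        fused = s[out].fuse(*s[out].op.axis)
        bx, tx = s[out].split(fused, factor=num_thread)
        s[out].bind(bx, tvm.thread_axis("blockIdx.x"))
        s[out].bind(tx, tvm.thread_axis("threadIdx.x"))
    return s
```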

For example, I am currently testing a softmax with input shape (10, 12, 512, 512), and found that using 64 threads per block saved over 10 ms.

On the flip side, I have found other ops that benefit from having this large thread count.

What do you think is the best way to resolve this? Can we auto-tune this value per op that uses the injective schedule? Should we allow ops to pass in their ideal thread count?


#2

I feel like the right approach would be to parameterize and auto-tune the thread count for ops that use the default injective schedule. Is there a good way to do this? Otherwise, we can just copy the injective schedule into softmax.py and do the auto-tuning there, but that’s less than ideal.
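
Roughly, what I have in mind is something like the sketch below: expose the split factor as an AutoTVM knob. This is only an illustration on a plain elementwise op (tunable_injective and the knob name num_thread are placeholders, not existing topi code):

```python
import tvm
from tvm import autotvm

# Rough sketch: an AutoTVM template where the threads-per-block factor of an
# injective (elementwise) op is a tunable knob instead of max_num_threads.
@autotvm.template
def tunable_injective(shape, dtype="float32"):
    data = tvm.placeholder(shape, name="data", dtype=dtype)
    out = tvm.compute(shape, lambda *i: tvm.exp(data(*i)), name="out")

    cfg = autotvm.get_config()
    # Keep 1024 in the search space so the current default is still reachable.
    cfg.define_knob("num_thread", [32, 64, 128, 256, 512, 1024])

    s = tvm.create_schedule(out.op)
    fused = s[out].fuse(*s[out].op.axis)
    bx, tx = s[out].split(fused, factor=cfg["num_thread"].val)
    s[out].bind(bx, tvm.thread_axis("blockIdx.x"))
    s[out].bind(tx, tvm.thread_axis("threadIdx.x"))
    return s, [data, out]
```

I think a task could then be created with autotvm.task.create(tunable_injective, args=((10, 12, 512, 512),), target="cuda") and tuned with the usual tuners, so the best num_thread for a given shape ends up in the cached config.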

@vinx13 @masahi do you have any thoughts?


#3

There is no deep reason why we use 1024 threads per block on CUDA. I’m +1 for making this number tunable, as long as the default stays 1024 (to avoid perf regressions).


#4

If the default injective schedule is slow for some particular ops, it would be good to have a tunable schedule.