Global barrier implementation on CUDA

A global barrier is a nice-to-have feature on CUDA devices, and it is particularly useful for persistent RNNs. However, a global barrier violates the CUDA programming model prior to CUDA 9.0: CUDA does not guarantee forward progress across multiple Cooperative Thread Arrays (CTAs).

This was also mentioned by @tqchen in a comment on PR #362:

> A global barrier is usually a feature that is not supported by GPUs (NVIDIA added that support recently). It means the ability to sync all launched threads. This is used in the persistent threading model, where all threads launch and communicate through the shared global memory.
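For context, here is a minimal sketch of the kind of software global barrier that persistent kernels typically rely on (the names are illustrative, not from any particular codebase). This is exactly the pattern that is unsafe without a forward-progress guarantee: if more blocks are launched than the device can keep resident, the resident blocks spin forever waiting for blocks that are never scheduled.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch of a naive software global barrier (single-use; the
// counter must be zero-initialized before launch). UNSAFE before CUDA 9:
// nothing guarantees that the blocks being waited on ever become resident.
__device__ void naive_global_barrier(volatile unsigned int *arrived,
                                     unsigned int num_blocks) {
    __threadfence();                 // make this block's prior writes visible grid-wide
    __syncthreads();                 // every thread in the block reaches the barrier
    if (threadIdx.x == 0) {
        atomicAdd((unsigned int *)arrived, 1u);      // signal this block's arrival
        while (*arrived < num_blocks) { /* spin */ } // wait for all other blocks
    }
    __syncthreads();                 // release the remaining threads in the block
}
```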

I glanced through Baidu's Persistent RNN implementation. Their approach appears to be: spin for a while, kill the kernel if no progress is made, and re-read the data upon being re-launched.
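A rough sketch of that timeout pattern as I understand it (this is my reading of the approach, not their actual code): bound the spin, flag failure on timeout, and bail out so the host can re-launch and re-read the inputs.

```cuda
// Hypothetical sketch of a barrier with a spin timeout. On timeout the kernel
// exits with a failure flag set; the host re-launches it and data is re-read.
__device__ bool barrier_with_timeout(volatile unsigned int *arrived,
                                     unsigned int num_blocks,
                                     unsigned long long max_spins,
                                     volatile int *failed) {
    __syncthreads();
    if (threadIdx.x == 0) {
        atomicAdd((unsigned int *)arrived, 1u);
        unsigned long long spins = 0;
        while (*arrived < num_blocks && *failed == 0) {
            if (++spins > max_spins) { *failed = 1; break; } // give up: no progress
        }
    }
    __syncthreads();
    return *failed == 0;   // if false, the caller returns and the host re-launches
}
```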

So is the current support for a global barrier actually correct? Is there any plan to support this feature using CUDA 9's cooperative groups?
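For reference, a minimal sketch of what the CUDA 9 path looks like with cooperative groups (the kernel name and phases here are placeholders). The kernel must be launched via `cudaLaunchCooperativeKernel`, and the grid must fit co-resident on the device, which is what makes `grid.sync()` a valid grid-wide barrier:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void persistent_kernel(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    // ... phase 1: each block writes partial results to global memory ...
    grid.sync();  // grid-wide barrier: all blocks are guaranteed co-resident
    // ... phase 2: blocks may now read results written by other blocks ...
}

// Host side: cooperative launch is required for grid.sync() to be legal.
// Grid size must not exceed device occupancy (see
// cudaOccupancyMaxActiveBlocksPerMultiprocessor).
void launch(float *d_data, int n, dim3 grid_dim, dim3 block_dim) {
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)persistent_kernel,
                                grid_dim, block_dim, args, 0, 0);
}
```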

Thanks!