When we use the graph runtime to profile each op on a CUDA target, the reported execution time is sometimes inaccurate because CUDA kernel launches are asynchronous. Is it possible to integrate something like cudaDeviceSynchronize into the debug module?
Thanks!
@Laurawly
It seems that graph_runtime_debug.cc already solves the synchronization issue: it adds a device synchronize call.
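
For anyone wondering why the synchronize call matters for per-op timing, here is a rough sketch in plain Python. It does not use TVM or CUDA: `FakeDevice`, `launch`, `synchronize`, and `time_op` are hypothetical names made up for illustration, and a background thread stands in for the GPU. The assumption is only that a launch returns before the work finishes, which is how asynchronous kernel launches behave.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDevice:
    """Hypothetical stand-in for a GPU: queued work runs asynchronously."""

    def __init__(self):
        # A single worker runs queued work in order, loosely like one stream.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def launch(self, seconds):
        # Returns immediately, like an asynchronous kernel launch.
        self._pending.append(self._pool.submit(time.sleep, seconds))

    def synchronize(self):
        # Block until all queued work is done (the role a device
        # synchronize call plays in a real profiler).
        for future in self._pending:
            future.result()
        self._pending.clear()


def time_op(dev, seconds, sync):
    """Time one simulated op, optionally synchronizing before stopping the clock."""
    dev.synchronize()  # drain earlier work so it is not charged to this op
    start = time.perf_counter()
    dev.launch(seconds)
    if sync:
        dev.synchronize()
    return time.perf_counter() - start


dev = FakeDevice()
unsynced = time_op(dev, 0.05, sync=False)  # measures only the launch overhead
synced = time_op(dev, 0.05, sync=True)     # measures the simulated op itself
dev.synchronize()
print(f"without sync: {unsynced * 1e3:.2f} ms, with sync: {synced * 1e3:.2f} ms")
```

Without the synchronize, the timer stops as soon as the launch returns, so the op looks almost free and its real cost ends up attributed to whichever later op happens to block.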