[Runtime] CUDA asynchronous execution makes profiling inaccurate

When we use the graph runtime to profile each op on the CUDA target, the measured execution time is sometimes inaccurate because CUDA kernel launches are asynchronous: the host-side timer can stop before the kernel has actually finished. Would it be possible to integrate something like cudaDeviceSynchronize into the debug module?

Thanks!
@Laurawly

It seems that graph_runtime_debug.cc already solves the synchronization issue: it adds a device synchronize call.
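
To illustrate why the synchronize call matters for per-op timing, here is a minimal sketch in plain Python. It does not use TVM or CUDA: `FakeAsyncDevice` and `time_op` are hypothetical names made up for this example, and a single worker thread stands in for a GPU stream whose launches return immediately. It is only a simulation of the effect, not the code in graph_runtime_debug.cc.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeAsyncDevice:
    """Hypothetical stand-in for a GPU stream: launch() returns immediately."""

    def __init__(self):
        # One worker thread gives in-order execution, like a single stream.
        self._stream = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def launch(self, seconds):
        # Enqueue the "kernel" and return without waiting for it to finish.
        self._pending.append(self._stream.submit(time.sleep, seconds))

    def synchronize(self):
        # Block until all queued work is done (the role cudaDeviceSynchronize
        # plays on a real device).
        for future in self._pending:
            future.result()
        self._pending.clear()


def time_op(device, seconds, sync):
    """Time one simulated op, optionally synchronizing before stopping the clock."""
    device.synchronize()  # drain earlier work so it is not charged to this op
    start = time.perf_counter()
    device.launch(seconds)
    if sync:
        device.synchronize()
    elapsed = time.perf_counter() - start
    device.synchronize()  # leave the device idle for the next measurement
    return elapsed


device = FakeAsyncDevice()
unsynced = time_op(device, 0.05, sync=False)  # measures only the launch overhead
synced = time_op(device, 0.05, sync=True)     # measures the full op duration
print(f"without sync: {unsynced * 1e3:.2f} ms, with sync: {synced * 1e3:.2f} ms")
```

Without the synchronize, the timer only covers the cost of enqueuing the work, so the op looks far cheaper than it is; with it, the measurement covers the whole execution.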