[Runtime] CUDA asynchronous execution makes profiling inaccurate


When we use the graph runtime to profile each op on a CUDA target, the reported execution time is sometimes inaccurate because CUDA kernels execute asynchronously with respect to the host. Would it be possible to integrate something like cudaDeviceSynchronize into the debug module?



It seems that graph_runtime_debug.cc already addresses the synchronization issue: it adds a device synchronize call.
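
For anyone who wants to see why the missing synchronization skews the numbers, here is a minimal sketch in plain Python. It does not use TVM or CUDA: a single-worker thread pool stands in for the GPU stream, and the names launch_kernel and time_op are hypothetical, made up for this illustration. Waiting on the future plays the role that cudaDeviceSynchronize would play on a real device.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a GPU stream: one worker thread runs "kernels" in order.
_stream = ThreadPoolExecutor(max_workers=1)


def launch_kernel(duration_s):
    """Enqueue work and return immediately, like an asynchronous kernel launch."""
    return _stream.submit(time.sleep, duration_s)


def time_op(duration_s, synchronize):
    """Time one simulated op; wait for completion only if synchronize is True."""
    start = time.perf_counter()
    pending = launch_kernel(duration_s)
    if synchronize:
        pending.result()  # plays the role of cudaDeviceSynchronize (assumption: analogy only)
    elapsed = time.perf_counter() - start
    pending.result()  # drain the queue so the next measurement starts clean
    return elapsed


unsynced = time_op(0.05, synchronize=False)
synced = time_op(0.05, synchronize=True)
print(f"without sync: {unsynced * 1e3:.3f} ms, with sync: {synced * 1e3:.3f} ms")
```

Without the wait, the timer only measures how long it takes to enqueue the work, so the op looks far cheaper than it is. With the wait, the measurement covers the full execution, which is the behaviour wanted from a per-op profiler.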