Get multiple outputs efficiently

My model has many output nodes, and I found that copying the output data from the GPU is time-consuming. The time is spent in the following line of code:
m.get_output(index).asnumpy()
The first output node costs the most.
Can someone explain why the first one costs so much? Is there a way to obtain the outputs more efficiently? Thank you very much!

@daweili1226 How do you measure the overhead? Have you tried ctx.sync() before get_output?
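
One likely explanation, assuming run() only queues work on the GPU and returns before that work has finished: the first asnumpy() is a blocking copy, so it has to wait for the whole inference, and that wait gets billed to the first output. A rough way to check is to time each call separately, with and without a sync. The sketch below does that. time_inference and the Fake* classes are hypothetical names made up for illustration, not TVM APIs; the fakes simulate an asynchronous device with a sleep so the script runs without TVM or a GPU. With a real module you would pass your own m and ctx instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_inference(m, ctx, num_outputs, sync_before_copy):
    """Time run() and each get_output(i).asnumpy() call separately."""
    timings = {}
    start = time.perf_counter()
    m.run()
    if sync_before_copy:
        ctx.sync()  # wait for queued device work, so the wait is billed to run()
    timings["run"] = time.perf_counter() - start

    for i in range(num_outputs):
        start = time.perf_counter()
        m.get_output(i).asnumpy()
        timings["output_%d" % i] = time.perf_counter() - start
    return timings


class FakeDevice:
    """Hypothetical stand-in for an asynchronous device (not a TVM class)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def launch(self, seconds):
        # Queue the "kernel" on a worker thread and return immediately.
        self._pending = self._pool.submit(time.sleep, seconds)

    def sync(self):
        # Block until the queued work, if any, has finished.
        if self._pending is not None:
            self._pending.result()


class FakeOutput:
    """Hypothetical stand-in for an output array living on the device."""

    def __init__(self, device):
        self._device = device

    def asnumpy(self):
        self._device.sync()  # a blocking copy must wait for pending work
        return [0.0]


class FakeModule:
    """Hypothetical stand-in for the graph module in the question."""

    def __init__(self, device, kernel_seconds=0.2):
        self._device = device
        self._kernel_seconds = kernel_seconds

    def run(self):
        self._device.launch(self._kernel_seconds)

    def get_output(self, index):
        return FakeOutput(self._device)


device = FakeDevice()
module = FakeModule(device)
no_sync = time_inference(module, device, 3, sync_before_copy=False)
with_sync = time_inference(module, device, 3, sync_before_copy=True)
print("no sync:  ", no_sync)
print("with sync:", with_sync)
```

If the assumption holds for your setup, the sync should move the wait from the first get_output into run() while leaving the total about the same.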

Thank you for the reply! Calling ctx.sync() first does reduce the time spent in get_output.
But with it added, run() takes longer, so the overall duration does not change. Can I do an asynchronous memory copy from the GPU to solve this problem?