My model has many output nodes, and I found that copying the output data from the GPU is time-consuming. This happens at the following line of code:
m.get_output(index).asnumpy()
The first output node costs the most.
Can someone explain why the first one costs so much? Is there a way to obtain the outputs more efficiently? Thank you very much!
Thank you for the reply! It reduces the time spent in get_output.
However, after adding this, the run() duration is longer, so the overall duration does not change. Can I do an asynchronous memory copy from the GPU to solve this problem?
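A minimal sketch of how the three stages could be timed separately, to see where the time actually goes. It assumes (this is not confirmed in the thread) that GPU work is launched asynchronously, so that whichever call first waits for the device absorbs the compute time. The helper `time_stages` and its three callables are hypothetical names introduced only for illustration.

```python
import time


def time_stages(run, sync, fetch):
    """Time launch, device synchronization, and host copy separately.

    run, sync, and fetch are zero-argument callables supplied by the caller
    (hypothetical names; they are not part of any library API).
    """
    t0 = time.perf_counter()
    run()              # launch the computation
    t1 = time.perf_counter()
    sync()             # wait until the device has finished
    t2 = time.perf_counter()
    outputs = fetch()  # copy the results back to the host
    t3 = time.perf_counter()

    timings = {"run": t1 - t0, "sync": t2 - t1, "fetch": t3 - t2}
    return outputs, timings


# Possible usage with the module from the question. This assumes that `m` is
# the module from the post, that `ctx` is a device handle with a sync()
# method, and that `num_outputs` is the number of output nodes.
# outputs, timings = time_stages(
#     m.run,
#     ctx.sync,
#     lambda: [m.get_output(i).asnumpy() for i in range(num_outputs)],
# )
```

If the "sync" entry dominates, the cost is the computation itself rather than the copy. In that case an asynchronous copy alone would not be expected to shorten the total time.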