[NNVM/Relay] How can the graph-level optimizations in NNVM or Relay work in the training phase?

It is easy to understand that the graph-level optimizations make sense in the inference phase, because we only need the results of the final layer.
However, the training phase contains two passes: we need the activations of each layer from the forward pass and also the gradients propagated back from the backward pass. So I wonder how graph-level optimizations, such as operator fusion, work in the training process?

The backward pass doesn't differ much from the forward pass: some gradient nodes and weight-update nodes are added to the graph. Operator fusion should work the same way as in the forward pass. Passes like simplifying batchnorm might need a different implementation.
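
To make this concrete, here is a minimal sketch (the toy network, the variable names, and the use of `relay.transform.gradient` with `mode="first_order"` are my own illustration, assuming a TVM build where the Relay Python API is available) of how the backward computation is appended to the same graph as extra gradient nodes:

```python
import tvm
from tvm import relay

# A toy forward graph: dense followed by ReLU.
x = relay.var("x", shape=(1, 16), dtype="float32")
w = relay.var("w", shape=(8, 16), dtype="float32")
fwd = relay.Function([x, w], relay.nn.relu(relay.nn.dense(x, w)))

mod = relay.transform.InferType()(tvm.IRModule.from_expr(fwd))

# gradient() rewrites the function so it returns both the forward output and
# the gradients w.r.t. the inputs; the gradient nodes are simply added to the
# same expression, giving one combined forward + backward graph.
train_fn = relay.transform.gradient(mod["main"], mode="first_order")
print(train_fn)
```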

Thank you for your explanation.

The key point is how to cope with the intermediate results.
In the training phase, all of the intermediate results are necessary and cannot be dropped by operator fusion.
I still wonder how graph optimization is applied in the training phase.

The forward + backward pass together is still a DAG, so operator fusion should be able to follow the same rules.
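
As a rough sketch of that point (reusing the toy dense + ReLU graph from above, again my own illustration rather than anything from this thread), the usual fusion pass can simply be run over the combined forward + backward module. Roughly speaking, intermediate activations that the gradient nodes still need end up at the boundaries of the fused functions rather than being fused away:

```python
import tvm
from tvm import relay

# Toy forward graph: dense followed by ReLU.
x = relay.var("x", shape=(1, 16), dtype="float32")
w = relay.var("w", shape=(8, 16), dtype="float32")
fwd = relay.Function([x, w], relay.nn.relu(relay.nn.dense(x, w)))
mod = relay.transform.InferType()(tvm.IRModule.from_expr(fwd))

# Append the backward computation, then fuse the whole DAG like any other graph.
train_mod = tvm.IRModule.from_expr(
    relay.transform.gradient(mod["main"], mode="first_order")
)
train_mod = relay.transform.InferType()(train_mod)
fused = relay.transform.FuseOps(fuse_opt_level=2)(train_mod)
print(fused)
```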

I see, the explanation is clear, thank you!