[NNVM/Relay] How can the graph-level optimizations in NNVM or Relay work in the training phase?

It is easy to understand that the graph-level optimizations make sense in the inference phase, because we only need the results of the final layer.
However, the training phase contains two passes: we need the activations of each layer from the forward pass and also the gradients propagated back from the backward pass. So I wonder how graph-level optimizations, such as operator fusion, work in the training process?

The backward pass doesn't differ much from the forward pass: some gradient nodes and weight-update nodes are added to the graph. Operator fusion should work the same way as in the forward pass. Passes like simplifying batchnorm might need a different implementation.
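
To make this concrete, here is a minimal sketch (the toy network, the variable names, and the use of `relay.transform.gradient` with `mode="first_order"` are my own illustration, assuming a TVM build where the Relay Python API is available) of how the backward computation is appended to the same graph as extra gradient nodes:

```python
import tvm
from tvm import relay

# A toy forward graph: dense followed by ReLU.
x = relay.var("x", shape=(1, 16), dtype="float32")
w = relay.var("w", shape=(8, 16), dtype="float32")
fwd = relay.Function([x, w], relay.nn.relu(relay.nn.dense(x, w)))

mod = relay.transform.InferType()(tvm.IRModule.from_expr(fwd))

# gradient() rewrites the function so it returns both the forward output and
# the gradients w.r.t. the inputs; the gradient nodes are simply added to the
# same expression, giving one combined forward + backward graph.
train_fn = relay.transform.gradient(mod["main"], mode="first_order")
print(train_fn)
```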

Thank you for your explanation.

The key point is how to cope with the intermediate results.
In the training phase, all of the intermediate results are necessary and cannot be dropped by operator fusion.
I still wonder how graph optimization is applied in the training phase.

The forward + backward pass together is still a DAG, so operator fusion should be able to follow the same rules.
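
As a rough sketch of that point (reusing the toy dense + ReLU graph from above, again my own illustration rather than anything from this thread), the usual fusion pass can simply be run over the combined forward + backward module. Roughly speaking, intermediate activations that the gradient nodes still need end up at the boundaries of the fused functions rather than being fused away:

```python
import tvm
from tvm import relay

# Toy forward graph: dense followed by ReLU.
x = relay.var("x", shape=(1, 16), dtype="float32")
w = relay.var("w", shape=(8, 16), dtype="float32")
fwd = relay.Function([x, w], relay.nn.relu(relay.nn.dense(x, w)))
mod = relay.transform.InferType()(tvm.IRModule.from_expr(fwd))

# Append the backward computation, then fuse the whole DAG like any other graph.
train_mod = tvm.IRModule.from_expr(
    relay.transform.gradient(mod["main"], mode="first_order")
)
train_mod = relay.transform.InferType()(train_mod)
fused = relay.transform.FuseOps(fuse_opt_level=2)(train_mod)
print(fused)
```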

I see, the explanation is clear, thank you!