[TensorFlow Frontend] Could we make the layout NCHW instead of NHWC?

Currently, our frontend keeps TensorFlow's NHWC data layout as-is, and when translating convolution we insert transposes to flip the data into NCHW. I want to know whether we can translate TensorFlow's layout to NCHW on the fly in the frontend, so that we can avoid the transposes. If a model has heavy convolution workloads, the transpose cost can be very large. I converted the TensorFlow Lite frontend from NHWC to Relay NCHW, and I think there are only a few special ops (reshape, squeeze) that need extra care, so I believe this is worth doing. For example, the Intel OpenVINO, TensorRT, and TensorFlow-CoreML converters all accept TensorFlow NHWC input but output NCHW layout. @srkreddy1238
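To make the question concrete, here is a minimal sketch (illustrative only, not the actual frontend code) of what the current approach effectively produces: the NHWC tensor is transposed to NCHW for the convolution and transposed back afterwards. Shapes and attributes are made up for the example.

```python
import tvm
from tvm import relay

# NHWC input as it comes from the TensorFlow graph (example shape).
data = relay.var("data", shape=(1, 224, 224, 3), dtype="float32")
weight = relay.var("weight", shape=(64, 3, 7, 7), dtype="float32")  # OIHW kernel

# What the frontend effectively emits today: transpose in, convolve in NCHW,
# transpose back so the rest of the NHWC graph still matches.
nchw = relay.transpose(data, axes=(0, 3, 1, 2))
conv = relay.nn.conv2d(nchw, weight, strides=(2, 2), padding=(3, 3),
                       channels=64, kernel_size=(7, 7),
                       data_layout="NCHW", kernel_layout="OIHW")
out = relay.transpose(conv, axes=(0, 2, 3, 1))

mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
print(mod)
```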


@FrozenGene
I tried a full-network layout change some time back, and yes, customising the special ops is a challenge.
I found that the list of such operators keeps growing and it's difficult to generalise for all cases.

Hence I went with transposes around convolution. The beauty of TVM is that these transpose operations get fused with other ops. I think we should measure the performance difference between a fully converted model and a transpose-based conversion for a few models, and then look into this new direction.
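For the measurement itself, something along these lines could work. This is only a rough sketch: `benchmark_module` is a hypothetical helper, and the graph-executor timing uses current TVM APIs rather than the NNVM-era ones.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def benchmark_module(mod, params, target="llvm", input_name="data",
                     shape=(1, 224, 224, 3)):
    """Compile a Relay module and return its mean run time in seconds."""
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    dev = tvm.device(target, 0)
    runtime = graph_executor.GraphModule(lib["default"](dev))
    runtime.set_input(input_name, np.random.rand(*shape).astype("float32"))
    ftimer = runtime.module.time_evaluator("run", dev, number=10, repeat=3)
    return float(np.mean(ftimer().results))

# Compare the two frontend strategies on the same model:
# t_nchw = benchmark_module(mod_nchw, params)            # layout converted in the frontend
# t_transpose = benchmark_module(mod_transpose, params)  # transposes around conv2d
```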

@srkreddy1238 I want to explain why I also support making the TensorFlow frontend output NCHW.

Firstly, regarding the special ops, I think it is a balance. Intel OpenVINO, TensorRT, TF-CoreML, and our TFLite frontend all face a similar problem. When I first wrote the TFLite frontend, I faced the same question: keep NHWC or translate to NCHW? I chose to translate to NCHW, for these reasons:

  1. Performance. In our internal TF model, in the bad case the transpose time occupies 1/3 to 1/2 of the total execution time. I think this will be even worse on edge devices, which don't have strong CPUs / GPUs, or with heavier convolution workloads.

  2. Our TVM internal optimization support. Our optimizations support NCHW very well. For example, for auto-tuning on ARM CPU we only support the NCHW data layout for depthwise convolution.

  3. Unified experience. Almost all of our other frontends consume / output the NCHW layout. If we unify the frontends on NCHW, users will not be confused about the input / output layout.

  4. The strange transformations on special ops are not common. For special ops like Reshape / Squeeze, unusual transformations such as 7D -> 3D are rare. We can support the common shape transformations, as I did in the TFLite frontend, and add the special cases later if needed (see the sketch below). Other converters, such as Intel OpenVINO and TF-CoreML, take the same approach.
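As a concrete illustration of point 4, here is a hedged sketch (a hypothetical helper, not the actual TFLite frontend code) of the kind of special handling Reshape needs once every tensor is carried internally in NCHW: the safe fallback for the common 4D case is to restore NHWC order, apply the original NHWC-based target shape, and go back to NCHW if the result is still 4D. A smarter converter can often elide these transposes when the shape pattern allows.

```python
from tvm import relay

def convert_reshape_nchw(data_nchw, new_shape, input_rank):
    """Reshape a tensor the frontend keeps in NCHW, given a target shape
    that was written against the original NHWC tensor."""
    if input_rank == 4:
        # Restore the NHWC order the original graph assumed.
        data = relay.transpose(data_nchw, axes=(0, 2, 3, 1))
    else:
        data = data_nchw
    out = relay.reshape(data, newshape=new_shape)
    if len(new_shape) == 4:
        # 4D result: move it back to the internal NCHW layout.
        out = relay.transpose(out, axes=(0, 3, 1, 2))
    return out
```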

I feel that automatic layout transformation should be done at the level of the IR (Relay), rather than in the frontend itself.

+1 for doing it in Relay / NNVM. The most optimized layout can differ across target devices. The key to eliminating layout transforms is to have the other operators (pooling, reshape, etc.) adapt to a modified layout. This is a more general solution for all targets.
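For reference, current TVM ships an IR-level pass that is essentially this idea: ConvertLayout rewrites conv2d to the desired layout and adapts the surrounding layout-sensitive ops, so the frontend can stay NHWC. A minimal, self-contained example (shapes and attributes are made up):

```python
import tvm
from tvm import relay

# A tiny NHWC convolution, standing in for a model imported from TensorFlow.
data = relay.var("data", shape=(1, 56, 56, 32), dtype="float32")
weight = relay.var("weight", shape=(3, 3, 32, 32), dtype="float32")  # HWIO kernel
conv = relay.nn.conv2d(data, weight, channels=32, kernel_size=(3, 3),
                       padding=(1, 1), data_layout="NHWC", kernel_layout="HWIO")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Ask Relay to rewrite conv2d to NCHW; other ops adapt, and redundant
# layout_transforms are cancelled by the usual optimization passes.
desired_layouts = {"nn.conv2d": ["NCHW", "default"]}
seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout(desired_layouts),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
print(mod)
```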