Currently we need to use `strided_slice` to split the output of `CombineParallelConv2D`. If the batch size is 1, or if C is the leading dimension, the copy can be eliminated: we can reuse the original buffer and calculate the offset of each slice.
That is true. However, we can skip this for now, as it is not necessarily the bottleneck and might require a bit more thought in planning.
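
For reference, here is a minimal sketch of the offset calculation discussed above. It assumes a dense, row-major buffer, and `slice_views` is a hypothetical helper written for illustration, not an existing TVM function. A slice along the channel axis is contiguous only when every dimension before that axis has extent 1, which covers both the batch size 1 case and the case where C is the leading dimension.

```python
from math import prod


def slice_views(shape, axis, sizes):
    """Describe each slice along `axis` as (offset, length) in elements.

    Assumes a dense row-major buffer (an assumption of this sketch).
    Returns None when the slices are not contiguous, in which case a
    copy such as strided_slice would still be needed.
    """
    if sum(sizes) != shape[axis]:
        raise ValueError("slice sizes must add up to the extent of the split axis")

    # Slices are contiguous only if all dimensions before `axis` have extent 1.
    if any(extent != 1 for extent in shape[:axis]):
        return None

    # Number of elements covered by one index step along `axis`.
    inner = prod(shape[axis + 1:])

    views = []
    start = 0
    for size in sizes:
        views.append((start * inner, size * inner))
        start += size
    return views


# NCHW with batch size 1: two branches with 2 and 4 output channels.
print(slice_views((1, 6, 2, 2), axis=1, sizes=[2, 4]))
# NCHW with batch size 2: slices interleave across the batch, so no view exists.
print(slice_views((2, 6, 2, 2), axis=1, sizes=[2, 4]))
```

With batch size 2 the helper returns `None`, which is the case where the existing `strided_slice` copy is still required.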