Optimizing memory for ARM CPU on Android

My TF model is 120 MB. The TVM params for it are around 320 MB, and when I run it on-device it uses 969 MB of heap.
Is there any way (other than disabling Winograd) to reduce the memory footprint?

I have a custom inference engine that runs exactly the same model (using NNPACK for convolutions) and needs only around 320 MB (but is 2x slower).