Lib.export_library (_SaveToFile) for NHWC models hangs for i686-linux-android

The issue exists only for NHWC models!

Affected models:

  • All tflite models
  • TensorFlow models, if you do not pass layout='NCHW' to relay.frontend.from_tensorflow() (see the example below)
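
A minimal example of the difference, using the same relay.frontend.from_tensorflow call that appears later in this thread (graph_def is the loaded frozen GraphDef; the input name and shape are the MobileNet ones used below):

# default layout (NHWC) - affected by the hang
mod, params = relay.frontend.from_tensorflow(graph_def,
                                             shape={"input": (1, 224, 224, 3)})

# explicit NCHW layout - not affected
mod, params = relay.frontend.from_tensorflow(graph_def, layout='NCHW',
                                             shape={"input": (1, 224, 224, 3)})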

I tried to compile a non-quantized TFLite MobileNet model for Android i686.
The compilation works fine, but the shared library export hangs (the _SaveToFile call hangs).
I tried it on Ubuntu 16, 18 and macOS, with LLVM 6.0.0 and 7.0.1.

To reproduce the issue:
The non-quantized TFLite MobileNet model can be downloaded from https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

Install the tflite package:

wget https://raw.githubusercontent.com/dmlc/web-data/master/tensorflow/tflite/whl/tflite-0.0.1-py3-none-any.whl
pip3 install tflite-0.0.1-py3-none-any.whl --user

Install android-ndk-r18:

# for Mac OS
wget https://dl.google.com/android/repository/android-ndk-r18b-darwin-x86_64.zip
unzip android-ndk-r18b-darwin-x86_64.zip
sudo mv android-ndk-r18b /opt
export TVM_NDK_CC=/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/darwin-x86_64/bin/i686-linux-android-g++

# for Linux
wget https://dl.google.com/android/repository/android-ndk-r18b-linux-x86_64.zip
unzip android-ndk-r18b-linux-x86_64.zip
sudo mv android-ndk-r18b /opt
export TVM_NDK_CC=/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/linux-x86_64/bin/i686-linux-android-g++

Code to compile and export (lib.export_library hangs):

#!/usr/bin/env python3

import tvm
from tvm import relay
from tvm.contrib import ndk
import tflite.Model

input_tensor = "input"

input_shape = (1, 224, 224, 3)
input_dtype = "float32"

m_file = "mobilenet_v1_1.0_224.tflite"
print(m_file)

tflite_model_buf = open(m_file, "rb").read()

# get TFLite model from buffer
tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)


mod, params = relay.frontend.from_tflite(tflite_model,
                                          shape_dict={input_tensor: input_shape},
                                          dtype_dict={input_tensor: input_dtype})

target = "llvm -target=i686-linux-android"
print("target: {}".format(target))

target_host = None
print("target_host: {}".format(target_host))

print("Compiling...")

with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod[mod.entry_func], target, target_host=target_host, params=params)

print("Compilation done")

print("Saving files")
# save the graph, lib and params into separate files
path_lib = "model.so"
#arch-x86 (i686-linux-android)                                        
sysroot="/opt/android-ndk-r18b/platforms/android-21/arch-x86"                         
toolchain="/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/linux-x86_64"
options=["-shared", "-fPIC", "--sysroot", sysroot, "--gcc-toolchain="+toolchain]
lib.export_library(path_lib, ndk.create_shared, options=options)
# the line above hangs ^^^ (in particular `_SaveToFile` call hangs)
print("export_library done")

with open("model.json", "w") as fo:
    fo.write(graph)
with open("model.params", "wb") as fo:
    fo.write(relay.save_param_dict(params))

print("Files saved")

I also tried to compile and export the model for other platforms - results are below:

  • aarch64-linux-android - works!
  • x86_64-linux-android - works!
  • i686-linux-android - hangs!
# aarch64-linux-android - works!
export TVM_NDK_CC=/opt/android-ndk-r18b/toolchains/aarch64-linux-android-4.9/prebuilt/linux-x86_64/bin/aarch64-linux-android-g++
target = "llvm -target=aarch64-linux-android"
sysroot="/opt/android-ndk-r18b/platforms/android-21/arch-arm64"                         
toolchain="/opt/android-ndk-r18b/toolchains/aarch64-linux-android-4.9/prebuilt/linux-x86_64"

# x86_64-linux-android - works!
export TVM_NDK_CC=/opt/android-ndk-r18b/toolchains/x86_64-4.9/prebuilt/linux-x86_64/bin/x86_64-linux-android-g++
target = "llvm -target=x86_64-linux-android"
sysroot="/opt/android-ndk-r18b/platforms/android-21/arch-x86_64"                         
toolchain="/opt/android-ndk-r18b/toolchains/x86_64-4.9/prebuilt/linux-x86_64"

# i686-linux-android - hangs!
export TVM_NDK_CC=/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/linux-x86_64/bin/i686-linux-android-g++
target = "llvm -target=i686-linux-android"
sysroot="/opt/android-ndk-r18b/platforms/android-21/arch-x86"                         
toolchain="/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/linux-x86_64"

@tqchen @zhiics @FrozenGene @kevinthesun Do you have any ideas why NHWC + i686-linux-android makes SaveToFile hang?

Try llvm -target=i386-linux-android. I am not sure whether the i686 target string is the reason.

I tried llvm -target=i386-linux-android with the NHWC model; _SaveToFile still hangs.

i686-linux-android works fine with NCHW models, but it hangs for NHWC models (_SaveToFile hangs).

I narrowed this issue down just now. It is not related to export_library; the bug is in our compilation for the i686-linux-android target. We can reproduce it with just:

print("Saving files")
lib.save('a.o')

and it will hang. The relevant codegen code: https://github.com/dmlc/tvm/blob/master/src/codegen/llvm/llvm_module.cc#L83-L100

This needs deeper investigation.

Any ideas why only NHWC models are affected?

I find this a very interesting issue. Let us run

lib.save('a.ll')

Then, if we use llc -mtriple=i686-unknown-linux-android -filetype=obj a.ll, it still hangs. However,
when we use clang --target=i686-unknown-linux-android a.ll -c, it works fine.

In my environment, TFLite models hang no matter whether the layout is NCHW or NHWC.

Not sure if it is very relevant, but I might have seen a similar problem here: Slow graph_runtime.create for LLVM target - Culprit LazyInitJIT

I found that for some targets our schedules with LLVM can take a very long time. I have not investigated this deeply, so I am not sure why it happens. If you are blocked and need an urgent fix, you can try falling back to the default schedules to see whether the issue still persists.

In my environment, TFLite models hang no matter whether the layout is NCHW or NHWC.

Zhao, to see the difference between NCHW and NHWC you need to use regular TensorFlow models (.pb file).

NHWC+i686 - hangs!

model_path = "mobilenet_v1_1.0_224_frozen.pb"
dshape = (1, 224, 224, 3) 
input_tensor = "input" 
...
mod, params = relay.frontend.from_tensorflow(graph_def, layout='NHWC', shape={input_tensor: dshape})
target = "llvm -target=i686-linux-android"
...
lib.export_library(path_lib, ndk.create_shared, options=options)
# ^^^Hangs - layout='NHWC' + -target=i686-linux-android

NCHW+i686 - works!

model_path = "mobilenet_v1_1.0_224_frozen.pb"
dshape = (1, 224, 224, 3) 
input_tensor = "input" 
...
mod, params = relay.frontend.from_tensorflow(graph_def, layout='NCHW', shape={input_tensor: dshape})
target = "llvm -target=i686-linux-android"
...
lib.export_library(path_lib, ndk.create_shared, options=options)
# ^^^Works fine - layout='NCHW' + -target=i686-linux-android

The model .pb file can be downloaded from mobilenet_v1.md.

Thanks. I am interested in this problem. I will spend some time investigating it next week.

I decided to check how the lib.save() duration changes depending on the graph size.
Graph visualization: mobilenet_v1_1.0_224.pdf
I took the TensorFlow MobileNet v1 model and converted it to TFLite format, specifying different output tensors (output_arrays):

tflite_convert \
--graph_def_file=mobilenet_v1_1.0_224_frozen.pb \
--output_file=/tmp/foo.tflite \
--output_format=TFLITE \
--input_arrays=input \
--input_shapes=1,224,224,3 \
--inference_type=FLOAT \
--output_arrays="MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6"

After that I compiled the resulting tflite file for llvm -target=i686-linux-android and measured the time needed to call lib.save().
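
For reference, this is roughly how the per-model timing can be measured (a sketch; the model path is the tflite_convert output above, and the input name, shape and target are the ones used earlier in this thread):

import time

import tflite.Model
from tvm import relay

# load the truncated TFLite model produced by tflite_convert above
tflite_model_buf = open("/tmp/foo.tflite", "rb").read()
tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)

mod, params = relay.frontend.from_tflite(tflite_model,
                                          shape_dict={"input": (1, 224, 224, 3)},
                                          dtype_dict={"input": "float32"})

with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, "llvm -target=i686-linux-android",
                                     params=params)

# time only the object-file emission step
t0 = time.time()
lib.save("lib.o")
print("lib.save() took {:.0f} ms".format((time.time() - t0) * 1000))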

Below are the results in ms:

#          output_arrays name            - lib.save() time in ms
MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6 - 1,815
MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6 - 2,163
MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6 - 2,509
MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6 - 4,146
MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6 - 5,566
MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6 - 5,619
MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6 - 11,487
MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6 - 11,391
MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6 - 17,253
MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6 - 17,342
MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6 - 17,933
MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6 - 17,315
MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6 - 18,583
MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6 - 69,063
MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6 - 96,175
MobilenetV1/Predictions/Reshape_1 - 114,941

lib.save time starts to grow significantly at layers 12 and 13 (+51 and +30 sec).

lib.save() needs almost 2 minutes to save the lib.o file for the full mobilenet_v1_1.0_224 model.

I also decided to check the lib.o size for different architectures and graph sizes:

  • i686-linux-android
  • x86_64-linux-android
  • arm64-linux-android
  • arm-linux-androideabi

lib.o size (bytes)           i686     x86_64    arm64    arm
Conv2d_1_pointwise/Relu6  - 136,576   61,864   61,000   89,648
Conv2d_2_pointwise/Relu6  - 173,384
Conv2d_3_pointwise/Relu6  - 210,668
Conv2d_4_pointwise/Relu6  - 275,612
Conv2d_5_pointwise/Relu6  - 338,176
Conv2d_6_pointwise/Relu6  - 443,792
Conv2d_7_pointwise/Relu6  - 547,792  277,768  288,944  470,916
Conv2d_8_pointwise/Relu6  - 547,784
Conv2d_9_pointwise/Relu6  - 547,784
Conv2d_10_pointwise/Relu6 - 547,784
Conv2d_11_pointwise/Relu6 - 547,804
Conv2d_12_pointwise/Relu6 - 744,808  327,456  338,496  663,212 (13sec)
Conv2d_13_pointwise/Relu6 - 944,968  380,480  390,560  856,512 (20sec)

I also found that the lib.save duration does not depend on opt_level. The size numbers above are for opt_level=2.

I also noticed that the lib.o output file size is significantly different for clang and llc:
452,108 vs 734,696 bytes (llc gives a 62.5% bigger file).
Probably we are missing some parameters when we call llc.

The sizes are for the model ending with tensor MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6.

If you add the -O3 flag to the clang command, it takes the same time as the llc command. It will be very slow.

clang -v tells us what command it actually runs:

time /usr/local/Cellar/llvm/7.0.1/bin/clang -v -O3 \
--target=i686-linux-android /tmp/lib.ll -c -o lib_clang.o

/usr/local/Cellar/llvm/7.0.1/bin/clang-7 -cc1 \
-triple i686--linux-android -emit-obj \
-disable-free -disable-llvm-verifier -discard-value-names \
-main-file-name lib.ll -mrelocation-model pic -pic-level 2 \
-mthread-model posix -fmath-errno \
-masm-verbose -mconstructor-aliases -fuse-init-array \
-target-cpu i686 -target-feature +ssse3 -dwarf-column-info \
-debugger-tuning=gdb -target-linker-version 305 \
-momit-leaf-frame-pointer -v \
-coverage-notes-file /tmp/lib_clang.gcno \
-resource-dir /usr/local/Cellar/llvm/7.0.1/lib/clang/7.0.1 \
-O3 -fdebug-compilation-dir /tmp -ferror-limit 19 -fmessage-length 181 \
-fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics \
-vectorize-loops -vectorize-slp \
-o lib_clang.o -x ir /tmp/lib.ll -faddrsig

clang -cc1 version 7.0.1 based upon LLVM 7.0.1 default target x86_64-apple-darwin16.7.0

real	0m56.101s

The options that make it slow are:

-O3
-vectorize-loops
-vectorize-slp

I found a temporary workaround to get the model.so file for i686-linux-android in a reasonable time.

# build lib
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, 'llvm -target=i686-linux-android', params=params)

# save lib as ll file
lib.save("model.ll")

# compile .ll to .o using clang with -O3 and -fPIC flags
# It might take 60-100 sec
clang -O3 -fPIC --target=i686-linux-android model.ll -c -o model.o

# create shared library for android (I use optional -static-libstdc++ flag here)
/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/linux-x86_64/bin/i686-linux-android-g++ \
-shared -static-libstdc++ \
--sysroot /opt/android-ndk-r18b/platforms/android-21/arch-x86 \
--gcc-toolchain=/opt/android-ndk-r18b/toolchains/x86-4.9/prebuilt/linux-x86_64 \
-o model.so model.o

clang used the following command:

/usr/lib/llvm-6.0/bin/clang -cc1 -triple i686--linux-android \
-emit-obj -disable-free -disable-llvm-verifier -discard-value-names \
-main-file-name model.ll -mrelocation-model pic -pic-level 2 \
-mthread-model posix -fmath-errno -masm-verbose \
-mconstructor-aliases -fuse-init-array -target-cpu i686 \
-target-feature +ssse3 -dwarf-column-info -debugger-tuning=gdb \
-momit-leaf-frame-pointer -v \
-coverage-notes-file /tmp/model.gcno \
-resource-dir /usr/lib/llvm-6.0/lib/clang/6.0.0 \
-O3 \
-fdebug-compilation-dir /tmp -ferror-limit 19 \
-fmessage-length 181 -femulated-tls \
-fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics \
-vectorize-loops -vectorize-slp \
-o model.o -x ir model.ll

I ran llc with the --time-passes option.
Most of the time is taken by:

  • X86 DAG->DAG Instruction Selection - 7.5 sec
  • Type Legalization - 5.7 sec
$llc -mtriple=i686-linux-android -filetype=obj --time-passes model.ll -o model_llc.o
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 9.3422 seconds (9.3416 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.5248 ( 81.3%)   0.0180 ( 22.3%)   7.5429 ( 80.7%)   7.5428 ( 80.7%)  X86 DAG->DAG Instruction Selection
   0.6356 (  6.9%)   0.0000 (  0.0%)   0.6356 (  6.8%)   0.6355 (  6.8%)  Loop Strength Reduction
   0.3389 (  3.7%)   0.0506 ( 62.5%)   0.3895 (  4.2%)   0.3895 (  4.2%)  Greedy Register Allocator

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 7.4862 seconds (7.4859 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   5.6957 ( 76.2%)   0.0007 (  5.0%)   5.6963 ( 76.1%)   5.6963 ( 76.1%)  Type Legalization
   1.1657 ( 15.6%)   0.0053 ( 39.9%)   1.1710 ( 15.6%)   1.1709 ( 15.6%)  Instruction Scheduling
   0.4399 (  5.9%)   0.0008 (  5.9%)   0.4407 (  5.9%)   0.4406 (  5.9%)  DAG Combining after legalize types

@tqchen @HungMingWu @inouehrs It looks like using clang is a more reliable approach than llvm_module.cc SaveToFile for generating .o object files.

We can use SaveToFile to generate the LLVM IR file lib.ll and then use clang to generate the object file lib.o:

clang -O3 -fPIC --target=<triple> lib.ll -c -o lib.o
  • This approach will solve the SaveToFile hanging issue for i386/i686 (32-bit x86) platforms.
  • It will also make object file generation easier to troubleshoot, because there is a lot of information on how to use clang available online. A Python sketch of this flow follows below.
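
A minimal sketch of how this flow could look from Python (the helper name is only illustrative; clang is assumed to be on PATH, and TVM_NDK_CC to be set as in the steps above):

import subprocess

from tvm.contrib import ndk

def export_via_clang(lib, triple, so_path, ll_path="lib.ll", obj_path="lib.o"):
    # emit LLVM IR via SaveToFile (this path does not hang)
    lib.save(ll_path)
    # let clang, instead of the in-process codegen, produce the object file
    subprocess.check_call(["clang", "-O3", "-fPIC", "--target=" + triple,
                           ll_path, "-c", "-o", obj_path])
    # link the object into a shared library with the NDK compiler (TVM_NDK_CC)
    ndk.create_shared(so_path, [obj_path], options=["-shared", "-fPIC"])

# e.g. export_via_clang(lib, "i686-linux-android", "model.so")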