A deploy problem after getting .so from autotune on x86 cpu using C++

dolphintear · June 26, 2019, 8:14am

Hi @kevinthesun,
everything runs OK, except that the output is wrong.
cv::Mat tensor = cv::dnn::blobFromImage(inputImageAligned,1.0,cv::Size(256,256),cv::Scalar(0,0,0),true);
constexpr int device_type = kDLCPU;
constexpr int device_id = 0;
constexpr int in_ndim = 4;
//const int64_t in_shape[in_ndim] = {1, 3, 256, 256};
const int64_t in_shape[in_ndim] = {1, 256, 256, 3};
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);
TVMArrayCopyFromBytes(input, tensor.data, 256*3*256*4);

(I did the autotune with the NHWC layout, input_shape = (1, 256, 256, 3). Or was in_shape not set correctly for deployment?)

When doing the autotune, I got the correct output as follows:
tvm_output = module.get_output(0, tvm.nd.empty((65536, 2), 'float32'))
tvm_output_to_numpy = tvm_output.asnumpy()
mask_1 = tvm_output_to_numpy[:,1].reshape(256,256)
mask_2 = tvm_output_to_numpy[:,0].reshape(256,256)

Could I get the output in the right format in the following way for deployment in C++?
tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
tvm::runtime::NDArray res = get_output(0);
cv::Mat vector(65536, 2, CV_32F);
memcpy(vector.data, res->data, 65536*4*2);
cv::Mat mask = vector.reshape(2, 256).clone();
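
For reference, the Python code above takes column 1 (and column 0) of the row-major (65536, 2) output and reshapes each column to 256 x 256. A minimal sketch of the same indexing in plain C++ follows, assuming the output is a contiguous row-major float32 buffer. `extract_column` is a hypothetical helper written for illustration; it is not part of TVM or OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy one column out of a contiguous row-major
// (rows x cols) float buffer. The result has `rows` elements and can be
// viewed as a 256 x 256 mask when rows == 65536.
std::vector<float> extract_column(const float* data, std::size_t rows,
                                  std::size_t cols, std::size_t col) {
    assert(col < cols);
    std::vector<float> out(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        out[r] = data[r * cols + col];  // element (r, col) in row-major order
    }
    return out;
}
```

With the output data viewed as a `const float*`, `extract_column(ptr, 65536, 2, 1)` would correspond to `tvm_output_to_numpy[:,1]`, and element (y, x) of the mask sits at index `y * 256 + x`. By contrast, `vector.reshape(2, 256)` yields a single 256 x 256 two-channel Mat with both columns interleaved, so the channels would still have to be separated (for example with cv::split) to match the Python result.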

thanks a lot!

dolphintear · June 26, 2019, 10:38am

It may be caused by cv::dnn::blobFromImage, which returns the blob in NCHW layout: {1, 3, 256, 256}. Is there an efficient, less time-consuming way to convert HWC to NHWC using OpenCV in C++?
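
One observation that may help here: for a batch of one, an interleaved HWC image already has the same memory order as NHWC, so no transpose is needed, only a type conversion and optionally a channel swap. A minimal sketch in plain C++ follows, assuming the image is already resized to the network's input size and stored contiguously as 8-bit interleaved pixels. `hwc_u8_to_nhwc_f32` is a hypothetical helper, not an OpenCV or TVM function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert an interleaved HWC uint8 image to a float32
// NHWC buffer with N = 1. If swap_rb is true, channels 0 and 2 are exchanged
// (BGR <-> RGB); this is only allowed for 3-channel images.
std::vector<float> hwc_u8_to_nhwc_f32(const std::uint8_t* src, std::size_t height,
                                      std::size_t width, std::size_t channels,
                                      bool swap_rb) {
    assert(!swap_rb || channels == 3);
    std::vector<float> dst(height * width * channels);
    for (std::size_t p = 0; p < height * width; ++p) {      // pixel index
        for (std::size_t c = 0; c < channels; ++c) {        // output channel
            const std::size_t sc = swap_rb ? (channels - 1 - c) : c;
            dst[p * channels + c] = static_cast<float>(src[p * channels + sc]);
        }
    }
    return dst;
}
```

With OpenCV itself, the same effect could likely be had by calling cv::cvtColor for the channel swap and cv::Mat::convertTo with CV_32FC3 for the type conversion on a continuous Mat, then copying from its data pointer, instead of going through blobFromImage.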

Thanks a lot!

kevinthesun · June 26, 2019, 9:44pm

TVM internally uses NCHW for optimization. However, you don’t need to transpose the data layout manually, since the frontend converter should take care of that. For example, the TensorFlow converter will insert a transpose for NHWC layout input.
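
To make concrete what such a transpose does to the data, here is a minimal sketch of the NHWC to NCHW index mapping in plain C++. It only illustrates the layout change; it is not TVM's implementation, and `nhwc_to_nchw` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reorder a contiguous NHWC tensor into NCHW.
std::vector<float> nhwc_to_nchw(const std::vector<float>& src, std::size_t n,
                                std::size_t h, std::size_t w, std::size_t c) {
    assert(src.size() == n * h * w * c);
    std::vector<float> dst(src.size());
    for (std::size_t ni = 0; ni < n; ++ni)
        for (std::size_t hi = 0; hi < h; ++hi)
            for (std::size_t wi = 0; wi < w; ++wi)
                for (std::size_t ci = 0; ci < c; ++ci) {
                    const std::size_t from = ((ni * h + hi) * w + wi) * c + ci;  // NHWC offset
                    const std::size_t to = ((ni * c + ci) * h + hi) * w + wi;    // NCHW offset
                    dst[to] = src[from];
                }
    return dst;
}
```

For a 1 x 1 x 2 x 3 input holding the pixels (1, 2, 3) and (4, 5, 6), the result groups values by channel: 1, 4, 2, 5, 3, 6.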

dolphintear · June 27, 2019, 1:58am

Thanks! I used the NHWC layout when doing the autotune on the x86 CPU, with input {1, 256, 256, 3}. Now that I have the resulting .so, .params, and .json files, I want to deploy in C++ with these three files. So the deployment may need to use the same layout as the one used for the autotune on the x86 CPU.

When I set input_shape = (1, 3, 256, 256) for net, params = relay.frontend.from_tensorflow(graph_def, layout=layout, shape={'preprocess/truediv': input_shape}), the autotune failed with this error: Unable to unify parent types: TensorType([16, 256, 3, 3], float32) and TensorType([16, 3, 3, 3], float32). But when input_shape was changed to (1, 256, 256, 3), the autotune ran successfully. I then checked the autotune output in Python, preprocessing the input image as follows:
img = cv2.resize(img, (256, 256), interpolation=cv2.INTER_CUBIC)
img_array_256 = np.array(img)
tvm_input = img_array_256.reshape(1, 256, 256, 3)
module.set_input('preprocess/truediv', tvm_input)
In this way, I got the right output from the .so after the autotune.

Thanks a lot!

shixinli · June 27, 2019, 2:28am

Hello dolphintear,

I set input_shape to (1, 3, 256, 256) with layout NCHW, and I can successfully get the net and params. tune_kernels succeeds, but tune_graph fails:

Traceback (most recent call last):
  File "tune_relay_x86.py", line 225, in <module>
    tune_and_evaluate(tuning_option)
  File "tune_relay_x86.py", line 199, in tune_and_evaluate
    tune_graph(net, data_shape, log_file, graph_opt_sch_file)
  File "tune_relay_x86.py", line 172, in tune_graph
    executor.run()
  File "/home/lisas/tvm/tvm/python/tvm/autotvm/graph_tuner/dynamic_programming_tuner.py", line 188, in run
    self._backward()
  File "/home/lisas/tvm/tvm/python/tvm/autotvm/graph_tuner/dynamic_programming_tuner.py", line 93, in _backward
    num_states = states_list[0][3].size
IndexError: list index out of range

I am looking forward to your reply.

best
lisa shi

dolphintear · June 27, 2019, 3:58am

Did you run a model from TensorFlow? There are some errors in the TensorFlow frontend file: python/tvm/relay/frontend/tensorflow.py

dolphintear · June 27, 2019, 4:30am

I had a similar error before and bypassed it, because tune_graph is only used to get an extra speed-up on top of the autotune.
I used the regular way instead:
#tune_graph(net, input_shape, log_file, graph_opt_sch_file)
tune_kernels(tasks, **tuning_option)
#with autotvm.apply_graph_best(graph_opt_sch_file):
with autotvm.apply_history_best(log_file):

kevinthesun · June 27, 2019, 10:02pm

You need to keep the input shape as in the original model but change the layout to “NCHW”. By default the TF converter assumes the layout is “NHWC”; if you specify the layout as “NCHW”, it will insert transposes before and after operators.

kevinthesun · June 27, 2019, 10:28pm

Hi, for this issue what model are you using? Can you print states_list to see what it is? Looks like it is an empty list. If this is the case, can you check what output_idx_list looks like?

kevinthesun · June 28, 2019, 12:29am

The root cause of this graph tuning issue is that the TensorFlow converter currently inserts transpose operators everywhere when the layout is set to “NCHW”. This makes layout transformations unavoidable, so graph tuning won’t help. I’ll file a fix to make the graph tuner still return the best possible result. This best result should have no performance difference from “apply_history_best”.

dolphintear · June 28, 2019, 1:02am

Thanks very much, @kevinthesun.

I have a question about the data type. I used a uint8 input for the network when doing the auto-tuning. When doing the deployment in C++, should I set int dtype_code = kDLUInt for the following call? TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input)
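
For context, DLPack describes a dtype as a (code, bits, lanes) triple: a uint8 tensor would be kDLUInt, 8, 1, and a float32 tensor kDLFloat, 32, 1. The host buffer copied in with TVMArrayCopyFromBytes then has to hold the same element type and byte count. A minimal sketch of that size computation follows, assuming a bit width that is a multiple of 8. `tensor_num_bytes` is a hypothetical helper, not a TVM function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes occupied by a dense tensor with the
// given shape and a dtype described by (bits, lanes), as in DLPack.
std::size_t tensor_num_bytes(const std::int64_t* shape, int ndim,
                             int dtype_bits, int dtype_lanes) {
    assert(dtype_bits > 0 && dtype_bits % 8 == 0);  // sub-byte types not handled
    std::size_t count = 1;
    for (int i = 0; i < ndim; ++i) {
        assert(shape[i] >= 0);
        count *= static_cast<std::size_t>(shape[i]);
    }
    return count * static_cast<std::size_t>(dtype_bits / 8) *
           static_cast<std::size_t>(dtype_lanes);
}
```

For the shape {1, 256, 256, 3}, this gives 196608 bytes for uint8 and 786432 bytes for float32, so a float blob (such as the CV_32F output of blobFromImage) would not match an input allocated as uint8.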

thanks a lot!

shixinli · July 1, 2019, 7:41am

Hi kevinthesun,
thank you for your help.
I need to set the layout to NCHW because the TensorFlow model has a transpose layer as its first layer. I have successfully run the model with from_tensorflow.py.

best
lisa shi

shixinli · July 1, 2019, 7:47am

Hi dolphintear,
thank you for your help.
First, I have successfully run the model with from_tensorflow.py. Second, I successfully ran the model with tune_model_x86.py with your help. But the inference time is too long. Can tune_graph shorten the inference time?

best
lisa shi