A deploy problem after getting .so from autotune on x86 cpu using C++

Hi @kevinthesun,
Everything runs OK, except that the output is wrong.
cv::Mat tensor = cv::dnn::blobFromImage(inputImageAligned,1.0,cv::Size(256,256),cv::Scalar(0,0,0),true);
constexpr int device_type = kDLCPU;
constexpr int device_id = 0;
constexpr int in_ndim = 4;
//const int64_t in_shape[in_ndim] = {1, 3, 256, 256};
const int64_t in_shape[in_ndim] = {1, 256, 256, 3};  // I autotuned with the NHWC layout, input_shape = (1, 256, 256, 3). Or is in_shape set incorrectly for deployment?
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);
TVMArrayCopyFromBytes(input, tensor.data, 256 * 3 * 256 * 4);
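
The byte count passed to TVMArrayCopyFromBytes has to match the allocated tensor exactly, so it may be safer to derive it from the shape and dtype than to hard-code it. A minimal sketch, assuming a dense tensor; TensorNumBytes is a hypothetical helper, not a TVM function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of TVM): number of bytes occupied by a dense
// tensor with the given shape, element width in bits, and lane count.
std::size_t TensorNumBytes(const std::vector<std::int64_t>& shape,
                           int dtype_bits, int dtype_lanes) {
  // Only whole-byte element types are handled in this sketch.
  assert(dtype_bits > 0 && dtype_bits % 8 == 0 && dtype_lanes > 0);
  std::size_t num_elements = 1;
  for (std::int64_t dim : shape) {
    assert(dim >= 0);
    num_elements *= static_cast<std::size_t>(dim);
  }
  return num_elements * static_cast<std::size_t>(dtype_bits / 8) *
         static_cast<std::size_t>(dtype_lanes);
}
```

For the float32 input above, TensorNumBytes({1, 256, 256, 3}, 32, 1) gives 786432, the same value as 256 * 3 * 256 * 4.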

In Python, after autotuning, I got the correct output as follows:
tvm_output = module.get_output(0, tvm.nd.empty((65536, 2), 'float32'))
tvm_output_to_numpy = tvm_output.asnumpy()
mask_1 = tvm_output_to_numpy[:,1].reshape(256,256)
mask_2 = tvm_output_to_numpy[:,0].reshape(256,256)

Could I get the output in the right format for C++ deployment in the following way?
tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
tvm::runtime::NDArray res = get_output(0);
cv::Mat vector(65536,2,CV_32F);
memcpy(vector.data, res->data, 65536 * 4 * 2);
cv::Mat mask = vector.reshape(2, 256).clone();
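
For comparison with the Python check (tvm_output_to_numpy[:,0] and [:,1]), here is a minimal sketch of the same column split in plain C++. It assumes the output buffer is contiguous, row-major float32 with shape (65536, 2); SplitTwoColumns is a hypothetical helper, not an OpenCV or TVM function:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split a row-major (rows x 2) float buffer into its two
// columns. first = column 0, second = column 1.
std::pair<std::vector<float>, std::vector<float>> SplitTwoColumns(
    const float* data, std::size_t rows) {
  assert(data != nullptr);
  std::vector<float> col0(rows);
  std::vector<float> col1(rows);
  for (std::size_t i = 0; i < rows; ++i) {
    col0[i] = data[2 * i];      // same element as tvm_output[i, 0]
    col1[i] = data[2 * i + 1];  // same element as tvm_output[i, 1]
  }
  return {col0, col1};
}
```

With rows = 65536, each returned vector holds 256 * 256 values in row-major order, so it can be viewed as one 256x256 mask.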

thanks a lot!

The problem may be caused by cv::dnn::blobFromImage, which returns the blob in NCHW layout: {1, 3, 256, 256}. Is there an efficient, low-overhead way to convert the HWC image to NHWC using OpenCV in C++?
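
An 8-bit, 3-channel cv::Mat stores its pixels interleaved (HWC), and with a batch size of 1 that is already the NHWC memory order, so no transpose should be needed; only the type conversion and the optional channel swap remain. A minimal sketch in plain C++, assuming the image has already been resized to the target size and its pixel data is contiguous; InterleavedToNhwcFloat is a hypothetical helper, not an OpenCV function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert interleaved 8-bit, 3-channel pixels (HWC) into
// a float32 buffer in NHWC order with N = 1. swap_rb reverses the channel
// order (BGR <-> RGB), and scale multiplies every value.
std::vector<float> InterleavedToNhwcFloat(const std::vector<std::uint8_t>& hwc,
                                          std::size_t height, std::size_t width,
                                          bool swap_rb, float scale) {
  constexpr std::size_t kChannels = 3;
  assert(hwc.size() == height * width * kChannels);
  std::vector<float> out(hwc.size());
  for (std::size_t pixel = 0; pixel < height * width; ++pixel) {
    for (std::size_t c = 0; c < kChannels; ++c) {
      const std::size_t src_c = swap_rb ? (kChannels - 1 - c) : c;
      out[pixel * kChannels + c] =
          scale * static_cast<float>(hwc[pixel * kChannels + src_c]);
    }
  }
  return out;
}
```

Calling it with swap_rb = true and scale = 1.0f corresponds to the swapRB and scalefactor arguments used in the blobFromImage call above, but keeps the channels interleaved.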

Thanks a lot!

TVM internally uses NCHW for optimization. However, you don’t need to transpose the data layout manually, since the frontend converter should take care of that. For example, the TensorFlow converter will insert a transpose for NHWC layout input.
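
To make the layout difference concrete, here is a minimal sketch of what an NHWC to NCHW transpose does to the element order. It is an illustration only, not TVM's implementation; NhwcToNchw is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reorder a dense NHWC buffer into NCHW order.
std::vector<float> NhwcToNchw(const std::vector<float>& src, std::size_t n,
                              std::size_t h, std::size_t w, std::size_t c) {
  assert(src.size() == n * h * w * c);
  std::vector<float> dst(src.size());
  for (std::size_t in = 0; in < n; ++in) {
    for (std::size_t ih = 0; ih < h; ++ih) {
      for (std::size_t iw = 0; iw < w; ++iw) {
        for (std::size_t ic = 0; ic < c; ++ic) {
          // NHWC index: channel varies fastest. NCHW index: width varies fastest.
          dst[((in * c + ic) * h + ih) * w + iw] =
              src[((in * h + ih) * w + iw) * c + ic];
        }
      }
    }
  }
  return dst;
}
```

For two pixels with three channels each, {1, 2, 3, 4, 5, 6} in NHWC becomes {1, 4, 2, 5, 3, 6} in NCHW: the values are grouped per channel instead of per pixel.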

Thanks! I used the NHWC layout when autotuning on the x86 CPU, with input {1, 256, 256, 3}. After getting the resulting .so, .params, and .json files, I want to deploy in C++ with these three files. So the deployment probably needs the same layout as the one used for autotuning on the x86 CPU.

When I set input_shape = (1, 3, 256, 256) for net, params = relay.frontend.from_tensorflow(graph_def, layout=layout, shape={'preprocess/truediv': input_shape}), autotuning fails with this error: Unable to unify parent types: TensorType([16, 256, 3, 3], float32) and TensorType([16, 3, 3, 3], float32). But when input_shape is changed to (1, 256, 256, 3), autotuning runs successfully. To check the autotuned output in Python, I preprocessed the input image as follows:
img = cv2.resize(img, (256, 256), interpolation=cv2.INTER_CUBIC)
img_array = np.array(img)
tvm_input = img_array.reshape(1, 256, 256, 3)
module.set_input('preprocess/truediv', tvm_input)
This way, I got the correct output from the .so after autotuning.

Thanks a lot!

Hello dolphinetear,

I set input_shape to (1, 3, 256, 256) and the layout to NCHW, and I can successfully get the net and params. Tuning the kernels succeeds, but tuning the graph fails:

Traceback (most recent call last):
  File "tune_relay_x86.py", line 225, in
    tune_and_evaluate(tuning_option)
  File "tune_relay_x86.py", line 199, in tune_and_evaluate
    tune_graph(net, data_shape, log_file, graph_opt_sch_file)
  File "tune_relay_x86.py", line 172, in tune_graph
    executor.run()
  File "/home/lisas/tvm/tvm/python/tvm/autotvm/graph_tuner/dynamic_programming_tuner.py", line 188, in run
    self._backward()
  File "/home/lisas/tvm/tvm/python/tvm/autotvm/graph_tuner/dynamic_programming_tuner.py", line 93, in _backward
    num_states = states_list[0][3].size
IndexError: list index out of range

I am looking forward to your reply.

best
lisa shi

Did you run the model from TensorFlow? There are some errors in the TensorFlow frontend file: python/tvm/relay/frontend/tensorflow.py

I had a similar error before and bypassed it, because tune_graph is only an extra optimization step on top of kernel autotuning.
I used the regular path instead:
#tune_graph(net, input_shape, log_file, graph_opt_sch_file)
tune_kernels(tasks, **tuning_option)
#with autotvm.apply_graph_best(graph_opt_sch_file):
with autotvm.apply_history_best(log_file):

You need to keep the input shape as in the original model but change the layout to "NCHW". By default the TF converter assumes the layout is "NHWC"; if you specify the layout as "NCHW", it will insert transposes before and after operators.

Hi, which model are you using for this issue? Can you print states_list to see what it contains? It looks like an empty list. If that is the case, can you check what output_idx_list looks like?

The root cause of this graph tuning issue is that the TensorFlow converter currently inserts transpose operators everywhere when the layout is set to "NCHW". This makes layout transformations unavoidable, so graph tuning won’t help. I’ll file a fix to make the graph tuner still return the best possible result. This best result should show no performance difference from "apply_history_best".

Thanks very much, @kevinthesun.

I have a question about the data type. I used uint8 input for the network during autotuning. For deployment in C++, should I set int dtype_code = kDLUInt for the following call? TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input)
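
TVMArrayAlloc takes the element type as three integers: a type code, a bit width, and a lane count. A minimal sketch of how a dtype string maps to that triple, assuming the DLPack type code values (kDLInt = 0, kDLUInt = 1, kDLFloat = 2) and a scalar type (one lane); DTypeTriple and ParseDType are hypothetical names, not part of TVM:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of the three dtype arguments of TVMArrayAlloc.
struct DTypeTriple {
  int code;   // 0 = signed int, 1 = unsigned int, 2 = float (DLPack values)
  int bits;   // width of one element in bits
  int lanes;  // 1 for scalar types
};

// Hypothetical helper: parse names such as "uint8", "int32", or "float32".
DTypeTriple ParseDType(const std::string& name) {
  int code = -1;
  std::size_t prefix_len = 0;
  // "uint" must be tested before "int" only for clarity; both are prefix checks.
  if (name.rfind("uint", 0) == 0) {
    code = 1;
    prefix_len = 4;
  } else if (name.rfind("int", 0) == 0) {
    code = 0;
    prefix_len = 3;
  } else if (name.rfind("float", 0) == 0) {
    code = 2;
    prefix_len = 5;
  }
  assert(code >= 0 && name.size() > prefix_len);
  const int bits = std::stoi(name.substr(prefix_len));
  return {code, bits, 1};
}
```

Under these assumptions, a uint8 input corresponds to code 1 (kDLUInt), 8 bits, and 1 lane. The byte count passed to TVMArrayCopyFromBytes would then be one byte per element, and the source buffer would need to hold uint8 data rather than float32.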

thanks a lot!

Hi kevinthesun,
Thank you for your help.
I need to set the layout to NCHW because the TensorFlow model has a transpose layer as its first layer. I have successfully run the model with from_tensorflow.py.

best
lisa shi

Hi dolphintear,
Thank you for your help.
First, I have successfully run the model with from_tensorflow.py. Second, I successfully ran the model with tune_model_x86.py with your help, but the inference time is too long. Can tune_graph shorten the inference time?

best
lisa shi