Why is the GPU over 4 times slower than the CPU on RK3399?

I have auto-tuned the same network on the RK3399 for both the CPU and the GPU.
On the CPU I get "Mean inference time (std dev): 713.19 ms (3.16 ms)", while on the GPU I get
"Mean inference time (std dev): 3995.28 ms (15.74 ms)", which is about 5.6 times slower.

Auto-tuning for the GPU produces the following compile warnings:
WARNING:autotvm:Cannot find config for target=opencl -device=mali, workload=('conv2d_transpose_nchw', (1, 96, 32, 32, 'float32'), (96, 32, 4, 4, 'float32'), (2, 2), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

384
Tensor(shape=[1, 384, 32, 32], op.name=compute)

192
Tensor(shape=[1, 192, 32, 32], op.name=compute)

576
Tensor(shape=[1, 576, 32, 32], op.name=compute)
WARNING:autotvm:Cannot find config for target=opencl -device=mali, workload=('conv2d_transpose_nchw', (1, 160, 16, 16, 'float32'), (160, 32, 4, 4, 'float32'), (2, 2), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
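
To see at a glance which workloads fall back to an untuned configuration, the warnings can be collected from the log with a small script. This is only a minimal sketch in plain Python and is not part of TVM: the function name `untuned_workloads` and the regular expression are my own assumptions, based purely on the warning format shown above.

```python
import ast
import re

# Assumed pattern, derived from the warning text in this post (not a TVM API).
WARNING_RE = re.compile(
    r"Cannot find config for target=(?P<target>.+?), "
    r"workload=(?P<workload>\(.*\))\. A fallback"
)


def untuned_workloads(log_text):
    """Return unique (target, workload) pairs in order of first appearance."""
    # Some forums replace ASCII quotes with typographic ones; undo that first.
    log_text = log_text.replace("\u2018", "'").replace("\u2019", "'")
    found = []
    for match in WARNING_RE.finditer(log_text):
        # The workload is printed as a Python tuple literal, so parse it safely.
        workload = ast.literal_eval(match.group("workload"))
        entry = (match.group("target"), workload)
        if entry not in found:
            found.append(entry)
    return found


# The two warnings from this post, used as sample input.
sample = (
    "WARNING:autotvm:Cannot find config for target=opencl -device=mali, "
    "workload=('conv2d_transpose_nchw', (1, 96, 32, 32, 'float32'), "
    "(96, 32, 4, 4, 'float32'), (2, 2), (1, 1), 'float32'). "
    "A fallback configuration is used, which may bring great performance regression.\n"
    "WARNING:autotvm:Cannot find config for target=opencl -device=mali, "
    "workload=(\u2018conv2d_transpose_nchw\u2019, (1, 160, 16, 16, \u2018float32\u2019), "
    "(160, 32, 4, 4, \u2018float32\u2019), (2, 2), (1, 1), \u2018float32\u2019). "
    "A fallback configuration is used, which may bring great performance regression.\n"
)

for target, workload in untuned_workloads(sample):
    # workload[0] is the operator name, workload[1] the input shape and dtype.
    print(target, workload[0], workload[1])
```

In this log both untuned workloads are `conv2d_transpose_nchw` on `opencl -device=mali`, so that operator is the one running with a fallback configuration on the GPU.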