Why do we need an unpack stage in the schedule of direct conv2d for Mali?

Can I merge the conv and output stages into one stage to save one OpenCL kernel?
Like this:

# ci, kh, kw are the same reduce_axis variables as in the original schedule
conv = tvm.compute(
    (N, CO, OH, OW),
    lambda n, co, h, w: tvm.sum(
        data_vec[n, h // VH, w // VW, ci, (h % VH) * HSTR + kh, (w % VW) * WSTR + kw].astype(out_dtype) *
        kernel_vec[co // VC, ci, kh, kw, co % VC].astype(out_dtype),
        axis=[ci, kh, kw]),
    name='conv')

with this schedule:

n, c, h, w = s[conv].op.axis
c, vc = s[conv].split(c, VC)
h, vh = s[conv].split(h, VH)
w, vw = s[conv].split(w, VW)
s[conv].reorder(n, c, h, w, vh, vw, vc)

The rest is the same as tvm/topi/python/topi/mali/conv2d.py.

And can I change the schedule of the data_vec stage to
s[data_vec].compute_inline()
or
s[data_vec].compute_at(s[conv], iterVar)

Why do we need to pre-pack the input data into a data_vec tensor? Is this because pre-packing the input data reduces the if-else branches in the conv stage?
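
(For reference, the packing stage in topi/mali/conv2d.py follows the spatial-pack pattern and looks roughly like the sketch below; the shape names here are my assumptions for illustration, not the exact source.)

# Rough sketch of the spatial-pack data_vec stage (names/shapes assumed).
# Padding is resolved once through data_pad here, so the conv stage later
# reads data_vec with plain affine indices and no if-else in its inner loops.
dvshape = (N, OH // VH, OW // VW, CI, VH * HSTR + KH - 1, VW * WSTR + KW - 1)
data_vec = tvm.compute(
    dvshape,
    lambda n, h, w, ci, vh, vw:
        data_pad[n, ci, h * VH * HSTR + vh, w * VW * WSTR + vw],
    name='data_vec')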

But on some Mali GPUs the warp size is 1. Can we use compute_inline or compute_at to eliminate the data-packing kernel without decreasing performance?

This is for vectorization. You cannot merge them, because we have to make c the innermost dimension for vectorization, while in the final NCHW output w is innermost; the unpack (output) stage is what converts the vectorized intermediate back to NCHW.
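
A minimal sketch of the contrast (axis names assumed, following the packed layout (N, CO//VC, OH//VH, OW//VW, VH, VW, VC) that the existing schedule uses):

# Packed intermediate: vc is the innermost dimension of conv's memory layout,
# so vectorizing vc gives contiguous vector loads/stores of width VC.
_, co, oh, ow, vh, vw, vc = s[conv].op.axis
s[conv].vectorize(vc)

# Merged version: conv has shape (N, CO, OH, OW). Even with the loop order
# (n, c, h, w, vh, vw, vc), consecutive vc values write to
# conv[n, c*VC + vc, h, w], i.e. addresses OH*OW elements apart, so the
# backend cannot emit a contiguous vector store. The separate output stage
# exists precisely to unpack the vectorized intermediate back to NCHW.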