Will "global" cache_read be precomputed for constant input?

For example, in the conv2d cases, I’d like to cache_read the kernel weights into another global buffer and do some tiling there before the computation. Will the cache_read stage be precomputed in this case?

Is this in the context of VTA? Or are you targeting another backend?

I am now targeting the arm_cpu backend.

I’m not sure I understand your question: could you be more specific about what you mean by pre-computing the cache_read?

Please see the usage in https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/depthwise_conv2d.py#L62
I am just curious whether the kernel packing can be computed in advance (it should be possible, since the kernel weights are constant), or whether it must be computed at runtime.

It won’t be computed in advance, because the packing is part of a single TVM op, while precompute is an NNVM pass that works on NNVM symbols.
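To illustrate the problem (a plain-NumPy sketch, not the TVM API; `conv_like_fused` and the transpose-based "packing" are hypothetical stand-ins): when the weight packing lives inside the same op as the computation, it is re-executed on every call, even though the weights never change.

```python
import numpy as np

def conv_like_fused(x, w):
    # "cache_read"-style packing of the constant weights, but because it is
    # part of the same function (op), it runs again on every invocation
    w_packed = np.ascontiguousarray(w.T)
    # the actual computation, reading from the packed buffer
    return x @ w_packed

x = np.ones((2, 3))
w = np.arange(12.0).reshape(4, 3)
y = conv_like_fused(x, w)  # packing cost is paid at runtime, per call
```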

So what we do is separate this conv into two NNVM symbols: one does the convolution and the other does the weight transform. The weight-transform symbol can then be pre-computed.

We do this by registering alter_op_layout in NNVM. See the related code for conv2d.
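The split described above can be sketched in plain NumPy (not NNVM's actual API; `weight_transform` and `conv_like_compute` are illustrative names, and the transpose stands in for the real packing): because the transform depends only on the constant weights, a graph-level precompute/constant-folding pass can evaluate it once ahead of time, leaving only the compute step at runtime.

```python
import numpy as np

def weight_transform(w):
    # hypothetical packing step, e.g. transposing into a friendlier layout;
    # its only input is the constant weight tensor
    return np.ascontiguousarray(w.T)

def conv_like_compute(x, w_packed):
    # the runtime op consumes the already-packed weights
    return x @ w_packed

x = np.ones((2, 3))
w = np.arange(12.0).reshape(4, 3)

w_packed = weight_transform(w)       # foldable at compile time: constant input
y = conv_like_compute(x, w_packed)   # only this runs at inference time
```

The key design point is that the graph pass never needs to understand the packing itself; it only needs to see that one symbol's inputs are all constants.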

@merrymercy The layout altering will not happen in the autotvm path, so we need to use debug_skip_region to exclude it from the tuning-time measurement. Is my understanding correct?

You are right.

Great, thanks a lot for the clarification.