I have a small test case where I do a 2-D matrix addition:
A = tvm.placeholder((4,16), name='input1')
B = tvm.placeholder((4,16), name='input2')
C = tvm.compute((4,16),
    lambda m,n: A[m,n] + B[m,n])
I tile on the leading dimension by 2 and parallelize as follows:
m,n = s[C].op.axis
m_outer,m_inner = s[C].split(m,factor=2)
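To make the tiling concrete, here is a minimal pure-Python sketch of how a split with factor=2 maps the original row index m onto (m_outer, m_inner). It only models the index arithmetic m = m_outer * factor + m_inner; the helper name `split_axis` is hypothetical and is not part of TVM.

```python
def split_axis(extent, factor):
    """Model of split(axis, factor=...): map each outer index to the
    original indices it covers (hypothetical helper, not a TVM API)."""
    n_outer = (extent + factor - 1) // factor  # ceiling division
    return {
        outer: [
            outer * factor + inner
            for inner in range(factor)
            if outer * factor + inner < extent  # guard for uneven splits
        ]
        for outer in range(n_outer)
    }


# Leading dimension of the (4, 16) matrices, tiled by 2.
rows_per_outer = split_axis(extent=4, factor=2)
print(rows_per_outer)  # {0: [0, 1], 1: [2, 3]}
```

If the outer axis is the one being parallelized, each outer index here corresponds to one unit of work that a thread may pick up, so the second unit should start at row 2.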
I then match the inner matrix add with a tensorize routine:
I also add my own C function that performs the matrix add as a packed func.
I print the address of the pointer passed into the routine. When there is a single thread (TVM_NUM_THREADS=1), the pointers are updated correctly before being passed to the function.
But when there are 2 or more threads, the input pointer is not offset for threads other than the first one.
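For comparison with the printed addresses, this is a rough sketch of the byte offsets I would expect each tile to receive relative to the buffer base. It assumes float32 elements (4 bytes, the default dtype of tvm.placeholder), a compact row-major layout, and that the tensorized routine is called once per m_outer tile of 2 x 16 elements. The helper name `expected_tile_offsets` is made up for illustration.

```python
def expected_tile_offsets(rows, cols, factor, itemsize=4):
    """Byte offset from the buffer base at which each m_outer tile starts.

    Assumes a compact row-major buffer; itemsize=4 assumes float32.
    Hypothetical helper, not a TVM API.
    """
    n_outer = (rows + factor - 1) // factor  # number of tiles along rows
    return [outer * factor * cols * itemsize for outer in range(n_outer)]


# (4, 16) matrices tiled by 2 on the leading dimension.
offsets = expected_tile_offsets(rows=4, cols=16, factor=2)
for outer, off in enumerate(offsets):
    print(f"m_outer={outer}: base + {off} bytes ({hex(off)})")
```

Under those assumptions the second tile should start 128 bytes (0x80) past the base pointer, regardless of which thread runs it. Seeing the unmodified base address in the second thread would match the behavior described above.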