# Problem about generating matrix multiply CUDA code

#1

Hello, I am doing matrix multiplication in TVM, that is, P = M*N, where M is (N0, K) and N is (K, M0), and I would like TVM to generate code similar to what I write by hand:

```cuda
#define TILE_WIDTH 16

__global__ void cut3_on_gpu(float* M, float* N, float* P, int N0, int K, int M0)
{
    // Shared memory tiles used by the block
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;

    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < (K - 1) / TILE_WIDTH + 1; ++p) {
        // Collaborative loading: each thread loads one element of each tile,
        // padding with zero when the tile runs past the matrix boundary
        if (Row < N0 && p * TILE_WIDTH + tx < K)
            ds_M[ty][tx] = M[Row * K + p * TILE_WIDTH + tx];
        else
            ds_M[ty][tx] = 0.0f;
        if (p * TILE_WIDTH + ty < K && Col < M0)
            ds_N[ty][tx] = N[(p * TILE_WIDTH + ty) * M0 + Col];
        else
            ds_N[ty][tx] = 0.0f;
        __syncthreads();  // wait until the whole tile is loaded

        // Zero padding makes out-of-range products contribute nothing,
        // so every thread can accumulate unconditionally
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();  // wait before the tiles are overwritten
    }

    if (Row < N0 && Col < M0)
        P[Row * M0 + Col] = Pvalue;
}
```

But I am not sure how to express the collaborative loading into shared memory and the repeated per-tile computation in TVM. Can anyone offer some solutions or suggestions about this? Thanks!
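For reference, the tiling scheme in the kernel above can be checked on the host side. This is a plain NumPy sketch (with an illustrative TILE_WIDTH of 4, not taken from the kernel) that mimics the per-tile zero-padded loads and the repeated accumulation, and agrees with a direct matrix product:

```python
import numpy as np

TILE_WIDTH = 4  # illustrative tile size

def tiled_matmul(M, N):
    """Host-side model of the tiled accumulation in the CUDA kernel."""
    N0, K = M.shape
    K2, M0 = N.shape
    assert K == K2
    P = np.zeros((N0, M0), dtype=M.dtype)
    num_tiles = (K - 1) // TILE_WIDTH + 1
    for p in range(num_tiles):
        # "Load" one K-slice of each operand, zero-padded at the boundary,
        # just as each thread block fills ds_M and ds_N
        ds_M = np.zeros((N0, TILE_WIDTH), dtype=M.dtype)
        ds_N = np.zeros((TILE_WIDTH, M0), dtype=N.dtype)
        k0 = p * TILE_WIDTH
        w = min(TILE_WIDTH, K - k0)
        ds_M[:, :w] = M[:, k0:k0 + w]
        ds_N[:w, :] = N[k0:k0 + w, :]
        # Accumulate the partial products for this tile
        P += ds_M @ ds_N
    return P

M = np.random.rand(5, 7).astype(np.float32)
N = np.random.rand(7, 6).astype(np.float32)
print(np.allclose(tiled_matmul(M, N), M @ N, atol=1e-5))
```

Since the padded rows and columns are zero, the partial products over all tiles sum to exactly the full dot product, which is why the kernel can drop the bounds check inside the accumulation loop.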

#2

#3

Thanks, I am looking at the code, but I am stuck. Would you mind explaining the function `compute_at`? I find there are very few comments in TVM, only:

``````    """Attach the stage at parent's scope
``````
``````    Parameters
----------
parent : Stage
The parent stage

scope : IterVar
The loop scope t be attached to.
"""``````

#4

It can be used to attach a stage to an inner loop of another stage. You can see the difference by printing the lowered IR.
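A minimal sketch of what `compute_at` changes in the lowered loop structure, using plain Python loops to stand in for the IR and made-up stages B = A*2 and C = B+1:

```python
# Hypothetical producer/consumer pair: B = A * 2, C = B + 1
n = 4
A = [1, 2, 3, 4]

# Without compute_at: B is materialized in its own loop before C runs
B = [0] * n
C1 = [0] * n
for i in range(n):
    B[i] = A[i] * 2
for i in range(n):
    C1[i] = B[i] + 1

# With s[B].compute_at(s[C], C.op.axis[0]): B is computed inside C's loop,
# so each B value lives only within one iteration of i
C2 = [0] * n
for i in range(n):
    b = A[i] * 2   # B computed at the scope of loop i
    C2[i] = b + 1

print(C1 == C2)  # same result; only the loop structure differs
```

This is also how the shared-memory tiles in the kernel above arise: the stages that load the tiles are attached (via `compute_at`) inside the reduction loop over tiles, so they are refilled once per tile iteration.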