Problem generating matrix-multiply CUDA code

Hello, I am doing matrix multiplication in TVM, that is, P = M*N, where M has shape (N0, K) and N has shape (K, M0), and I wish TVM could generate code similar to what I would write by hand:

__global__ void cut3_on_gpu(float* M, float* N, float* P, int N0, int K, int M0)
{
    // TILE_WIDTH is assumed to be a compile-time constant, e.g. #define TILE_WIDTH 16
    // Shared-memory tiles used by all threads in the block
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;

    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < (K - 1) / TILE_WIDTH + 1; ++p) {
        // Collaborative loading of M and N tiles into shared memory,
        // with boundary checks (out-of-range elements are zero-padded)
        if (Row < N0 && p * TILE_WIDTH + tx < K)
            ds_M[ty][tx] = M[Row * K + p * TILE_WIDTH + tx];
        else
            ds_M[ty][tx] = 0.0f;
        if (p * TILE_WIDTH + ty < K && Col < M0)
            ds_N[ty][tx] = N[(p * TILE_WIDTH + ty) * M0 + Col];
        else
            ds_N[ty][tx] = 0.0f;
        __syncthreads();

        if (Row < N0 && Col < M0) {
            for (int i = 0; i < TILE_WIDTH; ++i)
                Pvalue += ds_M[ty][i] * ds_N[i][tx];
        }
        __syncthreads();
    }

    if (Row < N0 && Col < M0)
        P[Row * M0 + Col] = Pvalue;
}

But I am not sure how to express the collaborative loading into shared memory and the repeated per-tile computation in TVM. Can anyone give me some solutions or suggestions for this problem? Thanks!
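For reference, the tiling and zero-padding logic I am aiming for can be modeled in plain NumPy (an illustrative check of the loop structure only, not TVM code; TILE and the sizes are arbitrary):

```python
import numpy as np

TILE = 4
N0, K, M0 = 6, 7, 5  # deliberately not multiples of TILE

M = np.random.rand(N0, K).astype(np.float32)
N = np.random.rand(K, M0).astype(np.float32)
P = np.zeros((N0, M0), dtype=np.float32)

# mirror of the kernel's p-loop: each K-tile is zero-padded at the boundary
for p in range((K - 1) // TILE + 1):
    ds_M = np.zeros((N0, TILE), dtype=np.float32)
    ds_N = np.zeros((TILE, M0), dtype=np.float32)
    w = min(TILE, K - p * TILE)          # valid width of this K-tile
    ds_M[:, :w] = M[:, p * TILE : p * TILE + w]
    ds_N[:w, :] = N[p * TILE : p * TILE + w, :]
    P += ds_M @ ds_N                      # partial products, as in Pvalue += ...

assert np.allclose(P, M @ N, atol=1e-4)
```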

You can bind the loading of shared data to multiple threads:

Thanks, I am looking at the code, but I am stuck. Would you mind explaining the function compute_at? I find there are few comments on it in TVM, only:

    """Attach the stage at parent's scope
    Parameters
    ----------
    parent : Stage
        The parent stage

    scope : IterVar
        The loop scope to be attached to.
    """

It can be used to attach a stage to an inner loop of another stage. You can see the difference by printing the lowered IR.