Hybrid script GPU schedule

How do I write a correct schedule for a simple hybrid script like this one:

    for i in range(batch_size):
        valid_count[i] = 0
        for j in bind('threadIdx.x', num_anchors):
            score = data[i, j, 1]
            if score > score_threshold:
                # compact boxes that pass the threshold toward the front
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, valid_count[i], k] = data[i, j, k]
                valid_count[i] += 1
            if j > valid_count[i]:
                # mark the remaining slots as invalid
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, j, k] = -1.0

Currently, the default schedule (https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/vision.py#L10), which works for ir_builder, doesn’t work for hybrid script; it fails with the “Did you forget to bind?” error shown below.
@were

Can you please paste the whole code instead of a snippet, so that I can replicate your issue quickly?

OK, I wrote a script for you to run. Here is the lowered IR and the errors I get:

produce hybrid_get_valid_counts_gpu {
  // attr [0] extern_scope = 0
  // attr [iter_var(threadIdx.y, Range(min=0, extent=6), threadIdx.y)] thread_extent = 6
  // attr [score] storage_scope = "global"
  allocate score[float32 * 1]
  // attr [iter_var(threadIdx.x, Range(min=0, extent=2500), threadIdx.x)] thread_extent = 2500
  for (i, 0, 1) {
    hybrid_get_valid_counts_gpu.v0[i] = 0
    score[0] = data[(((threadIdx.x + (i*2500))*6) + 1)]
    if ((0.000000f < score[0])) {
      hybrid_get_valid_counts_gpu.v1[((threadIdx.y + (i*15000)) + (hybrid_get_valid_counts_gpu.v0[i]*6))] = data[((threadIdx.y + (threadIdx.x*6)) + (i*15000))]
      hybrid_get_valid_counts_gpu.v0[i] = (hybrid_get_valid_counts_gpu.v0[i] + 1)
    }
    if ((hybrid_get_valid_counts_gpu.v0[i] < threadIdx.x)) {
      hybrid_get_valid_counts_gpu.v1[((threadIdx.y + (threadIdx.x*6)) + (i*15000))] = -1.000000f
    }
  }
}

tvm._ffi.base.TVMError: [20:57:22] ~/tvm/src/codegen/codegen_cuda.cc:242: Check failed: scope != "global" (global vs. global)
// attr [score] storage_scope = "global"
allocate score[float32 * 1]
produce hybrid_get_valid_counts_gpu {
  // attr [0] extern_scope = 0
  for (i, 0, 1) {
    hybrid_get_valid_counts_gpu.v0[i] = 0
    // attr [iter_var(threadIdx.y, Range(min=0, extent=6), threadIdx.y)] thread_extent = 6
    // attr [iter_var(threadIdx.x, Range(min=0, extent=2500), threadIdx.x)] thread_extent = 2500
    score[0] = data[((((i*2500) + threadIdx.x)*6) + 1)]
    if ((0.000000f < score[0])) {
      hybrid_get_valid_counts_gpu.v1[((((i*2500) + hybrid_get_valid_counts_gpu.v0[i])*6) + threadIdx.y)] = data[(((i*15000) + threadIdx.y) + (threadIdx.x*6))]
      hybrid_get_valid_counts_gpu.v0[i] = (hybrid_get_valid_counts_gpu.v0[i] + 1)
    }
    if ((hybrid_get_valid_counts_gpu.v0[i] < threadIdx.x)) {
      hybrid_get_valid_counts_gpu.v1[(((i*15000) + threadIdx.y) + (threadIdx.x*6))] = -1.000000f
    }
  }
}

ValueError: Direct host side access to device memory is detected in default_function. Did you forget to bind?

I’ve met this before. Local variables in Python get translated to a global variable in the IR, which is what trips the host-side access check. @were, maybe you have an idea on this: [Hybrid] variables on CUDA should have 'local' scope

OK, I now know the problem.
Leyuan uses a single variable inside a “bind” loop body.
This is bad in both current TVM and hybrid script.
In CUDA, a local variable would be lowered to a register, but TVM has no abstraction for CUDA registers.
Thus, using a single variable inside a bind body should be banned.
If you REALLY want to do it, you should allocate a scratchpad outside the bind body, as in the sketch below.
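
A minimal sketch of that workaround, assuming hybrid script’s allocate intrinsic with an explicit 'local' scope (buffer shape and names are illustrative, taken from the snippet above):

    # One-element scratchpad declared outside the bind body; with 'local'
    # scope it lowers to a per-thread register rather than a global buffer.
    score = allocate((1,), 'float32', 'local')
    for i in range(batch_size):
        valid_count[i] = 0
        for j in bind('threadIdx.x', num_anchors):
            score[0] = data[i, j, 1]  # was: score = data[i, j, 1]
            if score[0] > score_threshold:
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, valid_count[i], k] = data[i, j, k]
                valid_count[i] += 1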

@were Thanks a lot for solving the issue. I hit another error with your updated hybrid script when running the same script I linked above.

CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
 grid=(1,1,1),  block=(2500,6,1)
// func_name=default_function_kernel0
// CUDA Source
// -----------
extern "C" __global__ void default_function_kernel0( int* __restrict__ hybrid_get_valid_counts_gpu_v0,  float* __restrict__ data,  float* __restrict__ hybrid_get_valid_counts_gpu_v1) {
  float score[1];
  for (int i = 0; i < 1; ++i) {
    hybrid_get_valid_counts_gpu_v0[i] = 0;
    score[0] = data[((1 + (((int)threadIdx.x) * 6)) + (i * 15000))];
    if (0.000000e+00f < score[0]) {
      hybrid_get_valid_counts_gpu_v1[((((int)threadIdx.y) + (i * 15000)) + (hybrid_get_valid_counts_gpu_v0[i] * 6))] = data[((((int)threadIdx.y) + (((int)threadIdx.x) * 6)) + (i * 15000))];
      hybrid_get_valid_counts_gpu_v0[i] = (1 + hybrid_get_valid_counts_gpu_v0[i]);
    }
    if (hybrid_get_valid_counts_gpu_v0[i] < ((int)threadIdx.x)) {
      hybrid_get_valid_counts_gpu_v1[((((int)threadIdx.y) + (((int)threadIdx.x) * 6)) + (i * 15000))] = -1.000000e+00f;
    }
  }
}

It seems that hybrid script cannot detect max_num_threads (tvm.target.current_target(allow_none=False).max_num_threads) and cap the launch accordingly. In this case I’m launching too many threads: block=(2500,6,1) is 15000 threads per block, well over the usual 1024-per-block limit, hence CUDA_ERROR_INVALID_VALUE. How shall I bind only the maximum number of threads available on the device?
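
For reference, this is roughly the idiom the existing ir_builder kernels use to cap the launch; a sketch of the common TOPI pattern (not the exact vision.py code), reusing num_anchors from the snippet above:

    import tvm

    ib = tvm.ir_builder.create()
    max_threads = int(tvm.target.current_target(allow_none=False).max_num_threads)
    nthread_bx = (num_anchors + max_threads - 1) // max_threads  # ceil-div anchors over blocks
    tx = tvm.thread_axis("threadIdx.x")
    bx = tvm.thread_axis("blockIdx.x")
    ib.scope_attr(tx, "thread_extent", max_threads)
    ib.scope_attr(bx, "thread_extent", nthread_bx)
    tid = bx * max_threads + tx  # recover the per-anchor index
    with ib.if_scope(tid < num_anchors):
        pass  # per-anchor body goes here, guarded against the tail block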

Can you elaborate a little more?
To be explicit, what are you expecting to be injected into the IR?

I’m expecting blockIdx.x to be automatically bound to the correct thread_extent, since hybrid script doesn’t let the user query tvm.target.current_target(allow_none=False).max_num_threads.

Look at this PR! I added an intrinsic function, max_num_threads, to hybrid script.
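
A hedged sketch of the intended usage inside hybrid script (the exact signature is whatever lands in the PR; num_anchors comes from the snippet above):

    max_threads = max_num_threads()  # new intrinsic: queries the current target
    nthread_bx = (num_anchors + max_threads - 1) // max_threads  # ceil-div
    for bx in bind('blockIdx.x', nthread_bx):
        for tx in bind('threadIdx.x', max_threads):
            j = bx * max_threads + tx
            if j < num_anchors:  # guard the tail block
                pass  # per-anchor body as before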