Hybrid script GPU schedule

How do I write a correct schedule for a simple hybrid script like this one:

    for i in range(batch_size):
        valid_count[i] = 0
        for j in bind('threadIdx.x', num_anchors):
            score = data[i, j, 1]
            if score > score_threshold:
                # compact boxes that pass the threshold toward the front
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, valid_count[i], k] = data[i, j, k]
                valid_count[i] += 1
            if j > valid_count[i]:
                # mark the remaining slots as invalid
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, j, k] = -1.0

Currently, the default schedule (https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/vision.py#L10), which works for ir_builder, doesn’t work for hybrid script; it fails with the “Did you forget to bind?” error shown below.
@were

Can you please paste the whole code instead of a snippet, so that I can replicate your issue quickly?

OK, I wrote a script for you to run. Here is the lowered IR and the errors I get:

produce hybrid_get_valid_counts_gpu {
  // attr [0] extern_scope = 0
  // attr [iter_var(threadIdx.y, Range(min=0, extent=6), threadIdx.y)] thread_extent = 6
  // attr [score] storage_scope = "global"
  allocate score[float32 * 1]
  // attr [iter_var(threadIdx.x, Range(min=0, extent=2500), threadIdx.x)] thread_extent = 2500
  for (i, 0, 1) {
    hybrid_get_valid_counts_gpu.v0[i] = 0
    score[0] = data[(((threadIdx.x + (i*2500))*6) + 1)]
    if ((0.000000f < score[0])) {
      hybrid_get_valid_counts_gpu.v1[((threadIdx.y + (i*15000)) + (hybrid_get_valid_counts_gpu.v0[i]*6))] = data[((threadIdx.y + (threadIdx.x*6)) + (i*15000))]
      hybrid_get_valid_counts_gpu.v0[i] = (hybrid_get_valid_counts_gpu.v0[i] + 1)
    }
    if ((hybrid_get_valid_counts_gpu.v0[i] < threadIdx.x)) {
      hybrid_get_valid_counts_gpu.v1[((threadIdx.y + (threadIdx.x*6)) + (i*15000))] = -1.000000f
    }
  }
}

tvm._ffi.base.TVMError: [20:57:22] ~/tvm/src/codegen/codegen_cuda.cc:242: Check failed: scope != "global" (global vs. global)
// attr [score] storage_scope = "global"
allocate score[float32 * 1]
produce hybrid_get_valid_counts_gpu {
  // attr [0] extern_scope = 0
  for (i, 0, 1) {
    hybrid_get_valid_counts_gpu.v0[i] = 0
    // attr [iter_var(threadIdx.y, Range(min=0, extent=6), threadIdx.y)] thread_extent = 6
    // attr [iter_var(threadIdx.x, Range(min=0, extent=2500), threadIdx.x)] thread_extent = 2500
    score[0] = data[((((i*2500) + threadIdx.x)*6) + 1)]
    if ((0.000000f < score[0])) {
      hybrid_get_valid_counts_gpu.v1[((((i*2500) + hybrid_get_valid_counts_gpu.v0[i])*6) + threadIdx.y)] = data[(((i*15000) + threadIdx.y) + (threadIdx.x*6))]
      hybrid_get_valid_counts_gpu.v0[i] = (hybrid_get_valid_counts_gpu.v0[i] + 1)
    }
    if ((hybrid_get_valid_counts_gpu.v0[i] < threadIdx.x)) {
      hybrid_get_valid_counts_gpu.v1[(((i*15000) + threadIdx.y) + (threadIdx.x*6))] = -1.000000f
    }
  }
}

ValueError: Direct host side access to device memory is detected in default_function. Did you forget to bind?

I’ve met this before. Local variables in Python get translated to a global variable in the IR, which is what trips the host-side access check. @were, maybe you have an idea on this: [Hybrid] variables on CUDA should have 'local' scope

OK, I now know the problem.
Leyuan uses a single variable inside a “bind” loop body.
This is bad in both current TVM and hybrid script.
In CUDA, a local variable would be lowered to a register, but TVM has no abstraction for CUDA registers.
Thus, using a single variable inside a bind body should be banned.
If you REALLY want to do it, you should allocate a scratchpad outside the bind body, as in the sketch below.
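
A minimal sketch of that workaround, assuming hybrid script’s allocate intrinsic with an explicit 'local' scope (buffer shape and names are illustrative, taken from the snippet above):

    # One-element scratchpad declared outside the bind body; with 'local'
    # scope it lowers to a per-thread register rather than a global buffer.
    score = allocate((1,), 'float32', 'local')
    for i in range(batch_size):
        valid_count[i] = 0
        for j in bind('threadIdx.x', num_anchors):
            score[0] = data[i, j, 1]  # was: score = data[i, j, 1]
            if score[0] > score_threshold:
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, valid_count[i], k] = data[i, j, k]
                valid_count[i] += 1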

@were Thanks a lot for solving the issue. I hit another error with your updated hybrid script when running the same script I linked above.

CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
 grid=(1,1,1),  block=(2500,6,1)
// func_name=default_function_kernel0
// CUDA Source
// -----------
extern "C" __global__ void default_function_kernel0( int* __restrict__ hybrid_get_valid_counts_gpu_v0,  float* __restrict__ data,  float* __restrict__ hybrid_get_valid_counts_gpu_v1) {
  float score[1];
  for (int i = 0; i < 1; ++i) {
    hybrid_get_valid_counts_gpu_v0[i] = 0;
    score[0] = data[((1 + (((int)threadIdx.x) * 6)) + (i * 15000))];
    if (0.000000e+00f < score[0]) {
      hybrid_get_valid_counts_gpu_v1[((((int)threadIdx.y) + (i * 15000)) + (hybrid_get_valid_counts_gpu_v0[i] * 6))] = data[((((int)threadIdx.y) + (((int)threadIdx.x) * 6)) + (i * 15000))];
      hybrid_get_valid_counts_gpu_v0[i] = (1 + hybrid_get_valid_counts_gpu_v0[i]);
    }
    if (hybrid_get_valid_counts_gpu_v0[i] < ((int)threadIdx.x)) {
      hybrid_get_valid_counts_gpu_v1[((((int)threadIdx.y) + (((int)threadIdx.x) * 6)) + (i * 15000))] = -1.000000e+00f;
    }
  }
}

It seems that hybrid script cannot detect max_num_threads (tvm.target.current_target(allow_none=False).max_num_threads) and cap the launch accordingly. In this case I’m launching too many threads: block=(2500,6,1) is 15000 threads per block, well over the usual 1024-per-block limit, hence CUDA_ERROR_INVALID_VALUE. How shall I bind only the maximum number of threads available on the device?
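
For reference, this is roughly the idiom the existing ir_builder kernels use to cap the launch; a sketch of the common TOPI pattern (not the exact vision.py code), reusing num_anchors from the snippet above:

    import tvm

    ib = tvm.ir_builder.create()
    max_threads = int(tvm.target.current_target(allow_none=False).max_num_threads)
    nthread_bx = (num_anchors + max_threads - 1) // max_threads  # ceil-div anchors over blocks
    tx = tvm.thread_axis("threadIdx.x")
    bx = tvm.thread_axis("blockIdx.x")
    ib.scope_attr(tx, "thread_extent", max_threads)
    ib.scope_attr(bx, "thread_extent", nthread_bx)
    tid = bx * max_threads + tx  # recover the per-anchor index
    with ib.if_scope(tid < num_anchors):
        pass  # per-anchor body goes here, guarded against the tail block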

Can you elaborate a little more?
To be explicit, what are you expecting to be injected into the IR?

I’m expecting blockIdx.x to be automatically bound to the correct thread_extent, since hybrid script doesn’t let the user query tvm.target.current_target(allow_none=False).max_num_threads.

Look at this PR! I added an intrinsic function, max_num_threads, to hybrid script.
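
A hedged sketch of the intended usage inside hybrid script (the exact signature is whatever lands in the PR; num_anchors comes from the snippet above):

    max_threads = max_num_threads()  # new intrinsic: queries the current target
    nthread_bx = (num_anchors + max_threads - 1) // max_threads  # ceil-div
    for bx in bind('blockIdx.x', nthread_bx):
        for tx in bind('threadIdx.x', max_threads):
            j = bx * max_threads + tx
            if j < num_anchors:  # guard the tail block
                pass  # per-anchor body as before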