OpenCL AutoTVM questions

I’m trying to enable AutoTVM for OpenCL (the intel_graphics target). So far I’ve had some success in that area, but the resulting times are several times worse than with the generic scheduler.

To begin with, I am focusing only on the conv2d operation (since it is also the only one currently present in the intel_graphics TOPI). I’ve taken the conv2d_direct.py file from CUDA as a dummy test file (this scheduler seemed to be the easiest) to get some idea of what is required to write my own. There are a few things I don’t understand, and I’d appreciate guidance on how such a scheduler should be written and what values I should provide. The two most pressing questions for now are:

  1. Where do all of the splitting and other numerical values in schedule_direct_cuda come from?

  2. How did you decide on the tvm.thread_axis threads?

The CUDA direct template is meant to be fairly general purpose and follows the general pattern of splitting found for many dense operators on GPU. The template itself is a distillation of general knowledge about how to write schedules for NVIDIA GPUs. In this case, it is easier to view the transformed schedule to see how it works (e.g., in terms of locality, reuse) than to try to copy each scheduling primitive.
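
For example, here is a minimal sketch of how to view a transformed schedule (the toy compute and the split factor are made up purely for illustration):

    import tvm

    # Toy elementwise compute, only to illustrate inspecting a schedule.
    n = 1024
    A = tvm.placeholder((n,), name="A")
    B = tvm.compute((n,), lambda i: A[i] * 2.0, name="B")
    s = tvm.create_schedule(B.op)

    io, ii = s[B].split(B.op.axis[0], factor=64)  # made-up factor
    s[B].bind(io, tvm.thread_axis("blockIdx.x"))
    s[B].bind(ii, tvm.thread_axis("threadIdx.x"))

    # Print the lowered IR to see what the scheduling primitives actually did.
    print(tvm.lower(s, [A, B], simple_mode=True))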

Thread axis extents are basically determined by the tuning process, since they are bound directly to loop axes.
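
As a concrete toy illustration of that (assuming the pre-0.7 AutoTVM API; this is not the conv2d template), a template can let the tuner choose the split factor, and the threadIdx extent simply follows it:

    import tvm
    from tvm import autotvm

    @autotvm.template
    def vecadd(n):
        A = tvm.placeholder((n,), name="A")
        B = tvm.placeholder((n,), name="B")
        C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")
        s = tvm.create_schedule(C.op)

        cfg = autotvm.get_config()
        # The tuner picks the split factor ...
        cfg.define_split("tile_i", C.op.axis[0], num_outputs=2)
        io, ii = cfg["tile_i"].apply(s, C, C.op.axis[0])
        # ... and the extent of threadIdx.x is whatever factor the tuner chose.
        s[C].bind(io, tvm.thread_axis("blockIdx.x"))
        s[C].bind(ii, tvm.thread_axis("threadIdx.x"))
        return s, [A, B, C]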

Another approach is to build the intel_graphics template incrementally: gradually replace hardcoded values in the intel schedule with configuration options. As long as the original values remain in the search space, performance should never drop below that of the original hardcoded schedule during this process.
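
A hypothetical sketch of that incremental replacement (the knob name and the value 16 are made up, not taken from the actual intel_graphics schedule):

    from tvm import autotvm

    cfg = autotvm.get_config()
    # Suppose the hand-written schedule hardcoded a tile size of 16:
    # keep 16 in the candidate list so the tuner can always fall back to it.
    cfg.define_knob("tile_oc", [16, 8, 32, 64])
    tile_oc = cfg["tile_oc"].val
    # ...then use tile_oc wherever the literal 16 appeared, e.g.
    # coo, coi = s[conv].split(co, factor=tile_oc)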

Thanks for the good suggestions.

While investigating the current intel_graphics conv2d.py, I’d appreciate an explanation of why you decided to use an extra hand-written _decl_cl_spatialpack in decl_conv2d instead of nn.conv2d_nchw. The latter option is present in the CUDA and x86 decls, so why not in OpenCL?

@eqy, I’d appreciate it if you could answer the questions below.

  1. Why did you decide to create a new decl (_decl_cl_spatialpack) for intel_graphics, rather than using an already existing one, for example nn.conv2d_nchw?

  2. Is the “general knowledge” you mentioned based on some NVIDIA white paper, or on some other set of documents? Could you please point to some useful resources?

  3. What is the reason for using split with nparts/factor equal to 1? From some basic experiments I noticed that this doesn’t split into inner and outer IterVars, but just renames the current IterVar to .outer (see the minimal experiment after the snippet):
        z_factor=1
        y_factor=1
        coo, coi = s[conv].split(co, nparts=1)
        ooho, oohi = s[conv].split(ooh, factor=z_factor)
        oowo, oowi = s[conv].split(oow, factor=y_factor)
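
For reference, this is the kind of minimal experiment I mentioned above (toy shapes, nothing to do with the real conv2d):

    import tvm

    n = 16
    A = tvm.placeholder((n,), name="A")
    B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = tvm.create_schedule(B.op)

    # factor=1: the inner loop has extent 1 and is simplified away during lowering,
    # so effectively only i.outer (with the original extent) remains.
    io, ii = s[B].split(B.op.axis[0], factor=1)
    print(tvm.lower(s, [A, B], simple_mode=True))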

1 & 3, these are written by @Laurawly who should know more details.

  2. Not in reference to any specific guide; it’s an emergent pattern found in many dense CUDA kernels. I recommend the CUDA programming guide for a description of the memory hierarchy: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.

Thank you for the answer. I’ll check those materials.

Hi @sebap, thanks for raising these questions. We use a separate conv2d schedule for intel_graphics because we want to leverage the “subgroup” Intel OpenCL extension (https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.html), just as it is heavily used in the clDNN library.

Thank you @Laurawly for the info.
Would you be able to explain a bit more on the topic? In particular, I’d like to better understand which parts of the intel_graphics scheduler benefit the most from subgroups.

@sebap In the places where I use ‘warp’, such as here: https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py#L484, I’m trying to create subgroups. Data movement associated with those variables triggers the TVM IR to be translated into subgroup-related functions such as shuffle.
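
One way to see this in practice is to build a conv2d for the intel_graphics target and inspect the generated OpenCL source for intel_sub_group_* calls. The workload below is hypothetical and only meant as a sketch of that inspection (pre-0.7 Relay API):

    import tvm
    from tvm import relay

    # Hypothetical NCHW conv2d workload, just to trigger the intel_graphics schedule.
    data = relay.var("data", shape=(1, 32, 56, 56), dtype="float32")
    weight = relay.var("weight", shape=(32, 32, 3, 3), dtype="float32")
    out = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
    func = relay.Function([data, weight], out)

    graph, lib, params = relay.build(func, target=tvm.target.intel_graphics())

    # The device code lives in an imported module; search it for subgroup
    # intrinsics such as intel_sub_group_shuffle.
    print(lib.imported_modules[0].get_source())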