Conv2d does not respect thread limit

I ran the getting-started script (https://tvm.apache.org/docs/_downloads/39e437b36375e97c049f64073eade7a6/relay_quick_start.py) with a lightly modified version of TVM that lets OpenCL target my accelerator and prints some execution traces. Even though I set max_num_threads to 128, the first conv2d kernel had a local work size of 224 threads, which led to an OpenCL error. I know the schedule is selected in the same way as for a GPU, and I believe the NCHW schedule is used, but I have trouble finding my way around the operator implementations, so I was not able to check whether the schedule is responsible. Where in the code is the local work size determined for OpenCL targets, and why is the thread limit bypassed? Is it because of some predefined CUDA AutoTVM settings?
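For context, this is roughly how I build the model, condensed into a sketch (recent TVM API, simplified from relay_quick_start.py; the only relevant change on my side is the max_num_threads flag on the target):

```python
import tvm
from tvm import relay
from tvm.relay import testing

# max_num_threads is supposed to cap the local work size of the generated kernels.
target = tvm.target.Target("opencl -max_num_threads=128")

# Same ResNet-18 workload as in relay_quick_start.py.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
```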

I found out why it used 224 threads instead of 128: a tophub config was being applied to the convolution. While this feature is great for common hardware, it can be confusing for unorthodox devices. I think this feature deserves more visible documentation, and maybe a proper way to disable downloading new configurations, such as a single variable, instead of having to edit the code.
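As far as I understand the dispatch logic, installing my own AutoTVM dispatch context around the build keeps the downloaded tophub configs from being applied, so something like the sketch below is the only workaround I see short of editing the code (my_device_tuning.log is a hypothetical log of AutoTVM records for my accelerator; I have not checked this across TVM versions):

```python
import tvm
from tvm import autotvm, relay

# mod, params and target as in the sketch above.
# Applying my own tuning records should take precedence over tophub.
with autotvm.apply_history_best("my_device_tuning.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```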

Besides, I noticed that setting -device=custom_accelerator did not trigger a fallback config; an Nvidia GPU config was used instead. Is it a good idea to do this when an unknown device is specified? For CUDA it makes sense, because the hardware is always an Nvidia GPU, but OpenCL runs on much more diverse hardware, and this can lead to trouble.
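For reference, this is the kind of target I passed; custom_accelerator is just my own device name, and inspecting the target keys is only my assumption about what the AutoTVM lookup matches against:

```python
import tvm

# "custom_accelerator" is my own device name, not a device key TVM knows about.
target = tvm.target.Target("opencl -device=custom_accelerator -max_num_threads=128")

# My understanding is that these keys drive which fallback/tophub entries get
# picked, so I expected an unknown device to fall back to a generic config.
print(target.keys)
```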