[AutoTuning] Specifying autotvm runner for local GPUs on the same machine


#1

Hello,

When specifying the AutoTVM measure_option runner on GPUs, what is the right option if there are multiple GPUs on a single machine and we want to choose which GPU(s) to use for the tuning process? Also, do we need to specify a context for the runner, like the context we provide when creating a TVM runtime? If not, how does the runner figure out which GPU device it will measure performance on?

Best,
Minjia


#2

You need to use the RPC Tracker to manage multiple devices.
(https://docs.tvm.ai/tutorials/autotvm/tune_relay_cuda.html#scale-up-measurement-by-using-multiple-devices)

You can run one RPC server per card and select which card each server uses via the environment variable CUDA_VISIBLE_DEVICES.
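For example, launching one server per card could be scripted like this. The rpc_server_cmd helper and the two-GPU loop are my own illustration, not part of TVM; the key and tracker address are placeholders you would replace with your own:

```python
import os
import subprocess

def rpc_server_cmd(gpu_id, key, tracker):
    """Build the command and environment for one TVM RPC server pinned to
    a single GPU via CUDA_VISIBLE_DEVICES (illustrative helper)."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    cmd = ["python3", "-m", "tvm.exec.rpc_server",
           "--key", key, "--tracker", tracker]
    return cmd, env

# One server per card; the tracker then hands out free devices by key.
for gpu_id in (0, 1):
    cmd, env = rpc_server_cmd(gpu_id, "titanv100", "0.0.0.0:9190")
    # subprocess.Popen(cmd, env=env)  # launch each server in the background
```

Each server only sees the one card its environment exposes, so the tuner never needs to pick a device index itself.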


#3

Hi @merrymercy

Thank you for your reply! Yes, I've tried using the environment variable to register the GPU device with the RPC server, as follows:
CUDA_VISIBLE_DEVICES=0 python3 -m tvm.exec.rpc_server --key titanv100 --tracker=0.0.0.0:9190

However, I'm seeing an error that seems to be related to running the generated library through RPC on this GPU. The error happens in the run_through_rpc() function in tvm/python/tvm/autotvm/measure/measure_methods.py.

When doing autotuning on GPU, AutoTVM seems to use RPC to load the generated module (e.g., /tmp/tmpn57rz_kj/tmp_func_92eda338bf50c2d9.tar) via remote.load_module, but the load fails when calling _LoadRemoteModule() in tvm/python/tvm/rpc/client.py.

run_through_rpc() (in autotvm/measure/measure_methods.py):
remote = request_remote(*remote_args)
remote.upload(build_result.filename)
func = remote.load_module(os.path.split(build_result.filename)[1])  # This line is failing. build_result.filename is '/tmp/tmp3l8m03wv/tmp_func_8c2bb1b7b10e4148.tar'
ctx = remote.context(str(measure_input.target), 0)

def load_module(self, path):
    return base._LoadRemoteModule(self._sess, path)  # This line calls into ABCMeta.__instancecheck__ in /usr/lib/python3.5/abc.py and throws an exception.

def __instancecheck__(cls, instance):
    """Override for isinstance(instance, cls)."""
    # Inline the cache checking
    subclass = instance.__class__
    if subclass in cls._abc_cache:
        return True
    subtype = type(instance)
    if subtype is subclass:
        if (cls._abc_negative_cache_version ==
            ABCMeta._abc_invalidation_counter and
            subclass in cls._abc_negative_cache):
            return False        # The method returns False here
        # Fall back to the subclass check.
        return cls.__subclasscheck__(subclass)
    return any(cls.__subclasscheck__(c) for c in {subclass, subtype})
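To double-check my understanding of the intended flow, here is a small mock of the upload/load path handling. MockRemote is my own stand-in for the RPC session, not a TVM class: the full local path is uploaded, then the module is loaded remotely by its basename.

```python
import os

class MockRemote:
    """Stand-in for the RPC session returned by request_remote (illustration only)."""
    def __init__(self):
        self.files = set()

    def upload(self, local_path):
        # The real session copies the file to the server's temp
        # directory, where it is addressed by basename.
        self.files.add(os.path.basename(local_path))

    def load_module(self, name):
        # The real session returns a tvm.runtime.Module; here we only
        # check that the requested basename was uploaded.
        if name not in self.files:
            raise FileNotFoundError(name)
        return name

remote = MockRemote()
filename = "/tmp/tmp3l8m03wv/tmp_func_8c2bb1b7b10e4148.tar"
remote.upload(filename)
func = remote.load_module(os.path.split(filename)[1])
print(func)  # tmp_func_8c2bb1b7b10e4148.tar
```

So the basename lookup itself seems fine, which is why I suspect the failure is inside _LoadRemoteModule rather than in the path handling.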

python3 -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
Tracker address 0.0.0.0:9190
Server List
----------------------------
server-address key
----------------------------
127.0.0.1:42390 server:titanv100
----------------------------

Queue Status
---------------------------------
key total free pending
---------------------------------
titanv100 1 1 0
---------------------------------

Do you know what might go wrong?