TVM autotuner consistently failing with larger matrix sizes

When I run matrix multiplication with size 1k x 1k x 1k, the tuning runs fine. With size 2k x 2k x 2k, though, it consistently fails with this error:

result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n [bt] (3) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x48) [0x14944c70a9f8]\n [bt] (2) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::RPCModuleNode::WrapRemote(void*)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x3b) [0x14944c76e45b]\n [bt] (1) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(tvm::runtime::RPCSession::CallFunc(void*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, tvm::runtime::PackedFunc const*)+0x175) [0x14944c776845]\n [bt] (0) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x22) '),), error_no=4, all_cost=10.230645656585693, timestamp=1574050733.5890949)

I am able to run the 2k x 2k x 2k matmul in C++, though.

Did you get the Relay module working without auto-tuning? Is your target a GPU?

error_no=4 indicates a runtime error, which may be caused by various reasons (e.g., running out of device memory). If AutoTVM consistently fails, then it’s highly likely that the Relay module will fail even without AutoTVM.
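
If you want to see which condition a given error_no maps to, the codes are defined in the autotvm measure module; a quick sketch (assuming the 0.6-era module layout), for illustration only:

from tvm.autotvm.measure import MeasureErrorNo

# error_no=4 is MeasureErrorNo.RUNTIME_DEVICE: the compiled kernel failed while
# running on the device (e.g., out of device memory).
print(MeasureErrorNo.RUNTIME_DEVICE)  # prints 4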

  1. Did you get the Relay module working without auto-tuning?

    When I remove tuning calls, it says:

    Cannot find config for target=llvm, workload=('matmul', 2048, 2048, 2048, 'float32'). A fallback 
    configuration is used, which may bring great performance regression.
    GFlops: 0.9504301626367904
    

    So it seems like yes, it’s working.

  2. Is your target GPU?

    Nope, just “llvm” (CPU).

@comaniac, let me know if you have any other questions :slight_smile:.

Sorry, I was working on other tasks these past few days. I will try to reproduce the problem today or tomorrow if possible to locate the issue.


Where does your matmul op come from? Did you implement it yourself following the tutorial? If you just want to get matrix multiplication working, you can directly use the TOPI built-in dense. For example:

import tvm
from tvm import autotvm, relay

# Batch, input-feature, and output-feature sizes for a 2k x 2k x 2k dense/matmul
B = 2048
I = 2048
O = 2048
dtype = "float32"

# nn.dense computes x * w^T, so the weight is declared as (O, I)
x = relay.var("x", shape=(B, I), dtype=dtype)
w = relay.var("w", shape=(O, I), dtype=dtype)
net = relay.nn.dense(x, w)
module = relay.Module.from_expr(net)
module = relay.transform.InferType()(module)
target = 'llvm -mcpu=???'  # Your CPU model

# Extract the dense task and tune it (tune_tasks/tuning_option as in the
# AutoTVM tuning tutorials)
tasks = autotvm.task.extract_from_program(module['main'], target=target,
                                          params={}, ops=(relay.op.nn.dense, ))
tune_tasks(tasks, **tuning_option)
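
Once tune_tasks finishes, the tuning records also have to be applied at compile time; otherwise you will keep seeing the “Cannot find config … A fallback configuration is used” warning. A minimal sketch, assuming the tuner wrote its records to a log file (the file name below is just a placeholder):

from tvm.contrib import graph_runtime

log_file = "matmul_tuning.log"  # placeholder: wherever your tuning records are written

with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(module, target=target)

# The compiled module can then be run with the graph runtime as usual
runtime = graph_runtime.create(graph, lib, tvm.cpu())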

If you would like to know more details about how the schedule template is implemented and how AutoTVM works, please post your implementation for further investigation.


Looking at your declarations of x and w, I started wondering why it’s (O, I) and not (I, O). Looking at topi.nn.dense, it matches your input declaration since it computes $XW^{T}$, whereas relay.dense seems to do $X \times W$. So performance-wise, dense won’t be the same as matmul? I’ll try to check the output of matmul vs. dense.

It doesn’t really matter. nn.dense just accepts a different data layout than a normal GEMM. It is mainly for the fully-connected layer in DNNs, after all.

Just to clarify for my understanding: if I were to do a matmul of two tensors A and B using nn.dense, I should explicitly transpose B before feeding it to the primitive. And even though this trick gives correct results, the performance of doing this vs. doing the matmul directly would be different.
Also, I think there should be a tutorial for running a Relay function along these lines. Maybe such a tutorial already exists and I didn’t find it?
Thanks a bunch for answering! :slight_smile:.

The short answer is yes, you have to explicitly transpose B first. I believe this layout is better for DNN workloads, and that’s why it is designed this way.
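
To make that concrete, here is a minimal sketch (the small shapes, variable names, and the numpy check are just illustrative, not taken from your script) of computing A x B via nn.dense plus an explicit transpose:

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# A is (M, K) and B is (K, N). nn.dense(A, W) computes A * W^T with W of shape
# (N, K), so feeding W = B^T yields the ordinary matmul A * B.
M, K, N = 64, 32, 16
A = relay.var("A", shape=(M, K), dtype="float32")
B = relay.var("B", shape=(K, N), dtype="float32")
net = relay.nn.dense(A, relay.transpose(B, axes=(1, 0)))  # result shape (M, N)
mod = relay.Module.from_expr(net)

graph, lib, params = relay.build(mod, target="llvm")

# Check the result against numpy
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")
runtime = graph_runtime.create(graph, lib, tvm.cpu())
runtime.set_input("A", a)
runtime.set_input("B", b)
runtime.run()
np.testing.assert_allclose(runtime.get_output(0).asnumpy(), a.dot(b), rtol=1e-5)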

For the tutorial, we have a similar one, but that’s for TOPI. I don’t remember us having one for Relay. You are welcome to contribute one if possible :slight_smile:


Awesome, thank you. I’ll create a small issue to ask about the best choice of executor for various use cases and write the tutorial :slight_smile:.