TVM autotuner consistently failing with larger matrix sizes

When I run matrix multiplication with size 1k x 1k x 1k, the tuning runs fine. With size 2k x 2k x 2k, though, it consistently fails with this error:

result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n [bt] (3) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x48) [0x14944c70a9f8]\n [bt] (2) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::RPCModuleNode::WrapRemote(void*)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x3b) [0x14944c76e45b]\n [bt] (1) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(tvm::runtime::RPCSession::CallFunc(void*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, tvm::runtime::PackedFunc const*)+0x175) [0x14944c776845]\n [bt] (0) /home/vaidya.56/ml/ctvm/lib/python3.7/site-packages/tvm-0.6.dev0-py3.7-linux-x86_64.egg/tvm/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x22) '),), error_no=4, all_cost=10.230645656585693, timestamp=1574050733.5890949)

I am able to run the 2k x 2k x 2k matmul in C++, though.

Did you get the Relay module working without auto-tuning? Is your target a GPU?

error_no=4 indicates a runtime error, which may be caused by various reasons (e.g., running out of device memory). If AutoTVM consistently fails, then it’s highly likely that the Relay module will fail even without AutoTVM.
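
If you want to see which condition a given error_no maps to, the codes are defined in the autotvm measure module; a quick sketch (assuming the 0.6-era module layout), for illustration only:

from tvm.autotvm.measure import MeasureErrorNo

# error_no=4 is MeasureErrorNo.RUNTIME_DEVICE: the compiled kernel failed while
# running on the device (e.g., out of device memory).
print(MeasureErrorNo.RUNTIME_DEVICE)  # prints 4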

  1. Did you get the Relay module working without auto-tuning?

    When I remove tuning calls, it says:

    Cannot find config for target=llvm, workload=('matmul', 2048, 2048, 2048, 'float32'). A fallback 
    configuration is used, which may bring great performance regression.
    GFlops: 0.9504301626367904
    

    So it seems like yes, it’s working.

  2. Is your target GPU?

    Nope, just “llvm” (CPU).

@comaniac, let me know if you have any other questions :slight_smile:.

Sorry, I was working on other tasks these past few days. I will try to reproduce the problem today or tomorrow if possible to locate the issue.


Where does your matmul op come from? Did you implement it yourself following the tutorial? If you just want to get matrix multiplication working, you can directly use the TOPI built-in dense. For example:

import tvm
from tvm import autotvm, relay

# Batch, input-feature, and output-feature sizes for a 2k x 2k x 2k dense/matmul
B = 2048
I = 2048
O = 2048
dtype = "float32"

# nn.dense computes x * w^T, so the weight is declared as (O, I)
x = relay.var("x", shape=(B, I), dtype=dtype)
w = relay.var("w", shape=(O, I), dtype=dtype)
net = relay.nn.dense(x, w)
module = relay.Module.from_expr(net)
module = relay.transform.InferType()(module)
target = 'llvm -mcpu=???'  # Your CPU model

# Extract the dense task and tune it (tune_tasks/tuning_option as in the
# AutoTVM tuning tutorials)
tasks = autotvm.task.extract_from_program(module['main'], target=target,
                                          params={}, ops=(relay.op.nn.dense, ))
tune_tasks(tasks, **tuning_option)
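
Once tune_tasks finishes, the tuning records also have to be applied at compile time; otherwise you will keep seeing the “Cannot find config … A fallback configuration is used” warning. A minimal sketch, assuming the tuner wrote its records to a log file (the file name below is just a placeholder):

from tvm.contrib import graph_runtime

log_file = "matmul_tuning.log"  # placeholder: wherever your tuning records are written

with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(module, target=target)

# The compiled module can then be run with the graph runtime as usual
runtime = graph_runtime.create(graph, lib, tvm.cpu())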

If you would like to know more details about how the schedule template is implemented and how AutoTVM works, please post your implementation for further investigation.


Looking at your declarations of x and w, I started wondering why it’s (O, I) and not (I, O). Looking at topi.nn.dense, it matches your input declaration since it computes $XW^{T}$, whereas relay.dense seems to do $X \times W$. So performance-wise, dense won’t be the same as matmul? I’ll try to check the output of matmul vs. dense.

It doesn’t really matter. nn.dense just accepts a different data layout than a normal GEMM. It is mainly for the fully-connected layer in DNNs, after all.

Just to clarify for my understanding: if I were to do a matmul of two tensors A and B using nn.dense, I should explicitly transpose B before feeding it to the primitive. And even though this trick gives correct results, the performance of doing this vs. doing the matmul directly would be different.
Also, I think there should be a tutorial for running a Relay function along these lines. Maybe such a tutorial already exists and I didn’t find it?
Thanks a bunch for answering! :slight_smile:.

The short answer is yes, you have to explicitly transpose B first. I believe this layout is better for DNN workloads, and that’s why it is designed this way.
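
To make that concrete, here is a minimal sketch (the small shapes, variable names, and the numpy check are just illustrative, not taken from your script) of computing A x B via nn.dense plus an explicit transpose:

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# A is (M, K) and B is (K, N). nn.dense(A, W) computes A * W^T with W of shape
# (N, K), so feeding W = B^T yields the ordinary matmul A * B.
M, K, N = 64, 32, 16
A = relay.var("A", shape=(M, K), dtype="float32")
B = relay.var("B", shape=(K, N), dtype="float32")
net = relay.nn.dense(A, relay.transpose(B, axes=(1, 0)))  # result shape (M, N)
mod = relay.Module.from_expr(net)

graph, lib, params = relay.build(mod, target="llvm")

# Check the result against numpy
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")
runtime = graph_runtime.create(graph, lib, tvm.cpu())
runtime.set_input("A", a)
runtime.set_input("B", b)
runtime.run()
np.testing.assert_allclose(runtime.get_output(0).asnumpy(), a.dot(b), rtol=1e-5)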

For the tutorial, we have a similar one, but that’s for TOPI. I don’t remember us having one for Relay. You are welcome to contribute one if possible :slight_smile:


Awesome, thank you. I’ll create a small issue to ask about the best choice of executor for various use cases and write the tutorial :slight_smile:.