ROCm 'segmentation fault' error when auto-tuning

When I run a modified version of the tutorial file “tune_relay_cuda.py” (using target = “rocm”), I get the following error some time after auto-tuning starts:

Tuning...
Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'), kwargs={}, workload=('conv2d', (1, 512, 14, 14, 'float32'), (512, 512, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'))
rocm
[Task  1/ 9]  Current/Best:   21.86/3183.80 GFLOPS | Progress: (60/100) | 174.55 s

Segmentation fault (core dumped)

I am using a Vega 20 AMD GPU and I was wondering if I should add the -model xx definition to the target to avoid this.

Has anybody experienced the same issue in the past? Any information on this would be greatly appreciated.

You will need to set up an RPC server explicitly, as per https://github.com/apache/incubator-tvm/tree/master/apps/rocm_rpc, due to a limitation of the ROCm driver.
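For reference, here is a minimal sketch of how the tuning script could be pointed at an RPC server instead of measuring locally. It follows the RPCRunner pattern that the tune_relay_cuda.py tutorial already shows for remote devices. The device key, host, and port below are placeholders (assumptions, not values from this thread) and have to match whatever the tracker and the ROCm RPC server were started with.

```python
# Hypothetical tracker settings; none of these values come from the thread.
# Replace them with the key/host/port used when starting the tracker and
# the ROCm RPC server.
RPC_SETTINGS = {
    "key": "rocm-vega20",  # assumed device key registered with the tracker
    "host": "127.0.0.1",   # assumed tracker host
    "port": 9190,          # assumed tracker port
}


def make_measure_option(settings=None):
    """Build AutoTVM measure options that run candidates through RPC."""
    # Imported lazily so the settings above can be inspected without TVM.
    from tvm import autotvm

    settings = RPC_SETTINGS if settings is None else settings
    return autotvm.measure_option(
        # Candidates are still compiled on the local machine.
        builder=autotvm.LocalBuilder(timeout=10),
        # Measurements are requested from the RPC server via the tracker.
        runner=autotvm.RPCRunner(
            settings["key"],
            host=settings["host"],
            port=settings["port"],
            number=20,
            repeat=3,
            timeout=4,
            min_repeat_ms=150,
        ),
    )
```

The returned object would take the place of the local-runner measure_option in the tutorial's tuning options. The numeric values are the ones used in the tutorial's RPC example and may need adjusting.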

Hi @tqchen

Thank you for your prompt reply. I am following the instructions in the link you sent but, when executing the Makefile, I get:

rocm_runtime_pack.cc:33:52: fatal error: …/…/src/contrib/miopen/conv_forward.cc: No such file or directory

I noticed that the directory

…/…/src/contrib/miopen

does not exist. I could find the missing file in

…/…/src/runtime/contrib/miopen/

I modified the Makefile accordingly, but then it cannot find miopen.h. Do you know what I may be missing to successfully build this app?

You will need to compile with the MIOpen header in your include path. Alternatively, you can remove the miopen.cc; this won’t affect the autotvm part.

Given that it happens after 60 steps, this might not be ROCm but rather the xgboost module. In that case, upgrading xgboost to the pre-release or downgrading it helps.

That said, we also fixed a potential segfault in the AMDGPU LLVM codegen last week, so upgrading to the latest TVM master might be a good idea.

Best regards

Thomas

Downgrading to xgboost 0.90 fixed the segmentation fault issue!

Thanks a lot @t-vi
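In case it helps others hitting the same crash, here is a small sketch for checking which xgboost version is installed before starting a long tuning run. It only uses the Python standard library (importlib.metadata, Python 3.8+). The helper names are made up for illustration, and treating 0.90 as the "known good" version is based only on the report in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Keep the leading integer of each dot-separated piece, e.g. '0.90' -> (0, 90)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# Assumption: 0.90 is the version reported to work in this thread.
KNOWN_GOOD = (0, 90)

if __name__ == "__main__":
    found = installed_version("xgboost")
    if found is None:
        print("xgboost is not installed")
    elif version_tuple(found)[:2] == KNOWN_GOOD:
        print(f"xgboost {found}: matches the version reported to work here")
    else:
        print(f"xgboost {found}: differs from 0.90; consider pinning it")
```

If the check reports a different version, pinning it with pip install xgboost==0.90 reproduces the setup that worked here.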

Hi @t-vi,

I have one follow-up question. Do you know the location of the file defining the schedule for the ROCm backend conv2d? So far I have checked the file in the link below, but I haven’t been able to find the schedule template. I would appreciate any information in this regard.

Currently, we use the CUDA schedule (and op) on ROCm: