ROCm apps/benchmark/gpu_imagenet.py failures loading kernel


#1

I have been unable to run the apps/benchmark/gpu_imagenet.py script with ROCm. Basic ROCm functionality otherwise works for me.

issue details


#2

Sorry for my late reply. Yes, I’m aware of this issue but haven’t looked into it. If I recall correctly, everything was working fine up until ROCm 1.9. After I upgraded to ROCm 2.0 or 2.1 (I forget the exact version), end-to-end workloads using NNVM stopped working with the same error as yours.

Do you know of any relevant API changes on the ROCm side? Our ROCm runtime backend hasn’t changed since its inception. Do you see any problem in our ROCm API usage?

If this is not a problem on ROCm’s side, then it is definitely our compiler’s problem.


#3

There is also this strange HSA_STATUS_ERROR_INVALID_ISA issue mentioned in this thread. Until ROCm 1.8 or 1.9, everything was working great (as demonstrated by the AutoTVM benchmark result).

I don’t know why our ROCm backend is currently broken, but since our AMDGPU codegen and runtime haven’t changed since last year, I think it has something to do with changes on the ROCm side during the ROCm 2 transition.


#4

I haven’t tried anything older than ROCm 2.1, but I see the same behavior between ROCm 2.1 and ROCm 2.5. These versions aren’t too difficult to try, since there is a “rocm-terminal” docker container and repo.radeon.com has the older packages directly downloadable. It will take me a little bit of time, but I will see if I can set up an older version.

By the way, I looked at tutorial examples that used the CUDA backend and did trivial conversions to substitute rocm as the target and as the runtime, and I listed what works here: https://github.com/mvermeulen/rocm-tvm/blob/master/tutorial/README.txt
So I think the basic plumbing is in place, but I see some individual examples, like the Relay tutorial and the benchmark script, not working.

I am not an expert in the APIs, but the runtime parts in particular look pretty simple. I am also trying to add some trace/logging at compile time (unless it is already there), since that is where I’m guessing the benchmark script issue comes from.
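As a generic way to add that kind of compile-time trace, one could wrap the build step in a logging decorator. This is just an illustrative sketch, not TVM’s instrumentation API; `build_step` below is a hypothetical stand-in for the real compile call.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG)

def traced(fn):
    """Log entry and exit of fn -- a generic way to see which
    compile stage runs before a failure."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logging.debug("entering %s", fn.__name__)
        result = fn(*args, **kwargs)
        logging.debug("leaving %s", fn.__name__)
        return result
    return wrapper

# Hypothetical usage: decorate the build step you want to trace.
@traced
def build_step(x):
    return x * 2
```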


#5

Can you try this script? It uses Relay to compile and run VGG and Resnet.

My current ROCm install is v2.3 with LLVM 6.0. For VGG everything seems to be working, but for Resnet (you can uncomment the relevant part at the end of the script to run it), it gives me an HSA_STATUS_ERROR_INVALID_ISA error.

So, given that I’m not getting the hipErrorNotFound error when compiling with Relay, I think there is some problem with NNVM compilation (gpu_imagenet_bench.py uses NNVM, not Relay). I expect the same issue applies to the NVPTX backend.

I’ll look into the NNVM compilation issue. Once it is fixed, I’ll try NNVM compilation on Resnet (this should work, since last year’s AutoTVM results were obtained with NNVM).


#6

I was able to run the simplified gpu_imagenet_bench.py below with NNVM (run it from the same directory as gpu_imagenet_bench.py). The key was to remove -model=gfx900 from the target string.

Even though I am running on gfx803, with the target string "rocm -model=gfx900" I somehow get object code for gfx900 (I see .hsa_code_object_isa 9,0,0,"AMD","AMDGPU" in my asm dump).

If I remember correctly, "-model=gfx900" is supposed to select which pre-tuned schedule is applied during codegen, so I can use the schedule tuned for gfx900 even though my card is gfx803. It was never meant to affect the target architecture for codegen.
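To make that concrete: -model is just one key/value option inside the target string, separate from any codegen flag. A minimal sketch of splitting such a string (the helper below is hypothetical, not TVM’s actual target parser):

```python
def parse_target(target_str):
    """Split a TVM-style target string such as 'rocm -model=gfx900'
    into the device name and an option dict."""
    name, *opts = target_str.split()
    return name, dict(o.lstrip('-').split('=', 1) for o in opts)

print(parse_target("rocm -model=gfx900"))
# ('rocm', {'model': 'gfx900'})
```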

import argparse
import threading

import numpy as np

import tvm
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime
import nnvm.compiler
import nnvm.testing
from util import get_network

network = "vgg-19" # works
#network = "resnet-18" # HSA_STATUS_ERROR_INVALID_ISA
net, params, input_shape, output_shape = get_network(network, batch_size=1)

target = tvm.target.create('rocm') # not rocm -model=gfx900
dtype = 'float32'
with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(
        net, target=target, shape={'data': input_shape}, params=params, dtype=dtype)
ctx = tvm.context(str(target), 0)
module = runtime.create(graph, lib, ctx)
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input('data', data_tvm)
module.set_input(**params)

# evaluate
ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=500)
prof_res = np.array(ftimer().results) * 1000  # multiply by 1000 to convert to milliseconds
print("%-20s %-19s (%s)" % (network, "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))
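As a quick sanity check of which architecture a code object was actually built for, one can scan an asm dump for the .hsa_code_object_isa directive mentioned above. A small sketch (the function name is mine, not part of TVM or ROCm):

```python
import re

def code_object_isa(asm_text):
    """Return the (major, minor, stepping) version from the first
    .hsa_code_object_isa directive in an AMDGPU asm dump, or None
    if the directive is absent."""
    m = re.search(r'\.hsa_code_object_isa\s+(\d+),\s*(\d+),\s*(\d+)', asm_text)
    return tuple(int(g) for g in m.groups()) if m else None

asm = '.hsa_code_object_isa 9,0,0,"AMD","AMDGPU"'
print(code_object_isa(asm))  # (9, 0, 0)
```

A (9, 0, 0) result on a gfx803 card would confirm the mismatch described above.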

#7

However, for Resnet it still gives the INVALID_ISA error. I have no idea why.

@eqy Do you still have a ROCm 1.8 or 1.9 install? If so, can you reproduce last year’s AutoTVM result with the latest TVM?


#8

I tried the test script. With VGG, it passes:

root@chilecito:/home/mev# python3 test_rocm.py
Target:  rocm
...100%, 0.12 MB, 428 KB/s, 0 seconds passed
Cannot find config for target=rocm, workload=('dense', (1, 4096, 'float32'), (1000, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=rocm, workload=('dense', (1, 4096, 'float32'), (4096, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=rocm, workload=('dense', (1, 25088, 'float32'), (4096, 25088, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
max abs diff:  1.1641532e-10

With resnet-18, it fails:

root@chilecito:/home/mev# python3 test_rocm.py
Target:  rocm
Cannot find config for target=rocm, workload=('dense', (1, 512, 'float32'), (1000, 512, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_INVALID_ISA (0x100f) at file:mcwamp_hsa.cpp line:1195
Aborted (core dumped)

Configuration information likely doesn’t matter, but this was run in a docker container with ROCm 2.3 on a Vega 64 (gfx900).


#9

I am also able to run the simplified gpu_imagenet_bench.py script where -model=gfx900 was removed.

root@chilecito:/src/tvm/apps/benchmark# python3 simple_gpu_imagenet_bench.py 
WARNING:autotvm:Cannot find config for target=rocm, workload=('dense', (1, 25088, 'float32'), (4096, 25088, 'float32'), (4096, 'float32'), 0). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=rocm, workload=('dense', (1, 4096, 'float32'), (4096, 4096, 'float32'), (4096, 'float32'), 0). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=rocm, workload=('dense', (1, 4096, 'float32'), (1000, 4096, 'float32'), (1000, 'float32'), 0). A fallback configuration is used, which may bring great performance regression.
vgg-19               6.17 ms             (0.24 ms)

Even though I am running on a Vega 64 (gfx900), it looks like this option is related to the script failure (and is perhaps independent of what is going on with resnet-18).


#10

After working around the -model= issue, I tried each of the networks listed in the benchmark script. Following are my results of what did and didn’t run:

resnet-18       - core dump
resnet-34       - core dump
resnet-50       - OK
vgg-16          - OK
vgg-19          - OK
densenet-121    - core dump
inception_v3    - OK
mobilenet       - OK
mobilenet_v2    - core dump
squeezenet_v1.0 - OK
squeezenet_v1.1 - OK