AutoTuner error (error_no=1)

Hi,

I am getting errors using AutoTVM.

  • I am using CUDA 10 & LLVM 7.
  • I pass "-ccbin /usr/bin/cuda-gcc" as an option through tvm/contrib/nvcc.py, to make sure GCC 7.3 (the CUDA-compatible version) is used; see the sketch below.
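
For reference, one way to get this effect without patching the file is TVM's CUDA compile callback. A minimal sketch, assuming this TVM version's tvm_callback_cuda_compile hook and the nvcc.compile_cuda signature (check your local contrib/nvcc.py):

    # Minimal sketch: register a CUDA compile callback so nvcc is invoked with
    # "-ccbin /usr/bin/cuda-gcc" (same effect as editing contrib/nvcc.py).
    import tvm
    from tvm.contrib import nvcc

    @tvm.register_func("tvm_callback_cuda_compile", override=True)
    def tvm_callback_cuda_compile(code):
        # compile_cuda forwards `options` onto the nvcc command line
        return nvcc.compile_cuda(code, target="ptx",
                                 options=["-ccbin", "/usr/bin/cuda-gcc"])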

I spent some time debugging the AutoTVM process, and all of the generated CUDA C code compiles fine into .ptx.
But I have no idea why the kernels are being rejected as invalid.

$ wget https://raw.githubusercontent.com/dmlc/tvm/master/tutorials/autotvm/tune_conv2d_cuda.py

$ python3 tune_conv2d_cuda.py

ConfigSpace (len=10454400, space_map=
0 tile_f: Split(policy=all, product=512, num_outputs=4) len=220
1 tile_y: Split(policy=all, product=7, num_outputs=4) len=4
2 tile_x: Split(policy=all, product=7, num_outputs=4) len=4
3 tile_rc: Split(policy=all, product=512, num_outputs=3) len=55
4 tile_ry: Split(policy=all, product=3, num_outputs=3) len=3
5 tile_rx: Split(policy=all, product=3, num_outputs=3) len=3
6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
7 unroll_explicit: OtherOption([0, 1]) len=2
)
Get devices for measurement successfully!
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=200, timestamp=1546003014.4647906) [('tile_f', [128, 4, 1, 1]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [1, 1, 7, 1]), ('tile_rc', [8, 16, 4]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6665122
No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.020939111709594727, timestamp=1546003013.933304) [(‘tile_f’, [2, 16, 16, 1]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [16, 4, 8]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7461118
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.0171053409576416, timestamp=1546003013.9334168) [(‘tile_f’, [2, 8, 16, 2]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [1, 4, 128]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6957588
No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.023586273193359375, timestamp=1546003013.933508) [(‘tile_f’, [128, 4, 1, 1]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [2, 1, 256]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 0)],None,377962
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.02170276641845703, timestamp=1546003014.5508878) [(‘tile_f’, [4, 8, 8, 2]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [2, 256, 1]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7580402
No: 6 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003015.4716434) [(‘tile_f’, [32, 1, 4, 4]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [64, 8, 1]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6205875
No: 7 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.014199495315551758, timestamp=1546003014.8930523) [(‘tile_f’, [4, 1, 4, 32]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [16, 4, 8]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2039594
No: 8 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.01955866813659668, timestamp=1546003014.893228) [(‘tile_f’, [16, 16, 1, 2]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [4, 32, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 0)],None,4344839
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 9 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.02602696418762207, timestamp=1546003015.5407526) [(‘tile_f’, [1, 1, 4, 128]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [128, 1, 4]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6843315
No: 10 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.01750779151916504, timestamp=1546003015.5466475) [(‘tile_f’, [2, 4, 1, 64]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [2, 1, 256]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,5411762
No: 11 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.014719486236572266, timestamp=1546003015.5467696) [(‘tile_f’, [2, 4, 4, 16]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [8, 2, 32]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6342777
No: 12 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003018.0453963) [(‘tile_f’, [2, 8, 1, 32]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [256, 1, 2]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2361008
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 13 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.017080307006835938, timestamp=1546003018.0889275) [(‘tile_f’, [1, 64, 2, 4]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [1, 512, 1]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2162934
No: 14 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003019.7524147) [(‘tile_f’, [32, 1, 8, 2]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [8, 2, 32]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,9051979
No: 15 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.016710519790649414, timestamp=1546003019.188715) [(‘tile_f’, [32, 8, 2, 1]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [8, 16, 4]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6278153
No: 16 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.013083219528198242, timestamp=1546003019.1888168) [(‘tile_f’, [8, 8, 8, 1]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [2, 2, 128]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,5406530
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 17 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003022.3853087) [(‘tile_f’, [4, 1, 8, 16]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [512, 1, 1]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6390079
No: 18 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.012630701065063477, timestamp=1546003020.3712323) [(‘tile_f’, [4, 1, 8, 16]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [8, 16, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2017139
No: 19 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003022.957653) [(‘tile_f’, [32, 8, 1, 2]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [32, 4, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 0)],None,4333618
No: 20 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.016761302947998047, timestamp=1546003021.8294916) [(‘tile_f’, [1, 4, 4, 32]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [4, 4, 32]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7311456

The tuning starts with some random configurations, which are likely to be invalid. You can set a larger n_trial (>200) in https://github.com/dmlc/tvm/blob/3516cbe0049c7e11ee58afbc668acddb1f110ece/tutorials/autotvm/tune_conv2d_cuda.py#L183
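
For reference, a minimal sketch of the relevant tuning call, assuming the variable names from tune_conv2d_cuda.py (task, measure_option):

    # Raise n_trial so the tuner can get past the random initial configurations,
    # most of which are invalid for this large search space.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=2000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('conv2d.log')])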

Still bad, even with n_trial = 2000.

  • In addition, the following warning now also appears:

    Too many errors happen in the tuning. Now is in debug mode
    WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode

From the error log, I think the key point is the __builtin_expect assertion. Could you tell us why it references /usr/include/c++/8? That seems to be GCC 8, right? But you say cuda-gcc is GCC 7.3. Could you confirm the environment details?

@FrozenGene

  • TVM & XGBoost were compiled with the host compiler GCC 8, except for the CUDA parts (GCC 7.3, aliased as cuda-gcc):

     -DUSE_CUDA=ON \
     -DUSE_CUDNN=ON \
     -DUSE_CUBLAS=ON \
     -DCUDA_PROPAGATE_HOST_FLAGS=OFF \
     -DCUDA_SELECT_NVCC_ARCH_FLAGS="Auto" \
     -DCUDA_HOST_COMPILER="/usr/bin/cuda-gcc" \
    
  • It is not possible to compile anything with nvcc & GCC 8 (it throws an incompatibility error).

  • I added some debug output in contrib/nvcc.py that shows the kernels compile correctly; see e.g. this line:

cmd: {['nvcc', '--ptx', '-O3', '-arch', 'sm_61', '-ccbin', 'cuda-gcc', '-o', '/tmp/tmpxxthqfo2/my_kernel.ptx', '/tmp/tmpxxthqfo2/my_kernel.cu']}

  • Also, the compile command above generates the .ptx just fine (no compilation errors) using GCC 7.3.

  • I have not yet figured out where these generated .ptx files are instantiated, so that I can debug further into the runtime.

  • I am also wondering about /usr/include/c++/8/bits/stl_vector.h:932; I think it comes from an empty vector that is indexed somewhere (perhaps used in some later statistics about values returned from the kernels).

  • I redid the test after completely removing GCC 8 (and all related .rpm sub-packages), to make sure GCC 8 is not called in any way, but TVM still fails.


  • Reposting partial logs, with the cmd debug output from nvcc.py:

ConfigSpace (len=10454400, space_map=
0 tile_f: Split(policy=all, product=512, num_outputs=4) len=220
1 tile_y: Split(policy=all, product=7, num_outputs=4) len=4
2 tile_x: Split(policy=all, product=7, num_outputs=4) len=4
3 tile_rc: Split(policy=all, product=512, num_outputs=3) len=55
4 tile_ry: Split(policy=all, product=3, num_outputs=3) len=3
5 tile_rx: Split(policy=all, product=3, num_outputs=3) len=3
6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
7 unroll_explicit: OtherOption([0, 1]) len=2
)
Get devices for measurement successfully!
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.03849625587463379, timestamp=1546032903.7520049) [(‘tile_f’, [16, 8, 1, 4]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [1, 16, 32]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 0)],None,1124083
No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.02529001235961914, timestamp=1546032903.752144) [(‘tile_f’, [16, 1, 8, 4]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [1, 128, 4]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,9966781
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.012875080108642578, timestamp=1546032903.7522757) [(‘tile_f’, [1, 4, 8, 16]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [16, 8, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 0)],None,3759321
No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.012762069702148438, timestamp=1546032903.7523637) [(‘tile_f’, [1, 64, 8, 1]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [4, 2, 64]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,3453373

cmd: {['nvcc', '--ptx', '-O3', '-arch', 'sm_61', '-ccbin', 'cuda-gcc', '-o', '/tmp/tmpxxthqfo2/my_kernel.ptx', '/tmp/tmpxxthqfo2/my_kernel.cu']}

/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.019515514373779297, timestamp=1546032903.859914) [(‘tile_f’, [2, 32, 8, 1]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [32, 16, 1]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7565392
No: 6 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.015964508056640625, timestamp=1546032903.8601024) [(‘tile_f’, [2, 1, 128, 2]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [4, 128, 1]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2737337
No: 7 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=4, timestamp=1546032906.4737227) [(‘tile_f’, [8, 8, 1, 8]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [128, 4, 1]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2526839
No: 8 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.050591468811035156, timestamp=1546032905.895398) [(‘tile_f’, [4, 8, 4, 4]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [1, 64, 8]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,9215918

cmd: {['nvcc', '--ptx', '-O3', '-arch', 'sm_61', '-ccbin', 'cuda-gcc', '-o', '/tmp/tmpytumlzli/my_kernel.ptx', '/tmp/tmpytumlzli/my_kernel.cu']}

How about LLVM? I suggest rebuilding TVM / LLVM using GCC 7.3 and making GCC 7.3 the default compiler. After that, we don't need to pass -ccbin /usr/bin/cuda-gcc or change nvcc.py. If the issue persists, we can be sure it is not an environment problem and can investigate further.

InstantiationError('Skipped because of invalid gpu kernel') is thrown when the VerifyGPUCode pass fails. You can add some logging in https://github.com/dmlc/tvm/blob/3516cbe0049c7e11ee58afbc668acddb1f110ece/src/pass/verify_gpu_code.cc#L55 to print the exact reason the kernel is invalid.

@FrozenGene,

Found the cause of the issue (a compile flag), but I need to investigate further why that flag is problematic.

@vinix13

I will keep the location of the IR verification in mind and debug further (using the troublesome C flags). Thanks for the hint!

So, I recompiled with host GCC 8 and the nvcc & GCC 7.3 pair as usual, but without the standard Fedora/RedHat-specific additional flags (used by every single .rpm in the distro):

These are the flags that are auto-added in the standard way:

 ~/rpmbuild/BUILD/tvm/build ~/rpmbuild/BUILD/tvm
CFLAGS='
-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 
-Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong 
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
-fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
  • Now TVM's autotune example works fine.
  • For the record, LLVM is 7.0.1 and seems to be fine too.

I believe the stack protection interferes somewhere inside TVM; I need to find out where. Perhaps there is a vector somewhere that is not properly handled (i.e. indexed while empty).

I will try to propose two PRs:

  1. Allow options from userland, e.g. tvm.target.create("cuda", options=["-ccbin", "cuda-gcc"]); see the sketch after this list.

  2. Fix TVM to work even with flags such as -fexceptions and -fstack-protector-strong; I am sure there is something uncovered in the code.
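
A rough illustration of what the first proposal could look like from the user side (hypothetical; tvm.target.create does not accept such an options argument today, which is exactly what the PR would add):

    # Hypothetical usage for proposal 1 -- not an existing TVM API, only the
    # shape of what is being proposed: let the CUDA target carry extra nvcc
    # flags so contrib/nvcc.py never needs to be patched by hand.
    import tvm

    target = tvm.target.create("cuda")                  # what exists today
    # Proposed (hypothetical) form:
    # target = tvm.target.create("cuda", options=["-ccbin", "/usr/bin/cuda-gcc"])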

I have the same problem. Do you mean recompiling LLVM without these CFLAGS?

@snowolfhawk,

No, only TVM itself needs to be compiled without those RedHat/Fedora-specific hardening flags (BTW, those flags are added by the rpmbuild suite).