AutoTuner error (error_no=1)

Hi,

I am getting errors using AutoTVM.

  • I am using CUDA 10 & LLVM 7.
  • I pass "-ccbin /usr/bin/cuda-gcc" as an option through tvm/contrib/nvcc.py, to make sure GCC 7.3 (the CUDA-compatible version) is used; see the sketch below.
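
For reference, one way to get this effect without patching the file is TVM's CUDA compile callback. A minimal sketch, assuming this TVM version's tvm_callback_cuda_compile hook and the nvcc.compile_cuda signature (check your local contrib/nvcc.py):

    # Minimal sketch: register a CUDA compile callback so nvcc is invoked with
    # "-ccbin /usr/bin/cuda-gcc" (same effect as editing contrib/nvcc.py).
    import tvm
    from tvm.contrib import nvcc

    @tvm.register_func("tvm_callback_cuda_compile", override=True)
    def tvm_callback_cuda_compile(code):
        # compile_cuda forwards `options` onto the nvcc command line
        return nvcc.compile_cuda(code, target="ptx",
                                 options=["-ccbin", "/usr/bin/cuda-gcc"])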

I spent some time debugging the AutoTVM process, and all of the generated CUDA C code compiles fine into .ptx.
But I have no idea why the kernels are being rejected as invalid.

$ wget https://raw.githubusercontent.com/dmlc/tvm/master/tutorials/autotvm/tune_conv2d_cuda.py

$ python3 tune_conv2d_cuda.py

ConfigSpace (len=10454400, space_map=
0 tile_f: Split(policy=all, product=512, num_outputs=4) len=220
1 tile_y: Split(policy=all, product=7, num_outputs=4) len=4
2 tile_x: Split(policy=all, product=7, num_outputs=4) len=4
3 tile_rc: Split(policy=all, product=512, num_outputs=3) len=55
4 tile_ry: Split(policy=all, product=3, num_outputs=3) len=3
5 tile_rx: Split(policy=all, product=3, num_outputs=3) len=3
6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
7 unroll_explicit: OtherOption([0, 1]) len=2
)
Get devices for measurement successfully!
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=200, timestamp=1546003014.4647906) [('tile_f', [128, 4, 1, 1]), ('tile_y', [7, 1, 1, 1]), ('tile_x', [1, 1, 7, 1]), ('tile_rc', [8, 16, 4]), ('tile_ry', [1, 3, 1]), ('tile_rx', [1, 1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],None,6665122
No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.020939111709594727, timestamp=1546003013.933304) [(‘tile_f’, [2, 16, 16, 1]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [16, 4, 8]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7461118
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.0171053409576416, timestamp=1546003013.9334168) [(‘tile_f’, [2, 8, 16, 2]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [1, 4, 128]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6957588
No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.023586273193359375, timestamp=1546003013.933508) [(‘tile_f’, [128, 4, 1, 1]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [2, 1, 256]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 0)],None,377962
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.02170276641845703, timestamp=1546003014.5508878) [(‘tile_f’, [4, 8, 8, 2]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [2, 256, 1]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7580402
No: 6 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003015.4716434) [(‘tile_f’, [32, 1, 4, 4]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [64, 8, 1]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6205875
No: 7 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.014199495315551758, timestamp=1546003014.8930523) [(‘tile_f’, [4, 1, 4, 32]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [16, 4, 8]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2039594
No: 8 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.01955866813659668, timestamp=1546003014.893228) [(‘tile_f’, [16, 16, 1, 2]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [4, 32, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 0)],None,4344839
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 9 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.02602696418762207, timestamp=1546003015.5407526) [(‘tile_f’, [1, 1, 4, 128]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [128, 1, 4]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6843315
No: 10 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.01750779151916504, timestamp=1546003015.5466475) [(‘tile_f’, [2, 4, 1, 64]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [2, 1, 256]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,5411762
No: 11 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.014719486236572266, timestamp=1546003015.5467696) [(‘tile_f’, [2, 4, 4, 16]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [8, 2, 32]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6342777
No: 12 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003018.0453963) [(‘tile_f’, [2, 8, 1, 32]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [256, 1, 2]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2361008
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 13 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.017080307006835938, timestamp=1546003018.0889275) [(‘tile_f’, [1, 64, 2, 4]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [1, 512, 1]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2162934
No: 14 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003019.7524147) [(‘tile_f’, [32, 1, 8, 2]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [8, 2, 32]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,9051979
No: 15 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.016710519790649414, timestamp=1546003019.188715) [(‘tile_f’, [32, 8, 2, 1]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [8, 16, 4]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6278153
No: 16 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.013083219528198242, timestamp=1546003019.1888168) [(‘tile_f’, [8, 8, 8, 1]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [2, 2, 128]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,5406530
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 17 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003022.3853087) [(‘tile_f’, [4, 1, 8, 16]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [512, 1, 1]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,6390079
No: 18 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.012630701065063477, timestamp=1546003020.3712323) [(‘tile_f’, [4, 1, 8, 16]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [8, 16, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2017139
No: 19 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=200, timestamp=1546003022.957653) [(‘tile_f’, [32, 8, 1, 2]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [32, 4, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 0)],None,4333618
No: 20 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.016761302947998047, timestamp=1546003021.8294916) [(‘tile_f’, [1, 4, 4, 32]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [4, 4, 32]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7311456

The tuning starts with some random configurations, which are likely to be invalid. You can set a larger n_trial (>200) in https://github.com/dmlc/tvm/blob/3516cbe0049c7e11ee58afbc668acddb1f110ece/tutorials/autotvm/tune_conv2d_cuda.py#L183
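
For reference, a minimal sketch of the relevant tuning call, assuming the variable names from tune_conv2d_cuda.py (task, measure_option):

    # Raise n_trial so the tuner can get past the random initial configurations,
    # most of which are invalid for this large search space.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=2000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('conv2d.log')])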

Still bad, even with n_trial = 2000.

  • In addition, the following warning now also appears:

    Too many errors happen in the tuning. Now is in debug mode
    WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode

From the error log, I think the key point is the __builtin_expect assertion. Could you tell us why it references /usr/include/c++/8? That seems to be GCC 8, right? But you say cuda-gcc is GCC 7.3. Could you confirm the environment details?

@FrozenGene

  • TVM & XGBoost were compiled with the host compiler GCC 8, except for the CUDA parts (GCC 7.3, aliased as cuda-gcc):

     -DUSE_CUDA=ON \
     -DUSE_CUDNN=ON \
     -DUSE_CUBLAS=ON \
     -DCUDA_PROPAGATE_HOST_FLAGS=OFF \
     -DCUDA_SELECT_NVCC_ARCH_FLAGS="Auto" \
     -DCUDA_HOST_COMPILER="/usr/bin/cuda-gcc" \
    
  • It is not possible to compile anything with nvcc & GCC 8 (it throws an incompatibility error).

  • I added some debug output in contrib/nvcc.py that shows the kernels compile correctly; see e.g. this line:

cmd: {['nvcc', '--ptx', '-O3', '-arch', 'sm_61', '-ccbin', 'cuda-gcc', '-o', '/tmp/tmpxxthqfo2/my_kernel.ptx', '/tmp/tmpxxthqfo2/my_kernel.cu']}

  • Also, the compile command above generates the .ptx just fine (no compilation errors) using GCC 7.3.

  • I have not yet figured out where these generated .ptx files are instantiated, so that I can debug further into the runtime.

  • I am also wondering about /usr/include/c++/8/bits/stl_vector.h:932; I think it comes from an empty vector that is indexed somewhere (perhaps used in some later statistics about values returned from the kernels).

  • I redid the test after completely removing GCC 8 (and all related .rpm sub-packages), to make sure GCC 8 is not called in any way, but TVM still fails.


  • Reposting partial logs, with the cmd debug output from nvcc.py:

ConfigSpace (len=10454400, space_map=
0 tile_f: Split(policy=all, product=512, num_outputs=4) len=220
1 tile_y: Split(policy=all, product=7, num_outputs=4) len=4
2 tile_x: Split(policy=all, product=7, num_outputs=4) len=4
3 tile_rc: Split(policy=all, product=512, num_outputs=3) len=55
4 tile_ry: Split(policy=all, product=3, num_outputs=3) len=3
5 tile_rx: Split(policy=all, product=3, num_outputs=3) len=3
6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
7 unroll_explicit: OtherOption([0, 1]) len=2
)
Get devices for measurement successfully!
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.03849625587463379, timestamp=1546032903.7520049) [(‘tile_f’, [16, 8, 1, 4]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [1, 16, 32]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 0)],None,1124083
No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.02529001235961914, timestamp=1546032903.752144) [(‘tile_f’, [16, 1, 8, 4]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [1, 128, 4]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,9966781
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.012875080108642578, timestamp=1546032903.7522757) [(‘tile_f’, [1, 4, 8, 16]), (‘tile_y’, [1, 1, 1, 7]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [16, 8, 4]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 0)],None,3759321
No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.012762069702148438, timestamp=1546032903.7523637) [(‘tile_f’, [1, 64, 8, 1]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [4, 2, 64]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,3453373

cmd: {['nvcc', '--ptx', '-O3', '-arch', 'sm_61', '-ccbin', 'cuda-gcc', '-o', '/tmp/tmpxxthqfo2/my_kernel.ptx', '/tmp/tmpxxthqfo2/my_kernel.cu']}

/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::reference = char&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion ‘__builtin_expect(__n < this->size(), true)’ failed.
No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.019515514373779297, timestamp=1546032903.859914) [(‘tile_f’, [2, 32, 8, 1]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [32, 16, 1]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 1)],None,7565392
No: 6 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.015964508056640625, timestamp=1546032903.8601024) [(‘tile_f’, [2, 1, 128, 2]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [1, 1, 7, 1]), (‘tile_rc’, [4, 128, 1]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2737337
No: 7 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=4, timestamp=1546032906.4737227) [(‘tile_f’, [8, 8, 1, 8]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [1, 1, 1, 7]), (‘tile_rc’, [128, 4, 1]), (‘tile_ry’, [1, 3, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 512), (‘unroll_explicit’, 0)],None,2526839
No: 8 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’),), error_no=1, all_cost=0.050591468811035156, timestamp=1546032905.895398) [(‘tile_f’, [4, 8, 4, 4]), (‘tile_y’, [1, 1, 7, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [1, 64, 8]), (‘tile_ry’, [1, 1, 3]), (‘tile_rx’, [3, 1, 1]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,9215918

cmd: {['nvcc', '--ptx', '-O3', '-arch', 'sm_61', '-ccbin', 'cuda-gcc', '-o', '/tmp/tmpytumlzli/my_kernel.ptx', '/tmp/tmpytumlzli/my_kernel.cu']}

How about LLVM? I suggest rebuilding TVM / LLVM using GCC 7.3 and making GCC 7.3 the default compiler. After that, we don't need to pass -ccbin /usr/bin/cuda-gcc or change nvcc.py. If the issue persists, we can be sure it is not an environment problem and can investigate further.

InstantiationError('Skipped because of invalid gpu kernel') is thrown when the VerifyGPUCode pass fails. You can add some logging in https://github.com/dmlc/tvm/blob/3516cbe0049c7e11ee58afbc668acddb1f110ece/src/pass/verify_gpu_code.cc#L55 to print the exact reason the kernel is invalid.

@FrozenGene,

Found the cause of the issue (a compile flag), but I need to investigate further why that flag is problematic.

@vinix13

I will keep the location of the IR verification in mind and debug further (using the troublesome C flags). Thanks for the hint!

So, I recompiled with host GCC 8 and the nvcc & GCC 7.3 pair as usual, but without the standard Fedora/RedHat-specific additional flags (used by every single .rpm in the distro):

These are the flags that are auto-added in the standard way:

 ~/rpmbuild/BUILD/tvm/build ~/rpmbuild/BUILD/tvm
CFLAGS='
-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 
-Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong 
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
-fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
  • Now TVM's autotune example works fine.
  • For the record, LLVM is 7.0.1 and seems to be fine too.

I believe the stack protection interferes somewhere inside TVM; I need to find out where. Perhaps there is a vector somewhere that is not properly handled (i.e. indexed while empty).

I will try to propose two PRs:

  1. Allow options from userland, e.g. tvm.target.create("cuda", options=["-ccbin", "cuda-gcc"]); see the sketch after this list.

  2. Fix TVM to work even with flags such as -fexceptions and -fstack-protector-strong; I am sure there is something uncovered in the code.
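
A rough illustration of what the first proposal could look like from the user side (hypothetical; tvm.target.create does not accept such an options argument today, which is exactly what the PR would add):

    # Hypothetical usage for proposal 1 -- not an existing TVM API, only the
    # shape of what is being proposed: let the CUDA target carry extra nvcc
    # flags so contrib/nvcc.py never needs to be patched by hand.
    import tvm

    target = tvm.target.create("cuda")                  # what exists today
    # Proposed (hypothetical) form:
    # target = tvm.target.create("cuda", options=["-ccbin", "/usr/bin/cuda-gcc"])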

I have the same problem. Do you mean recompiling LLVM without these CFLAGS?

@snowolfhawk,

No, only TVM itself needs to be compiled without those RedHat/Fedora-specific hardening flags (BTW, those flags are added by the rpmbuild suite).