Autotvm skips depthwise_conv2d_nchw workloads for llvm x86_64

apivovarov · July 17, 2019, 2:32am

I tried to tune a model for an Intel x86_64 CPU with target=llvm.
AutoTVM skips the depthwise_conv2d_nchw workloads and generates invalid log lines.

Each depthwise_conv2d_nchw task finishes after a single trial (1 of 1000), so tuning takes only 1-2 seconds per task:

----------New Workloads---------------
('depthwise_conv2d_nchw', (1, 960, 9, 9, 'float32'), (960, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 576, 15, 15, 'float32'), (576, 1, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 576, 16, 16, 'float32'), (576, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 384, 16, 16, 'float32'), (384, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 192, 29, 29, 'float32'), (192, 1, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 192, 30, 30, 'float32'), (192, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 144, 57, 57, 'float32'), (144, 1, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 144, 58, 58, 'float32'), (144, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 96, 113, 113, 'float32'), (96, 1, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 32, 114, 114, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32')
--------------------------------------
Total: 10
Tuning...
[Task  1/10]  Current/Best:   14.69/  14.69 GFLOPS | Progress: (1/1000) | 1.18 s Done.
[Task  2/10]  Current/Best:    4.07/   4.07 GFLOPS | Progress: (1/1000) | 1.11 s Done.
[Task  3/10]  Current/Best:   13.57/  13.57 GFLOPS | Progress: (1/1000) | 1.21 s Done.
[Task  4/10]  Current/Best:    5.91/   5.91 GFLOPS | Progress: (1/1000) | 1.11 s Done.
[Task  5/10]  Current/Best:   14.96/  14.96 GFLOPS | Progress: (1/1000) | 1.14 s Done.
[Task  6/10]  Current/Best:    2.28/   2.28 GFLOPS | Progress: (1/1000) | 1.11 s Done.
[Task  7/10]  Current/Best:    5.76/   5.76 GFLOPS | Progress: (1/1000) | 1.15 s Done.
[Task  8/10]  Current/Best:    3.71/   3.71 GFLOPS | Progress: (1/1000) | 1.17 s Done.
[Task  9/10]  Current/Best:    4.51/   4.51 GFLOPS | Progress: (1/1000) | 1.27 s Done.
[Task 10/10]  Current/Best:    4.68/   4.68 GFLOPS | Progress: (1/1000) | 1.08 s Done.
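Every task above stops after one measurement out of 1000 requested, which is what you would expect if each task's search space contains only a single configuration. A small sketch that flags such tasks from the tuner's console output, assuming the progress-line format shown above (the regex and the `single_trial_tasks` helper are illustrative, not part of AutoTVM):

```python
import re

# Assumed format, taken from the console output above, e.g.
# "[Task  1/10]  Current/Best:   14.69/  14.69 GFLOPS | Progress: (1/1000) | 1.18 s Done."
PROGRESS_RE = re.compile(r"\[Task\s+(\d+)/\s*(\d+)\].*?Progress:\s*\((\d+)/(\d+)\)")


def single_trial_tasks(console_output):
    """Return the indices of finished tasks that measured only one config."""
    flagged = []
    for line in console_output.splitlines():
        match = PROGRESS_RE.search(line)
        if match is None:
            continue  # not a progress line
        task_idx, _total_tasks, done, requested = (int(g) for g in match.groups())
        # A task that is "Done." after 1 of N requested trials most likely
        # had nothing else left to try in its search space.
        if done == 1 and requested > 1 and line.rstrip().endswith("Done."):
            flagged.append(task_idx)
    return flagged
```

Fed the ten progress lines above, this would flag all ten tasks.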

As a result, it generates invalid log lines (every record has an empty config, "e": []):

{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 32, 114, 114], "float32"], ["TENSOR", [32, 1, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 32, 114, 114, "float32"], [32, 1, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0004918456], 0, 0.565997838973999, 1563309733.8285108], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 96, 113, 113], "float32"], ["TENSOR", [96, 1, 3, 3], "float32"], [2, 2], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 96, 113, 113, "float32"], [96, 1, 3, 3, "float32"], [2, 2], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0013306298], 0, 0.5449604988098145, 1563309735.648528], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 144, 58, 58], "float32"], ["TENSOR", [144, 1, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 144, 58, 58, "float32"], [144, 1, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0005992043], 0, 0.5612428188323975, 1563309737.5331435], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 144, 57, 57], "float32"], ["TENSOR", [144, 1, 3, 3], "float32"], [2, 2], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 144, 57, 57, "float32"], [144, 1, 3, 3, "float32"], [2, 2], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.00034402959999999997], 0, 0.5536956787109375, 1563309739.319174], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 192, 30, 30], "float32"], ["TENSOR", [192, 1, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 192, 30, 30, "float32"], [192, 1, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.00018112869999999999], 0, 0.5282695293426514, 1563309741.111374], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 192, 29, 29], "float32"], ["TENSOR", [192, 1, 3, 3], "float32"], [2, 2], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 192, 29, 29, "float32"], [192, 1, 3, 3, "float32"], [2, 2], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0002968735], 0, 0.5278668403625488, 1563309742.868396], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 384, 16, 16], "float32"], ["TENSOR", [384, 1, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 384, 16, 16, "float32"], [384, 1, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0002350044], 0, 0.5902194976806641, 1563309744.7191393], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 576, 16, 16], "float32"], ["TENSOR", [576, 1, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 576, 16, 16, "float32"], [576, 1, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.000547472], 0, 0.5910224914550781, 1563309746.4457006], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 576, 15, 15], "float32"], ["TENSOR", [576, 1, 3, 3], "float32"], [2, 2], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 576, 15, 15, "float32"], [576, 1, 3, 3, "float32"], [2, 2], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0001127131], 0, 0.6004319190979004, 1563309748.3819141], "v": 0.1}
{"i": ["llvm", "topi_nn_depthwise_conv2d_nchw", [["TENSOR", [1, 960, 9, 9], "float32"], ["TENSOR", [960, 1, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "float32"], {}, ["depthwise_conv2d_nchw", [1, 960, 9, 9, "float32"], [960, 1, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "float32"], {"i": 0, "t": "direct", "c": null, "e": []}], "r": [[0.0001809171], 0, 0.4957103729248047, 1563309750.1361008], "v": 0.1}
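A quick way to catch records like these before compiling is to scan the log for entries whose config entity list ("e") is empty. A minimal sketch using only the standard library; it assumes the record layout visible above, where "i" holds [target, task name, args, kwargs, workload, config], and the `empty_config_records` name is hypothetical:

```python
import json


def empty_config_records(log_lines):
    """Return (task_name, op_name) for every log record with no tuned knobs."""
    flagged = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout of record["i"], as seen in the log above:
        # [target, task_name, args, kwargs, workload, config]
        _target, task_name, _args, _kwargs, workload, config = record["i"]
        if not config.get("e"):
            # An empty entity list means knobs such as "tile_ic" were never recorded
            flagged.append((task_name, workload[0]))
    return flagged
```

Run over the ten lines above, it would flag every record, since all of them carry "e": [].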

If I run the evaluation script with these log lines, I get the following error: KeyError: 'tile_ic'

  File "/usr/local/lib/python3.6/dist-packages/topi-0.6.dev0-py3.6.egg/topi/x86/conv2d.py", line 441, in _alter_conv2d_layout
    ic_bn, oc_bn = cfg["tile_ic"].size[-1], cfg["tile_oc"].size[-1]
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/autotvm/task/space.py", line 773, in __getitem__
    return self._entity_map[name]
KeyError: 'tile_ic'
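The traceback is consistent with those empty configs: `_alter_conv2d_layout` asks the config for its `tile_ic` knob, and with "e": [] there is nothing to look up. The deliberately simplified mock below (`MockConfigEntity` is hypothetical, not TVM's class) reproduces only that lookup, assuming each "e" entry is stored as [knob name, knob type, value]:

```python
class MockConfigEntity:
    """Hypothetical stand-in for an AutoTVM config entity; not TVM's real class."""

    def __init__(self, entity_map):
        self._entity_map = entity_map

    @classmethod
    def from_log_config(cls, config):
        # Assumption: each "e" entry looks like [knob_name, knob_type, value]
        return cls({name: value for name, _kind, value in config["e"]})

    def __getitem__(self, name):
        # Same lookup as in the traceback: a missing knob raises KeyError
        return self._entity_map[name]


# Config copied from the invalid log lines above
cfg = MockConfigEntity.from_log_config({"i": 0, "t": "direct", "c": None, "e": []})
try:
    cfg["tile_ic"]
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 'tile_ic'
```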

I also tried running AutoTVM for ARM/Mali (target=tvm.target.mali('rk3399')), and it works fine.
The issue occurs on plain llvm x86_64.

I tried older TVM versions. The results are below:
Wed Jul 3 10:08:04 2019 - 287078c33db85d4f312d8d2457a064442d9d18c3 - Bad
Sun Jun 30 18:05:48 2019 - 6c81d784dc9459d684604fcf4190fda4cb956c1c - Bad
Wed Jun 19 20:29:12 2019 - 05a5c170930ef649d6f196950e680ca16d30d07a - Bad
Mon Jun 10 21:26:15 2019 - 7c1c97d2d8d0a99c752d43f95d92618b62b1f015 - Bad
Sat Jun 8 20:56:58 2019 - a4bc50ebfffe034490330a781ac077d958d43286 - Bad
Thu Jun 6 11:41:50 2019 - d7bc4fdd4789a730b7aadcaf441c3d50b9863f60 - Bad
Thu Jun 6 21:00:19 2019 - 770ac84e74a5d0cb174c1a5402f0752a5a8fbecb - OK
Wed Jun 5 22:03:12 2019 - 5999f7a6d8e174026b35dc938bb11442ffae6995 - OK
Fri May 31 19:42:15 2019 - f6acf2e5f51f9ac48f8d13e095805b7fe3f74bcf - OK

So, the issue was introduced in PR https://github.com/dmlc/tvm/pull/3264
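The version list above is effectively a manual bisection: the last OK commit (770ac84) and the first Bad one (d7bc4fd) bracket the change. `git bisect` automates this search; the underlying binary search can be sketched in Python as follows, assuming the commits are ordered oldest to newest (the reverse of the list above) and that every commit after the first bad one is also bad. `is_bad` is a hypothetical callback standing in for "build this commit, tune, and check the log":

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes the results are monotone: once a commit is bad, all later ones are.
    Returns None if the list is empty or even the newest commit is good.
    """
    if not commits or not is_bad(commits[-1]):
        return None
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid      # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo]


# Results from the list above, oldest first (True means "Bad")
results = {
    "f6acf2e": False, "5999f7a": False, "770ac84": False,
    "d7bc4fd": True, "a4bc50e": True, "7c1c97d": True,
    "05a5c17": True, "6c81d78": True, "287078c": True,
}
print(first_bad_commit(list(results), results.get))  # prints d7bc4fd
```

Each probe costs a full build and tuning run, which is why halving the range matters more than the few extra lines of bookkeeping.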

I opened a new issue: https://github.com/dmlc/tvm/issues/3557

apivovarov · July 17, 2019, 8:33pm

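I forgot to include the task replacement in my code. The extracted conv2d and depthwise_conv2d_nchw tasks have to be converted to their x86 NCHWc templates before tuning: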

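        from tvm import autotvm

        # `tasks` comes from autotvm.task.extract_from_program(...) and
        # `target` is the same llvm target that was used for extraction
        for tsk in tasks:
            # converting conv2d tasks to conv2d_NCHWc tasks
            op_name = tsk.workload[0]
            if op_name == 'conv2d':
                func_create = 'topi_x86_conv2d_NCHWc'
            elif op_name == 'depthwise_conv2d_nchw':
                func_create = 'topi_x86_depthwise_conv2d_NCHWc_from_nchw'
            else:
                raise ValueError("Tuning {} is not supported on x86".format(op_name))

            task = autotvm.task.create(func_create, args=tsk.args,
                                       target=target, template_key='direct')
            # keep the original workload so the log records match the graph's ops
            task.workload = tsk.workload
            # ... create a tuner for `task` and tune it as usual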

dominikstiller · August 12, 2019, 6:14pm

I am having exactly the same issue. Has this been resolved?

kevinthesun · August 12, 2019, 10:02pm

Have you tried the solution from @apivovarov?

dominikstiller · August 12, 2019, 10:37pm

No, sorry, I hadn't. That did indeed fix the problem.

Would it make sense to do that task replacement in autotvm.task.extract_from_program or some other helper method, since it seems to be necessary every time we want to autotune a depthwise_conv2d on x86?
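One possible shape for such a helper, sketched under assumptions: `convert_tasks_for_x86` is hypothetical (not an existing TVM API), the op-to-template mapping is copied from the snippet earlier in this thread, and `create` is expected to behave like `autotvm.task.create`. Passing `create` in keeps the helper testable without TVM installed:

```python
# Op name -> x86 tuning template, copied from the snippet earlier in this thread
X86_TEMPLATES = {
    'conv2d': 'topi_x86_conv2d_NCHWc',
    'depthwise_conv2d_nchw': 'topi_x86_depthwise_conv2d_NCHWc_from_nchw',
}


def convert_tasks_for_x86(tasks, target, create):
    """Hypothetical helper: convert extracted tasks to x86 NCHWc tasks.

    `create` should accept the same arguments as autotvm.task.create.
    """
    converted = []
    for tsk in tasks:
        op_name = tsk.workload[0]
        if op_name not in X86_TEMPLATES:
            raise ValueError("Tuning {} is not supported on x86".format(op_name))
        task = create(X86_TEMPLATES[op_name], args=tsk.args,
                      target=target, template_key='direct')
        # keep the original workload so the log records match the graph's ops
        task.workload = tsk.workload
        converted.append(task)
    return converted
```

With TVM available, it would be called once before the tuning loop, e.g. `tasks = convert_tasks_for_x86(tasks, target, autotvm.task.create)`.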