AutoTVM tutorial produces 21 conv2d schedules for "llvm" and only 12 conv2d schedules for "cuda"

Hi everyone,

When I run the tutorial "Auto-tuning a convolutional network for x86 CPU", the optimal log contains 21 lines, one for each of the 21 convolutions in ResNet-18. However, when I run the tutorial "Auto-tuning a convolutional network for NVIDIA GPU", the optimal log contains only 12 lines for 12 convolutions, even though it is the same network (or so I understand). Does anybody know why this happens?

In addition, do you know why there are no schedules for the other layers? Are they not being tuned, or are they perhaps fused with the convolutions?
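(For context, this is roughly how I am counting the entries; a minimal sketch that just parses the JSON records in the log, where the file name is a placeholder for the log produced by the tutorial.)

```python
# Hypothetical sketch: count records and distinct conv2d workloads in an
# AutoTVM log. Each line is a JSON record whose "i" field starts with
# [target, task_name, task_args, ...]. "resnet18_opt.log" is a placeholder.
import json

with open("resnet18_opt.log") as f:
    records = [json.loads(line) for line in f if line.strip()]

workloads = {json.dumps(rec["i"][1:3]) for rec in records}  # task name + args

print("total records:   ", len(records))
print("unique workloads:", len(workloads))
```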

I would appreciate any help with this issue.

Could you post the workloads that only appear for the LLVM target?

Hi, thanks a lot for your prompt response. I have modified the description above to be more specific about the tutorials I am running.

Please find the optimal configurations for "llvm" (ResNet-18) below. I have enumerated the layers for convenience. One difference, for instance, is that for "llvm" there are 3 schedules corresponding to the 3 convolutions with input size [1, 512, 7, 7], whereas in the CUDA log (shown in my reply below this one) there is only 1 schedule for that convolution size.

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 3, 224, 224, "float32"], [64, 3, 7, 7, "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "float32"], {"i": 136, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 1]], ["tile_oc", "sp", [-1, 32]], ["tile_ow", "sp", [-1, 1]], ["unroll_kw", "ot", false]]}], "r": [[0.01014757672815534], 0, 2.5155513286590576, 1581436858.4419928], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 82, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009950462727272727], 0, 2.7125132083892822, 1581437816.7206779], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 82, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009950462727272727], 0, 2.7125132083892822, 1581437816.7206779], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 1, 1], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 1, 1, "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"i": 425, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 1]], ["tile_oh", "ot", 2]]}], "r": [[0.0011312621484992102], 0, 3.207817316055298, 1581441951.768949], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 82, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009950462727272727], 0, 2.7125132083892822, 1581437816.7206779], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 82, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009950462727272727], 0, 2.7125132083892822, 1581437816.7206779], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [128, 64, 3, 3, "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 89, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.005048907204472844], 0, 4.021136522293091, 1581443233.6869035], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [128, 128, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 101, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009989094614814816], 0, 3.1271796226501465, 1581446749.781529], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [128, 64, 1, 1, "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"i": 368, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 16]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 1]], ["tile_oh", "ot", 2]]}], "r": [[0.0005921782860057119], 0, 3.293405532836914, 1581446003.794353], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [128, 128, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 101, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009989094614814816], 0, 3.1271796226501465, 1581446749.781529], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [128, 128, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 101, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.009989094614814816], 0, 3.1271796226501465, 1581446749.781529], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [256, 128, 3, 3, "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 109, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 32]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 2]], ["unroll_kw", "ot", true]]}], "r": [[0.005018088178321678], 0, 3.9786384105682373, 1581449992.5711432], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [256, 256, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 188, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", true]]}], "r": [[0.009956693457142857], 0, 3.0830090045928955, 1581453318.564188], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [256, 128, 1, 1, "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"i": 324, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 16]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 1]], ["tile_oh", "ot", 2]]}], "r": [[0.0005877056827348746], 0, 3.120971918106079, 1581451718.8096752], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [256, 256, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 188, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", true]]}], "r": [[0.009956693457142857], 0, 3.0830090045928955, 1581453318.564188], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [256, 256, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 188, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", true]]}], "r": [[0.009956693457142857], 0, 3.0830090045928955, 1581453318.564188], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [512, 256, 3, 3, "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 305, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 8]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", false]]}], "r": [[0.005143994603278689], 0, 4.259227514266968, 1581454611.755201], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [512, 512, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 128, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", true]]}], "r": [[0.009949430639639639], 0, 2.7014334201812744, 1581456113.8993032], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [512, 256, 1, 1, "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"i": 223, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 128]], ["tile_oc", "sp", [-1, 16]], ["tile_ow", "sp", [-1, 1]], ["tile_oh", "ot", 2]]}], "r": [[0.0005641878775181305], 0, 3.1547939777374268, 1581454752.9796696], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [512, 512, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 128, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", true]]}], "r": [[0.009949430639639639], 0, 2.7014334201812744, 1581456113.8993032], "v": 0.1}

{"i": ["llvm", "topi_x86_conv2d_NCHWc", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [512, 512, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"i": 128, "c": null, "t": "direct", "e": [["tile_ic", "sp", [-1, 256]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", true]]}], "r": [[0.009949430639639639], 0, 2.7014334201812744, 1581456113.8993032], "v": 0.1}

For CUDA, same ResNet-18. It only returns schedules for 12 layers (I wonder why). The inference time is very close to what is shown on the website (1.15 ms).

{"r": [[5.301165607765277e-05], 0, 3.416822910308838, 1581603631.1435277], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [512, 512, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 16, 4]], ["tile_x", "sp", [-1, 1, 8, 2]], ["tile_rc", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]], "c": null, "t": "winograd", "i": 190206}]}

{"r": [[1.3176676375972394e-05], 0, 36.70834970474243, 1581605845.2810543], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [512, 256, 1, 1, "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 2, 16, 1]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 1, 7, 1]], ["tile_rc", "sp", [-1, 16]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]], "c": null, "t": "direct", "i": 79235}]}

{"r": [[8.606766642780366e-05], 0, 3.5750463008880615, 1581607432.4720902], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [512, 256, 3, 3, "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 2, 4, 2]], ["tile_y", "sp", [-1, 1, 7, 1]], ["tile_x", "sp", [-1, 1, 1, 1]], ["tile_rc", "sp", [-1, 8]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]], "c": null, "t": "direct", "i": 612993}]}

{"r": [[3.844542276161389e-05], 0, 31.977830171585083, 1581609095.2137506], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [256, 256, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 2, 16, 2]], ["tile_x", "sp", [-1, 7, 7, 1]], ["tile_rc", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]], "c": null, "t": "winograd", "i": 37032}]}

{"r": [[7.893610215623896e-06], 0, 10.36836051940918, 1581611070.8119206], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [256, 128, 1, 1, "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 4, 16, 1]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 1, 14, 1]], ["tile_rc", "sp", [-1, 16]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]], "c": null, "t": "direct", "i": 865952}]}

{"r": [[4.448707873573967e-05], 0, 38.539387226104736, 1581613358.0153213], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [256, 128, 3, 3, "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 2, 8, 1]], ["tile_y", "sp", [-1, 1, 2, 1]], ["tile_x", "sp", [-1, 1, 7, 2]], ["tile_rc", "sp", [-1, 16]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]], "c": null, "t": "direct", "i": 7970845}]}

{"r": [[3.138099692142953e-05], 0, 13.580451726913452, 1581616193.9938211], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [128, 128, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 4]], ["tile_x", "sp", [-1, 7, 28, 1]], ["tile_rc", "sp", [-1, 32]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]], "c": null, "t": "winograd", "i": 447559}]}

{"r": [[6.707924643584522e-06], 0, 15.57893967628479, 1581618869.209506], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [128, 64, 1, 1, "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 4, 8, 1]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 2, 14, 1]], ["tile_rc", "sp", [-1, 8]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]], "c": null, "t": "direct", "i": 3340823}]}

{"r": [[3.371866992843581e-05], 0, 3.500437021255493, 1581621338.547618], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [128, 64, 3, 3, "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 1, 16, 4]], ["tile_y", "sp", [-1, 1, 2, 1]], ["tile_x", "sp", [-1, 1, 7, 2]], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]], "c": null, "t": "direct", "i": 26036002}]}

{"r": [[7.36766881134491e-06], 0, 8.708645343780518, 1581623659.4519675], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 1, 1], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 1, 1, "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 8, 8, 1]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 2, 28, 1]], ["tile_rc", "sp", [-1, 8]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]], "c": null, "t": "direct", "i": 20616981}]}

{"r": [[2.785163836772983e-05], 0, 46.4191677570343, 1581626071.820149], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "float32"], {"e": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 2, 8, 2]], ["tile_x", "sp", [-1, 7, 28, 1]], ["tile_rc", "sp", [-1, 32]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]], "c": null, "t": "winograd", "i": 279680}]}

{"r": [[3.802579433070866e-05], 0, 9.277168035507202, 1581628867.5058868], "v": 0.1, "i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 3, 224, 224, "float32"], [64, 3, 7, 7, "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "float32"], {"e": [["tile_f", "sp", [-1, 2, 8, 4]], ["tile_y", "sp", [-1, 8, 1, 1]], ["tile_x", "sp", [-1, 1, 14, 1]], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 7]], ["tile_rx", "sp", [-1, 7]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]], "c": null, "t": "direct", "i": 23438078}]}

If you print out all extracted tasks before tuning, you will find that ResNet-18 includes 12 tasks (unique conv2d workloads), so the CUDA log makes sense. Note that conv2d ops with the same shapes and attributes are extracted only once.
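For example, something along these lines, following the tutorials' extraction step (a rough sketch; the exact `ops` argument and import paths vary a bit across TVM versions):

```python
from tvm import autotvm, relay
from tvm.relay import testing

# Same ResNet-18 workload the tutorials build (batch size 1, float32).
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1, dtype="float32")
target = "cuda"  # or "llvm" for the x86 tutorial

# Extract the tunable conv2d tasks; layers with identical workloads
# collapse into a single task.
tasks = autotvm.task.extract_from_program(
    mod["main"], target=target, params=params,
    ops=(relay.op.get("nn.conv2d"),),
)

# For ResNet-18 this should list 12 distinct conv2d workloads,
# matching the 12 lines in the CUDA log.
for i, task in enumerate(tasks):
    print(i, task.name, task.args)
```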

I guess the reason you got 3 schedules for the same conv2d workload in the LLVM log is graph tuning. Graph tuning may select different schedules for two different conv2d layers even if they have the same shapes and attributes, because it takes the data layout transform overhead into account.

cc @kevinthesun

For x86, graph tuning generates one optimal schedule for each conv2d layer, while for GPU only each distinct conv2d workload gets a schedule, since graph tuning is not required.
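The x86 tutorial's graph-tuning step looks roughly like the sketch below (constructor arguments and the expected `target_ops` format differ across TVM versions; `mod`, `records`, `input_name`, `dshape`, `target`, and `opt_sch_file` are assumed to be set up as in the tutorial):

```python
from tvm import relay
from tvm.autotvm.graph_tuner import DPTuner

# Rough sketch of the x86 tutorial's graph-tuning step. It writes one record
# per conv2d *layer* (21 for ResNet-18), even when several layers share the
# same workload, because layout-transform costs between layers can differ.
def tune_graph(mod, input_name, dshape, records, opt_sch_file, target):
    target_ops = [relay.op.get("nn.conv2d")]
    executor = DPTuner(mod["main"], {input_name: dshape}, records, target_ops, target)
    executor.benchmark_layout_transform(min_exec_num=2000)
    executor.run()
    executor.write_opt_sch_file(opt_sch_file)
```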

Thank you very much for your responses; now I understand the reason behind the differences :slight_smile: