Winograd variable tile_size knob


#1

Discussion around https://github.com/dmlc/tvm/pull/3642 .

@merrymercy,

For the XGB issue, trying to settle and ask:

  • Imagine we have two independent knobs A and B that hold finite lists of values:
  1. If they are truly independent, then XGBoost will be fed constant-sized feature statistics (fea).
  2. If they are dependent on each other, e.g. the list of B changes depending on what we choose from A, then we trouble XGBoost with an altered length and meaning of the fea statistics.

Is this right in the case of XGBoost? If so, our hierarchical knob plan can be tricky: we have to be careful that A and B together always yield the same combinatorial list and the same meaning/mapping of entries in the fea.

What would be your opinion here? Is my statement right?
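To make the concern concrete, here is a toy sketch (the knob names and value lists are hypothetical, and this is not AutoTVM API) of how a dependent knob changes both the size and the meaning of the enumerated space:

```python
# Hypothetical knobs A and B; each config is one slot per knob.
A_vals = [2, 4, 6]

# Independent case: B's list is fixed, so the cross product has a
# constant shape and each slot keeps a stable meaning.
B_vals_independent = [1, 2, 3]
independent_feas = [(a, b) for a in A_vals for b in B_vals_independent]

# Dependent case: B's list changes with the chosen A, so the set of
# (A, B) pairs, and hence the mapping of entries, shifts with A.
B_vals_dependent = {2: [2, 3, 5, 8], 4: [6, 7, 8], 6: [1, 2, 3]}
dependent_feas = [(a, b) for a in A_vals for b in B_vals_dependent[a]]

print(len(independent_feas))  # 9: constant-shaped cross product
print(len(dependent_feas))    # 10: the space itself depends on A
```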


#2

There are two problems.

  1. Dependency between knobs
    Currently, AutoTVM assumes no dependency between knobs. If two knobs A and B are dependent, we have to enumerate all possible values of A and B during space construction. This makes sure that we cover all possible points in the space. It is okay that some points in the space are invalid, because the invalid points will result in compilation or runtime errors and will be dropped by AutoTVM.
    So current AutoTVM works well in this case, but introducing a way to describe dependency between knobs can prune the space and make tuning faster.

  2. Variable lengths of feature vector
    The current XGBTuner with feature_type="itervar" only accepts fixed-length feature vectors, which means all schedules in the search space must have the same loop structure and buffer access pattern (other AST nodes like "if" will be ignored during feature extraction).
    We can always work around this problem by using feature_type="knob". This workaround generally works well. (But doing so makes it a black-box method, which conflicts with our goal of embedding domain knowledge into the machine learning models.)

  • Dependent knobs can yield fixed-length vectors. In the Winograd example, there are some knobs that depend on tile_size (which is also a knob), but they all yield feature vectors of the same length.
  • Independent knobs can yield variable-length vectors. In this case, we have no choice but to use feature_type="knob" or feature_type="curve".
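As a toy illustration of why feature_type="knob" is always fixed-length (this mimics the idea, not the actual AutoTVM feature extraction; the knob names and value lists are made up), each config contributes exactly one entry per knob, regardless of the loop structure it produces:

```python
# Hypothetical knob space: name -> list of allowed values.
knob_space = {
    "tile_size": [2, 4, 6],
    "tile_bnb": [1, 2, 4, 8],
    "yt": [1, 2, 4, 8, 16, 32],
}

def knob_feature(config):
    # One index per knob, so length == number of knobs, always.
    return [knob_space[name].index(config[name]) for name in knob_space]

fea_a = knob_feature({"tile_size": 2, "tile_bnb": 8, "yt": 4})
fea_b = knob_feature({"tile_size": 6, "tile_bnb": 1, "yt": 32})
assert len(fea_a) == len(fea_b) == 3  # fixed length by construction
```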

#3

To summarize, the way current AutoTVM deals with dependent knobs is:

  1. Treat them as independent knobs, but make sure their combinations cover all the points we want.
  2. Drop invalid combinations by detecting compilation or runtime errors.

But introducing a hierarchical tuning space can prune the space in advance and make tuning faster.
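The two steps above can be sketched as follows (the knob names and the validity predicate are hypothetical; in reality invalid points are detected by compilation or runtime failures, not by an explicit check):

```python
# Step 1: enumerate the full cross product of two knobs so every point
# we care about is covered, even if some combinations are invalid.
tile_size_vals = [2, 4, 6]
tile_bnb_vals = [1, 2, 3, 4, 5, 6, 7, 8]

def is_valid(tile_size, tile_bnb):
    # Stand-in constraint for illustration only.
    return tile_bnb <= 2 * tile_size

space = [(t, b) for t in tile_size_vals for b in tile_bnb_vals]

# Step 2: invalid combinations are simply dropped.
valid = [p for p in space if is_valid(*p)]

print(len(space))  # 24 points enumerated
print(len(valid))  # 20 survive the validity check
```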


#4

@merrymercy,

  • As long as we constantly return a single value from the knob and keep the cfg space constant in the number of output parameters, we don't hit the issue mentioned above.

  • I think I found out how to propose tile_bnb (subsquares dependent on P as the input tensor volume/array); I am already getting very good performance on Mali so far. I will re-tune some of the available known kernels from TOPHUB first on Mali, then on ARM, to see/prove the exact yield.

  • Most importantly, on 5x5 and (more or less) 7x7 kernels I am also getting good results.

Example of 5x5 kernel from TOPHUB:
[Task  1/ 2 (1, 48, 35, 35)|(64, 48, 5, 5)] (conv2d) {26.26 GFLOPS /direct} (tophub)
[Task  1/ 2 (1, 48, 35, 35)|(64, 48, 5, 5)] (conv2d) {87.27 GFLOPS /winograd} (retuned)
  • But, to advance with the PR, how about such a 'conditional' knob:

cfg.define_knob('tile_size', [2, 4, 6])
cfg.define_conditional_knob('tile_bnb', {2: [2, 3, 5, 8], 4: [6, 7, 8], 6: [1, 2, 3]}, on='tile_size')

Can I implement it in this conditional_knob form to help out our PR?


#5

I will send two fixes in 4 hours:

  1. Memoize the Winograd matrices (without this, the tuning will be 2x slower).
  2. Fix the hanging issue in feature extraction.
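A rough sketch of what the memoization in item 1 buys (the cache mechanism and matrix construction below are placeholders, not the actual TVM implementation, which has its own memoization utilities and real transform-matrix code):

```python
from functools import lru_cache

calls = {"n": 0}  # count how often the expensive construction runs

@lru_cache(maxsize=None)
def winograd_transform_matrices(tile_size, kernel_size):
    # Hypothetical stand-in for building the A/B/G transform matrices;
    # without caching, this would run once per schedule instantiation.
    calls["n"] += 1
    alpha = tile_size + kernel_size - 1
    return ("A", "B", "G", alpha)

winograd_transform_matrices(4, 3)
winograd_transform_matrices(4, 3)  # served from the cache
assert calls["n"] == 1  # the expensive work ran only once
```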

Then we can accept your PR.

The conditional_knob is not necessary, we can leave it as future work.
Currently, we can use

cfg.define_knob('tile_size', [2, 4, 6])
cfg.define_knob('tile_bnb', [1, 2, 3, 4, 5, 6, 7, 8])  # union of all possible values

to replace

cfg.define_knob('tile_size', [2, 4, 6])
cfg.define_conditional_knob('tile_bnb', {2: [2, 3, 5, 8], 4: [6, 7, 8], 6: [1, 2, 3]}, on='tile_size')
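A quick illustrative check of what the union buys and costs, using the value lists from the snippets above: no point of the conditional form is lost, but the flat space is larger and the extra points must be pruned by failed measurements rather than in advance.

```python
# Value lists taken from the two snippets above.
conditional = {2: [2, 3, 5, 8], 4: [6, 7, 8], 6: [1, 2, 3]}
union = [1, 2, 3, 4, 5, 6, 7, 8]

# Every value the conditional form allows is covered by the union.
assert all(b in union for vals in conditional.values() for b in vals)

union_points = len(conditional) * len(union)                   # 3 * 8 = 24
conditional_points = sum(len(v) for v in conditional.values()) # 4 + 3 + 3 = 10
print(union_points, conditional_points)
```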

#6

@merrymercy,

OK, fine.

I was thinking of the union too; here is my proposal focused on Mali below. The patch below is on top of the current PR.

  • Yes, I touched unroll, yt, t1, t2 on Mali; it yields much better results.
  • Having tile_bnb done the way below assures smooth convergence (very fast, <500 iterations) for any 3x3 kernel.
  • No more issues on 3x3 (such as "cannot prove thread.x") or other kinds.
  • There are still issues on 5x5 and 7x7 (many early "cannot prove thread.x" messages), but they are tunable with excellent final results. For this I would like to revisit the tile_bnb proposal once more, but with conditional_knob.

I have not had time yet to check the ARM CPU side too; as said, I would implement the conditional_knob first.

 --- a/topi/python/topi/mali/conv2d.py
+++ b/topi/python/topi/mali/conv2d.py
@@ -205,7 +205,6 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
         dilation_h, dilation_w = dilation
 
     if len(kernel.shape) == 4:
-
         if dilation_h != 1 or dilation_w != 1:
             kernel = dilate(kernel, (1, 1, dilation_h, dilation_w))
         pre_computed = False
@@ -237,14 +236,52 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
     P = N * nH * nW
 
     ##### space definition begin #####
-    tile_bna_candidates = [1, 2, 4, 8, 16]
+
+    ##
+    ## BNA list generator
+    # maximal range is up to CO (kernel)
+    tile_bnx_candidates = range(1, CO+1)
     factors = get_factors(CO)
-    cfg.define_knob('tile_bna', [x for x in tile_bna_candidates if x in factors])
-    cfg.define_knob('tile_bnb', [1, 2, 4, 8, 16])
-    cfg.define_split('tile_t1', CI, num_outputs=2, max_factor=128)
-    cfg.define_split('tile_t2', CO, num_outputs=2, max_factor=128)
-    cfg.define_split('c_unroll', CI, num_outputs=2, max_factor=8)
-    cfg.define_knob('yt', [1, 2, 4, 8, 16, 32])
+
+    ##
+    ## BNA space
+    # factors of maximal CO (kernel) size
+    cfg.define_knob('tile_bna', [x for x in tile_bnx_candidates if x in factors])
+
+    ##
+    ## BNB list generator
+    #
+    tile_bnb = []
+    # account all tile_sizes
+    for t_size in range(2, 9):
+        # P as the volume of tensor @data
+        # (N=1) means P is just a square
+        nH = (H + t_size-1) // t_size
+        nW = (W + t_size-1) // t_size
+        P = N * nH * nW
+
+        P_list = []
+        # maximal range is up to CO (kernel)
+        for bnb in range(1, max(CO, CI)+1):
+            # bnb as square should fit into P
+            P_round = (P + bnb - 1) // bnb * bnb
+            assert P_round % bnb == 0
+            # search unique P_round//bnb fits
+            if P_round//bnb not in P_list:
+                P_list.append(P_round//bnb)
+                # store the bnb that generated
+                # the unique P_round//bnb subsquare
+                if bnb not in tile_bnb:
+                    tile_bnb.append(bnb)
+    ##
+    ## BNB space
+    # all possible subsquares fitting
+    cfg.define_knob('tile_bnb', tile_bnb)
+
+    cfg.define_split('tile_t1', CI, num_outputs=2, max_factor=256)
+    cfg.define_split('tile_t2', CO, num_outputs=2, max_factor=256)
+    cfg.define_split('c_unroll', CI, num_outputs=2, max_factor=32)
+    cfg.define_knob('yt', [1, 2, 4, 8, 16, 32, 64, 128, 256])
     ##### space definition end #####
 
     if cfg.is_fallback:
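
Extracted from the patch above, the BNB list generator can be exercised as a standalone sketch; the shapes passed in below are made up for illustration:

```python
def gen_tile_bnb(N, H, W, CO, CI):
    """Collect every bnb that produces a new unique P_round // bnb
    across tile sizes 2..8, mirroring the patch above."""
    tile_bnb = []
    for t_size in range(2, 9):
        # P as the volume of tensor @data for this tile size
        nH = (H + t_size - 1) // t_size
        nW = (W + t_size - 1) // t_size
        P = N * nH * nW

        P_list = []
        # maximal range is up to max(CO, CI)
        for bnb in range(1, max(CO, CI) + 1):
            # round P up to a multiple of bnb
            P_round = (P + bnb - 1) // bnb * bnb
            if P_round // bnb not in P_list:
                P_list.append(P_round // bnb)
                # store the bnb that generated a unique subsquare
                if bnb not in tile_bnb:
                    tile_bnb.append(bnb)
    return tile_bnb

# Example shapes (illustrative only, roughly the Task 13 workload below)
candidates = gen_tile_bnb(N=1, H=26, W=26, CO=256, CI=384)
print(len(candidates), candidates[:8])
```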

#7

@merrymercy,

Keen to post this:

  • One outstanding sample, obtained in fewer than 120 iterations with proposed PR #3642 + the patch here addressing P and expanding t1, t2, unroll, yt a bit:
before = 98 GFLOPS
after = 303 GFLOPS
[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) {98.48 GFLOPS /winograd} Current/Best:  288.51/ 303.59 GFLOPS | Progress: (120/2000) | 349.65 s
  • I will re-tune all Mali TOPHUB entries and post results (it will take a while); it seems much better results can be obtained on float32 and float16 too.

#8
  • The proposed scheme is wrong (the results and numbers too). Let's remain with the actual form of our PR. There are good parts in this thread's proposal that improve things, but not by that much; everything will become clear once I re-tune more kernels from TOPHUB and see the exact extent and limits of the improvements.