Explore Optimizations for Concat

tqchen · May 5, 2019, 10:20pm

There are several observations by @antinucleon @vinx13 @kevinthesun suggesting that the current way of handling concatenation might not be optimal. Specifically, we currently try to fuse as much as possible and still use the same data-parallel code generators to generate concat. This results in if_then_else or switch expressions that are not necessarily the fastest. Interestingly, some of these can become bottlenecks.

We can at least come up with a few alternatives:

- Mark concat as opaque and directly generate code that copies into the target region
- Skip concat via no-op and see how much difference we can get
- Specially handle concat, making use of Buffer bind semantics to generate a number of kernels that copy directly into the target region
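
To make the contrast concrete, here is a minimal plain-Python sketch (not TVM code; the function names `concat_select` and `concat_copy` are made up for illustration). The first function mimics the fused data-parallel form, where every output element goes through a chain of range checks. The second mimics the "copy into the target region" form, with one branch-free loop per input.

```python
def concat_select(inputs):
    """One loop over the output; each element pays for a chain of range checks."""
    # bounds[k] is the output offset at which inputs[k] starts.
    bounds = [0]
    for a in inputs:
        bounds.append(bounds[-1] + len(a))
    out = [None] * bounds[-1]
    for i in range(len(out)):
        # This inner search plays the role of the nested if_then_else.
        for k, a in enumerate(inputs):
            if bounds[k] <= i < bounds[k + 1]:
                out[i] = a[i - bounds[k]]
                break
    return out


def concat_copy(inputs):
    """One branch-free loop per input, each writing its own output region."""
    out = [None] * sum(len(a) for a in inputs)
    offset = 0
    for a in inputs:
        for j in range(len(a)):
            out[offset + j] = a[j]
        offset += len(a)
    return out
```

Both produce the same result; the difference is only in how much control flow sits inside the innermost loop, which is what the alternatives above try to remove.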

This thread is for discussion, as well as any experimental results people can provide, to see how expensive concat is and what gains we can get from these alternatives.

janimesh · May 7, 2019, 1:28am

Thanks for the post. I also experienced similar problems with concat.

Is it somehow possible to introduce a TensorView kind of abstraction, such that we can pass on a subset of a Tensor instead of creating a new space for the concat operator? I am also thinking of other memory operators like reshape, expand_dims, squeeze, etc. that do not change the memory contents, but only change the way the subsequent operator reads the data.
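
As a rough illustration of the idea, here is a plain-Python sketch of such a view. The class `TensorView` and its methods `reshape` and `narrow0` are hypothetical names, not an existing TVM API, and the sketch assumes contiguous row-major storage and slicing along axis 0 only.

```python
import math
from dataclasses import dataclass


@dataclass
class TensorView:
    storage: list  # shared flat buffer; views never copy it
    offset: int    # index of the first element of this view in storage
    shape: tuple   # logical shape, row-major and contiguous (assumption)

    def _flat(self, idx):
        if not isinstance(idx, tuple):
            idx = (idx,)
        flat = 0
        for i, extent in zip(idx, self.shape):
            assert 0 <= i < extent, "index out of range"
            flat = flat * extent + i
        return self.offset + flat

    def __getitem__(self, idx):
        return self.storage[self._flat(idx)]

    def __setitem__(self, idx, value):
        self.storage[self._flat(idx)] = value

    def reshape(self, shape):
        """Reinterpret the same memory with a new shape; no data moves."""
        assert math.prod(shape) == math.prod(self.shape)
        return TensorView(self.storage, self.offset, tuple(shape))

    def narrow0(self, start, length):
        """View of rows [start, start + length) along axis 0."""
        assert 0 <= start and start + length <= self.shape[0]
        row = math.prod(self.shape[1:])
        return TensorView(self.storage, self.offset + start * row,
                          (length,) + self.shape[1:])
```

With something like this, a producer could be handed `out.narrow0(start, length)` as its destination, so a concat along axis 0 would need no separate copy.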

yinghai · May 7, 2019, 6:34pm

Thanks for raising this. The current solution for Concat is not ideal due to its recursive nature, and it may result in a stack overflow if the number of inputs is large. I saw repeating stack trace patterns like the following for roughly each input:

#166 0x000000000c8f077e in std::function<HalideIR::Internal::Stmt (HalideIR::Internal::LetStmt const*, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>::operator()(HalideIR::Internal::LetStmt const*, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*) const (this=0x7ffff211a420, __args#0=0x7fff3700f020, __args#1=..., __args#2=0x7fff745ea540) at ../libgcc/include/c++/7.3.0/bits/std_function.h:706
#167 0x000000000c8ec4f7 in tvm::IRFunctor<HalideIR::Internal::Stmt (tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>::set_dispatch<HalideIR::Internal::LetStmt>(std::function<HalideIR::Internal::Stmt (HalideIR::Internal::LetStmt const*, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>)::{lambda(tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)#1}::operator()(tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*) const (this=0x7ffff211a420, n=..., args#0=..., args#1=0x7fff745ea540) at tvm/tvm/3rdparty/HalideIR/src/tvm/node/ir_functor.h:108
#168 0x000000000c8f9ef3 in std::_Function_handler<HalideIR::Internal::Stmt (tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*), tvm::IRFunctor<HalideIR::Internal::Stmt (tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>::set_dispatch<HalideIR::Internal::LetStmt>(std::function<HalideIR::Internal::Stmt (HalideIR::Internal::LetStmt const*, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>)::{lambda(tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)#1}>::_M_invoke(std::_Any_data const&, tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*&&) (__functor=..., __args#0=..., __args#1=..., __args#2=@0x7fff74502f18: 0x7fff745ea540) at ../libgcc/include/c++/7.3.0/bits/std_function.h:302
#169 0x000000000c636b74 in std::function<HalideIR::Internal::Stmt (tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>::operator()(tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*) const (this=0x7ffff2031ba0, __args#0=..., __args#1=..., __args#2=0x7fff745ea540) at ../libgcc/include/c++/7.3.0/bits/std_function.h:706
#170 0x000000000c63661d in tvm::IRFunctor<HalideIR::Internal::Stmt (tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>::operator()(tvm::NodeRef const&, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*) const (this=0x2a97f0b0 <tvm::ir::IRMutator::vtable_stmt()::inst>, n=..., args#0=..., args#1=0x7fff745ea540) at tvm/tvm/3rdparty/HalideIR/src/tvm/node/ir_functor.h:76
#171 0x000000000c634745 in tvm::ir::IRMutator::Mutate (this=0x7fff745ea540, stmt=...) at tvm/tvm/include/tvm/ir_mutator.h:44
#172 0x000000000ca52fa8 in tvm::ir::IRUseDefAnalysis::Mutate_ (this=0x7fff745ea540, op=0x7fff3700f050, s=...) at tvm/tvm/src/pass/split_host_device.cc:53
#173 0x0000000012e296ee in tvm::ir::<lambda(const HalideIR::Internal::LetStmt*, const HalideIR::Internal::Stmt&, tvm::ir::IRMutator*)>::operator()(const HalideIR::Internal::LetStmt *, const HalideIR::Internal::Stmt &, tvm::ir::IRMutator *) const (__closure=0x7ffff211a420, op=0x7fff3700f050, s=..., m=0x7fff745ea540) at tvm/tvm/src/pass/ir_mutator.cc:310
#174 0x0000000012e2eb47 in std::_Function_handler<HalideIR::Internal::Stmt(const HalideIR::Internal::LetStmt*, const HalideIR::Internal::Stmt&, tvm::ir::IRMutator*), tvm::ir::<lambda(const HalideIR::Internal::LetStmt*, const HalideIR::Internal::Stmt&, tvm::ir::IRMutator*)> >::_M_invoke(const std::_Any_data &, const HalideIR::Internal::LetStmt *&&, const HalideIR::Internal::Stmt &, tvm::ir::IRMutator *&&) (__functor=..., __args#0=@0x7fff74503158: 0x7fff3700f050, __args#1=..., __args#2=@0x7fff74503148: 0x7fff745ea540) at ../libgcc/include/c++/7.3.0/bits/std_function.h:302
#175 0x000000000c8f077e in std::function<HalideIR::Internal::Stmt (HalideIR::Internal::LetStmt const*, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*)>::operator()(HalideIR::Internal::LetStmt const*, HalideIR::Internal::Stmt const&, tvm::ir::IRMutator*) const (this=0x7ffff211a420, __args#0=0x7fff3700f050, __args#1=..., __args#2=0x7fff745ea540) at ../libgcc/include/c++/7.3.0/bits/std_function.h:706

"Mark concat as opaque and directly generate code that copies into the target region"

Seems like a good candidate solution to me.

kevinthesun · May 7, 2019, 7:44pm

Agree that we need some benchmarking to decide the best solution.

tqchen · May 9, 2019, 3:37pm

It would be great if we could get some volunteers to look into these possibilities.

Laurawly · May 30, 2019, 4:25pm

The concatenation op makes gluoncv SSD hang for a long time (more than 5 minutes) before returning results on a Mali GPU. I made the concatenation ops opaque and the program finished in around 400 ms.

tqchen · June 7, 2019, 6:29pm

I have some new thoughts on how we could attack concat. Ideally, we want to generate several for loops (as many as the length of the input tuple) instead of one for loop that uses selection. Here is how we could possibly achieve this through code transformation.

Specifically, we could introduce an intrinsic tvm_axis_switch (name can be discussed), and we can have loops like

for (i = 0; i < 100; ++i) {
   B[i] = tvm_axis_switch(i, 0, 20, 40, 100, A0[i], A1[i-20], A2[i-40])
}

The semantics are pretty clear: we are trying to concat A0, A1, and A2, and tvm_axis_switch indicates that we are switching on a (possibly) loop variable i and looking into the corresponding ranges.

Then we write a pass, SplitAxisSwitch, which tries to pattern-match this loop pattern and split the loop into several ones. Of course, to keep things simple, we could require that i is indeed a loop variable and that the range matches the range of the serial loop. If the pattern detection fails, we fall back to if_then_else.
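
The intended effect of such a pass can be sketched in plain Python. This only emulates the semantics described above; `axis_switch` is a stand-in for the proposed intrinsic (its exact signature is an assumption), and the branch values are passed as thunks so that only the selected one is evaluated.

```python
def axis_switch(i, bounds, branches):
    """Assumed semantics: return branches[k]() where bounds[k] <= i < bounds[k + 1]."""
    for k, branch in enumerate(branches):
        if bounds[k] <= i < bounds[k + 1]:
            return branch()
    raise IndexError("index outside all switch ranges")


def concat_with_switch(A0, A1, A2):
    """Single loop over the output, selecting the source inside the loop body."""
    b1 = len(A0)
    b2 = b1 + len(A1)
    b3 = b2 + len(A2)
    B = [None] * b3
    for i in range(b3):
        B[i] = axis_switch(
            i, (0, b1, b2, b3),
            (lambda: A0[i], lambda: A1[i - b1], lambda: A2[i - b2]),
        )
    return B


def concat_after_split(A0, A1, A2):
    """What the loop could look like after splitting at the switch boundaries."""
    b1 = len(A0)
    b2 = b1 + len(A1)
    b3 = b2 + len(A2)
    B = [None] * b3
    for i in range(0, b1):
        B[i] = A0[i]
    for i in range(b1, b2):
        B[i] = A1[i - b1]
    for i in range(b2, b3):
        B[i] = A2[i - b2]
    return B
```

With input lengths 20, 20, and 60, the boundaries are 0, 20, 40, 100, matching the loop above. The split is only valid because the switch boundaries exactly tile the loop range, which is the restriction mentioned for the pattern match.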

That means we need a special OpPattern for concat (InputFusableOutputElemwiseFusable), which allows fusing injective ops on the input and elementwise ops on the output, and a special schedule that leaves the concat axis alone (so it remains a simple loop and allows the follow-up optimization).

tqchen · June 7, 2019, 6:31pm

cc @hlu1 @ajtulloch who might also be interested. Would also like to see everyone’s thoughts, and whether anyone is interested in taking a stab at this.

hlu1 · June 7, 2019, 7:00pm

We have found a simple workaround in the case of concatenating 2D tensors (currently our most common use case). By unrolling the last axis, LLVM is smart enough to generate vectorized code, and the performance is even better than the C code in Caffe2. For benchmark numbers, see https://gist.github.com/ajtulloch/d3b47517721c71c09375fd76f387e718 from @ajtulloch.

tqchen · June 13, 2019, 4:07pm

It would be nice if you have the bandwidth to follow up on this.

hlu1 · June 13, 2019, 8:01pm

Sounds good. Will do.

hlu1 · June 18, 2019, 10:59pm

One disadvantage of unrolling concat is that it can increase the compilation time significantly if there are a lot of concats with many inputs. On one model we tested, compilation takes about 10 minutes, instead of several seconds without unrolling. Basically we’re trading compilation time for run time. @tqchen’s tvm_axis_switch approach might be a better alternative overall.

FrozenGene · June 20, 2019, 3:31am

Yes, I have also run into this problem. One workaround may be to limit max_unroll to 16, like our conv2d.py on ARM CPU.

FrozenGene · June 20, 2019, 3:32am

According to my tests, simply marking concat as opaque also works well.

Laurawly · June 28, 2019, 6:50pm

@hlu1 @tqchen Do we have some updates on this?

hlu1 · June 28, 2019, 8:42pm

We decided to go with a memcpy based version (with tvm.extern) internally because it’s simple and works for our use cases (single threaded inference). I’m happy to upstream the implementation if it’s useful for the community.
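
For readers unfamiliar with the approach, a block-copy concat can be sketched with the Python standard library alone. This is not the internal tvm.extern implementation mentioned above, only an illustration of the idea; `concat_memcpy` is a made-up name, and the sketch assumes 1-D float32 inputs held in `array.array("f")`.

```python
import array


def concat_memcpy(inputs):
    """Concatenate 1-D float32 arrays with one contiguous byte copy per input."""
    total = sum(len(a) for a in inputs)
    out = array.array("f", [0.0]) * total
    dst = memoryview(out).cast("B")  # byte-level view of the destination
    offset = 0
    for a in inputs:
        src = memoryview(a).cast("B")
        # Slice assignment between byte views is a single block copy,
        # the Python analogue of memcpy(dst + offset, src, nbytes).
        dst[offset:offset + src.nbytes] = src
        offset += src.nbytes
    return out
```

There is no per-element branching at all: the cost is one copy call per input, which is why this form is attractive for single-threaded CPU inference.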

tqchen · July 6, 2019, 9:41pm

We might want to avoid memcpy, because different device types such as GPUs have different memcpy APIs. We could start with basic extern scripted for loops that work for both GPU and CPU, then move on to support the axis_switch-based version.

sxjscience · August 3, 2019, 5:03pm

I will try to implement the axis_switch solution.

xqdan · April 19, 2022, 1:28pm

Another input for optimizing concat.

See the code below:

code1:

A = op1(in1)
B = op2(in2)
C = concat(A, B)
D = op3(C)

code2:

alloc C
C[0] = op1(in1)
C[1] = op2(in2)
D = op3(C)

With code2, we can make better use of the cache. In code1, C is concatenated in DDR and is new to the cache, so op3 needs to load C from DDR. In code2, C is written by op1 and op2, so C is already in the cache; we save the concat and also get a cache hit.
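
The data flow of the two variants can be sketched in plain Python. The ops here are hypothetical stand-ins (`op1` doubles, `op2` adds one, `op3` sums), and Python itself shows nothing about cache behavior; the sketch only shows that code2 removes the intermediate buffers and the copy.

```python
import array


def op1(src, dst):
    """Hypothetical producer: writes 2 * x into the destination it is given."""
    for j, x in enumerate(src):
        dst[j] = 2.0 * x


def op2(src, dst):
    """Hypothetical producer: writes x + 1 into the destination it is given."""
    for j, x in enumerate(src):
        dst[j] = x + 1.0


def op3(c):
    """Hypothetical consumer: reduces the concatenated buffer."""
    return sum(c)


def pipeline_code1(in1, in2):
    # Separate outputs A and B, then an explicit concat (an extra copy).
    A = array.array("f", [0.0]) * len(in1)
    B = array.array("f", [0.0]) * len(in2)
    op1(in1, A)
    op2(in2, B)
    C = A + B
    return op3(C)


def pipeline_code2(in1, in2):
    # Allocate C once; each producer writes straight into its own region.
    C = array.array("f", [0.0]) * (len(in1) + len(in2))
    view = memoryview(C)
    op1(in1, view[:len(in1)])
    op2(in2, view[len(in1):])
    return op3(C)
```

The key requirement is that each producer accepts a destination region instead of allocating its own output.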

@tqchen @FrozenGene @Hzfengsy

wrongtest · April 20, 2022, 5:35pm

There is an example of splitting the concat axis via loop partition. Together with some in-place schedule primitives, we may be able to arrive at the form suggested by @xqdan.

https://github.com/apache/tvm/blob/58b7a5a268435c34eca36f6c0394d9548b850f98/tests/python/unittest/test_tir_transform_loop_partition.py#L611


    for i1 in T.serial(128, annotations={"pragma_loop_partition_hint": 1}):
        for i2, i3 in T.grid(28, 28):
            if 96 <= i1:
                T_concat[i1 * 784 + i2 * 28 + i3] = placeholder_2[i1 * 784 + i2 * 28 + i3 - 75264]
            if 64 <= i1 and i1 < 96:
                T_concat[i1 * 784 + i2 * 28 + i3] = placeholder_1[i1 * 784 + i2 * 28 + i3 - 50176]
            if i1 < 64:
                T_concat[i1 * 784 + i2 * 28 + i3] = placeholder[i1 * 784 + i2 * 28 + i3]




def test_condition_mutually_exclusive():
    mod = IRModule.from_expr(concat_func_3)
    with tvm.transform.PassContext(config={"tir.LoopPartition": {"partition_const_loop": True}}):
        mod = tvm.tir.transform.FlattenBuffer()(mod)
        mod = tvm.tir.transform.LoopPartition()(mod)
        mod = tvm.tir.transform.Simplify()(mod)
        mod = tvm.tir.transform.RemoveNoOp()(mod)
    assert tvm.ir.structural_equal(mod["main"], partitioned_concat_3)




if __name__ == "__main__":