[RFC][DISCUSS] Non-Recursive AST Visiting

Most of the TVM’s IR infra involves recursive AST visitors and mutators. They are conceptually easy to deal with. However, relying on the recursive visitor also has the potential drawbacks of potential too deep recursion and risk of stack-overflow. One simple way to get around this is to reset the stack size limit; however, that may not be the most natural solution.

Given that TVM has a two-level lowering and each of function’s size is relatively contained as either macro-op or low-level AST for a fixed partition. Recursive visitors are not yet a problem and given that TVM is not at a low level as LLVM, it may not become a big problem. Never-the-less, it is good to bring this potential issue up and discuss possible ways to improve the stack if it becomes a problem.

Benefit of Recursive Visitor

First of all, recursive visitors are relatively developer friendly and are useful tools in major scenarios. It is beneficial to keep some level of support, at least for now. Because of the restricting factor is the depth of the AST, it is helpful to formalize normal form that is not deep in terms of nesting.

In cases like local Expr in the low level, given that we know the maximum depth of local expression won’t be large. Making use of recursive visitor is fine.

Possible Solutions

First of all, we could increase the stack size limit of the program. Many other workarounds to this problem are similar, except they are simulating some part of the stack with the heap(vector). Given that modern OS allocates physical pages lazily, setting the limit to a larger value will unlikely be different from the vector simulation alternative(you only used the page allocated during runtime).

One way to get around recursive mutation is to use the Transform/Rewrite API for most cases.

Expr PostOrderTransform(Expr input, function<Expr (Expr in)> frewrite);

The idea is that the rewriter will visit the AST graph in the post-order (we transform the ancestors of the node before work on the current one), then the rewriter’s job is to rewrite the current node to the desired version. This API will be useful for most cases, except that the rewrite has to be local and stateless.

A recursive visitor is in general good at context-dependent rewriting, because it is relatively easy to populate/remove the context during recursive calls.

Actions

We open this discussion thread to make everyone aware of the issue and discuss possible actions. At this moment we do not have to deal with the problem, and depending on our future state of AST and their potential depth, maybe or maybe not it will become a problem. It is worth to think about this problem.

4 Likes

Thanks for raising this. If TVM are used as a lib in a bigger service, sometimes it’s difficult to bump up strace limit as individual threads can have its own stack set programmably when created. In this case, simulated stack on heap might actually help. I understand that recursion is easier for people to write their own pass. Maybe we can fix a few important ones like visitor of LetStmt.

2 Likes

I run into a stack limitation issue about relay AST, from 9a3d2ec97cba92697a032ec32fecd122a34ecfa5 [NODE][REFACTOR] Rename IRFunctor->NodeFunctor, use func pointer (#4247)

I have a very deep model and it works before, but since this commit, it crashes in a VisitExpr_ call. It turns out it can be worked around by ‘ulimit -s 81960’, so I guess it is a stack overflow issue.

Yes. @yunjing_lh find that our quantized mobilenet v2 has the same issue in GCC 5, QNN lowers too many ops and has deep recursion. One workaround is to increase stack size. However, non-recursive ast visiting should be one better way.

1 Like

se latest thread here Performing Relay Passes Non-Recursively the initial version of non-recursive visitor is already upstreamed