[RFC] Discuss TVM v0.7 Roadmap

We could consider supporting the Bfloat16 data type natively. More and more hardware supports this type as of this year. Because it is not a standard type in general-purpose programming languages (C++ / NumPy) or traditional compilers (GCC / LLVM), developers face a challenge when writing kernels with it. It would be a huge advantage if TVM could generate efficient Bfloat16 kernels, and TVM is well designed to do this.
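
For context, bfloat16 is just float32 with the low 16 mantissa bits dropped (1 sign bit, 8 exponent bits, 7 mantissa bits), which is why it can be emulated by bit manipulation today. A minimal NumPy sketch of the idea (truncation only, ignoring the round-to-nearest-even that real codegen would need to handle):

```python
import numpy as np

def float32_to_bf16_bits(x):
    # bfloat16 keeps the top 16 bits of a float32
    # (1 sign, 8 exponent, 7 mantissa bits).
    bits = np.asarray(x, dtype=np.float32).reshape(-1).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bf16_bits_to_float32(b):
    # Re-expand by appending 16 zero mantissa bits.
    bits = np.asarray(b, dtype=np.uint16).reshape(-1).astype(np.uint32)
    return (bits << 16).view(np.float32)
```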

Thanks for all the suggestions so far; we will summarize the roadmap near the end of the month. The current goal is to aim for a three-month timeframe (April).

Besides the set of features we want to see, we would love to know what our community members want to work on (either new proposals or some of these existing items). It would help us coordinate and estimate the feasibility of these items.

I will mainly work on the automated quantization project in the next three months; see the RFC for details:

In summary, I hope that with this project we can provide an easy-to-use quantization tool that can be adapted quickly to different models and different hardware.

It would be very helpful to support dynamic batch size.

FYI:

Thanks, everyone, for a great discussion. A draft roadmap is posted here: https://github.com/apache/incubator-tvm/issues/4845

Can you add NVDLA in TVM/VTA as a milestone, too?

I see increasing demand for replacing Relay's recursive visitors/mutators with non-recursive ones (due to the stack limit). Do you think it is doable in v0.7?

I am against this idea. Let me explain.

It is definitely doable using continuation-passing style (CPS) with a trampoline.

However, that requires rewriting the code in a much uglier manner. Also, as most calls are not tail calls, there won't be much memory saving; we would only be trading stack space for discontinuous heap space.
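
For concreteness, here is a toy trampoline in Python (illustrative only, using a hypothetical Node with a .children list): because the counting work is not a tail call, it has to be threaded through explicit continuations, which is exactly the ugliness I mean.

```python
# A step is either a final value or a thunk (zero-argument callable)
# producing the next step; the loop replaces deep native recursion.
def trampoline(step):
    while callable(step):
        step = step()
    return step

def count_nodes(root):
    # Count the nodes of a tree without native recursion. The
    # non-tail work (combining results) lives in continuations.
    def go(nodes, acc, k):
        if not nodes:
            return lambda: k(acc)
        first, rest = nodes[0], nodes[1:]
        return lambda: go(list(first.children), acc + 1,
                          lambda a: go(rest, a, k))
    return trampoline(go([root], 0, lambda n: n))
```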

A better solution, IMO, is to call setrlimit(RLIMIT_STACK, &R);
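
A Python sketch of this workaround (assuming Linux, where the main thread's stack grows on demand up to the soft RLIMIT_STACK; Python's own recursion guard has to be raised too):

```python
import resource
import sys

# Raise the stack soft limit to the hard limit so deep recursive
# passes have room to grow; may fail if the hard limit is capped.
_, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))

# Raise Python's recursion guard as well.
sys.setrecursionlimit(100000)
```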

How is the GNN work going? I am thinking of doing some program optimization for GNN training.

But increasing the stack size might be against internal security policies.

There is the raw IRVisitor, which doesn't recurse. The difficulty is migrating all the passes… I imagine whoever is against setrlimit can help by migrating the passes.

How about we aim to have at least an RFC and the infrastructure landed in the next release cycle?

cc @jroesch @mbrookhart. I agree that it is important to introduce a non-recursive version for most cases, in particular the PostOrderRewriting case, where we can visit the dataflow part of the Expr and use a callback to rewrite it. As long as we manually manage the stack, there won't be a stack overflow problem.
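
A sketch of the manually managed stack for post-order rewriting (Node, .children, and .with_children are hypothetical stand-ins here, not the actual Relay classes):

```python
def post_order_rewrite(root, callback):
    # Iterative post-order over a DAG: each node is pushed twice,
    # first to expand its children, then to be rebuilt and passed
    # to the rewrite callback once all children are rewritten.
    stack = [(root, False)]
    memo = {}  # original node -> rewritten node
    while stack:
        node, children_done = stack.pop()
        if node in memo:
            continue  # shared subexpression already rewritten
        if children_done:
            new_children = [memo[c] for c in node.children]
            memo[node] = callback(node.with_children(new_children))
        else:
            stack.append((node, True))
            for child in node.children:
                if child not in memo:
                    stack.append((child, False))
    return memo[root]
```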

I agree that it is super ugly (and a great amount of work) to migrate, but it is an existential problem when we want to optimize a medium-sized network. I also agree that setrlimit is a good workaround if I am working alone on my personal laptop. However, industrial settings may require a different solution, as @yzhliu has mentioned :slight_smile:

If there is a better approach (less ugly, less work) to manually managing the stack, I would vote for it. So why not think about it :wink:

I am open to a better solution than CPS. I just personally don't know of any.

I hope a name argument can be added to Relay ops. The absence of a name makes debugging difficult and loses the connection with ops in frontend frameworks.

Hi, I forked https://github.com/GaryYuyjl/incubator-tvm/tree/int4tensorcore for int4 computation with Tensor Cores. I found that packing int4 into int32 on the CPU cost too much time, so I wrote the packing step into the conv2d compute & schedule and got good results. But packing the data still takes up at least 30% of the total convolution time. It may be because my compute & schedule code is bad. Do you have any good suggestions for packing data efficiently?
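
For reference, this is the layout I mean, as a NumPy sketch (eight signed int4 values, stored as int8, packed into one int32 with the lowest nibble first; not the actual code from the branch):

```python
import numpy as np

def pack_int4_to_int32(x):
    # x: int8 array of signed int4 values in [-8, 7], with length a
    # multiple of 8. Each value contributes only its low 4 bits.
    x = np.asarray(x, dtype=np.int8).reshape(-1, 8)
    nibbles = x.astype(np.uint32) & np.uint32(0xF)
    shifts = np.arange(8, dtype=np.uint32) * np.uint32(4)
    packed = np.bitwise_or.reduce(nibbles << shifts, axis=1)
    return packed.view(np.int32)
```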