[VM] Slow Compilation of TF Object Detection Models

We have observed that TF object detection models have large compilation times. For example, SSD MobileNet takes around 15 minutes and Faster R-CNN takes around 88 minutes on an EC2 c5.9xlarge machine (Skylake).

To pinpoint the issue, I profiled all the passes. The following is the breakdown for SSD MobileNet:

#############
Top 10 passes
#############
Pass	Invocations	Total (s)	Avg (s)
LambdaLift	1	0.421	0.4213
Inline	2	0.451	0.2253
AlterOpLayout	1	0.473	0.4728
FuseOps	4565	2.517	0.0006
SimplifyExpr	1	9.701	9.7010
ToANormalForm	4562	18.615	0.0041
InferType	10143	56.481	0.0056
ManifestAlloc	3	134.532	44.8440
FoldConstant	5	478.084	95.6169
EtaExpand	420	484.245	1.1530
#############
Parser	194.345
Total (including parser)	947.380

All times are in seconds. The second column is the number of invocations, the third is total time, and the fourth is average time per invocation.
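For illustration, the kind of per-pass accounting behind this table can be sketched in pure Python. This is a hypothetical harness (the `PassProfiler` name and `wrap` API are made up, not TVM's actual pass instrumentation):

```python
import time
from collections import defaultdict

class PassProfiler:
    """Accumulates invocation count and total wall time per pass name."""

    def __init__(self):
        # name -> [invocations, total_seconds]
        self.stats = defaultdict(lambda: [0, 0.0])

    def wrap(self, name, fn):
        """Return a wrapper around `fn` that records its timing under `name`."""
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                rec = self.stats[name]
                rec[0] += 1
                rec[1] += time.perf_counter() - start
        return timed

    def report(self):
        # Same columns as the table above: pass, invocations, total, average.
        for name, (count, total) in sorted(self.stats.items(),
                                           key=lambda kv: kv[1][1]):
            print(f"{name}\t{count}\t{total:.6f}\t{total / count:.6f}")

profiler = PassProfiler()
# A stand-in "pass" that does nothing; real passes would transform the module.
fold_constant = profiler.wrap("FoldConstant", lambda mod: mod)
for _ in range(5):
    fold_constant({})
profiler.report()
```

Wrapping each pass at its call site like this is what makes nested invocations (e.g. EtaExpand triggered from inside FoldConstant) show up in both rows, which matters for interpreting the numbers below.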

As you can see, FoldConstant and EtaExpand take the majority of the time. FoldConstant is a function pass, so it is called for every function in the module. Each FoldConstant call creates an interpreter, which in turn calls EtaExpand, so EtaExpand's time is counted twice: once on its own and once inside FoldConstant.

The main culprit is CreateInterpreter in the FoldConstant pass. CreateInterpreter makes a copy of almost the whole module. TF SSD models are pretty big, so each copy is expensive. But the real slowdown comes from calling CreateInterpreter again and again, once for each function in the module.
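The cost structure can be illustrated with a toy model (all names here are stand-ins, not real TVM APIs): constructing a fresh interpreter per function means O(n_funcs) full-module copies, whereas constructing it once amortizes the copy across all functions.

```python
import copy

def make_module(n_funcs, func_size=200):
    # Toy stand-in for a Relay module: a dict of "functions".
    return {f"fn{i}": list(range(func_size)) for i in range(n_funcs)}

class Interpreter:
    created = 0  # counts how many times the expensive constructor ran

    def __init__(self, module):
        # Stand-in for CreateInterpreter: copies almost the whole module.
        Interpreter.created += 1
        self.module = copy.deepcopy(module)

    def evaluate(self, name):
        return len(self.module[name])

def fold_per_function(module):
    # The slow path: a fresh interpreter for every function,
    # i.e. one full-module copy per function.
    return [Interpreter(module).evaluate(name) for name in module]

def fold_shared(module):
    # The cheap alternative: construct once, reuse for every function.
    interp = Interpreter(module)
    return [interp.evaluate(name) for name in module]

mod = make_module(50)

Interpreter.created = 0
a = fold_per_function(mod)
per_fn = Interpreter.created      # 50 constructions

Interpreter.created = 0
b = fold_shared(mod)
shared = Interpreter.created      # 1 construction

print(per_fn, shared, a == b)     # 50 1 True
```

Both strategies produce identical results; only the number of module copies differs, which is exactly the kind of overhead that dominates for a large TF SSD module.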

@zhiics @haichen @kevinthesun @masahi


Thanks @anijain2305 for the investigation.

cc @jroesch @MarisaKirisame @weberlo Not sure if you have seen the same problem or if you have any comment.

The slow parsing time could be due to repeated calls to infer_value and infer_type in the frontend. See the discussion in Incremental Type Propagation.

The major portion of the time still comes from VM compilation.

Yes, I don’t know what is happening in the VM, but I wonder whether we are computing the same constant or inferring the same type over and over again (similar to what is happening in the frontend).

@masahi, what you are suggesting might be happening here. I think we are making a copy and performing type inference again and again.

The subgraphs that the VM's constant folding sees are big. They might share large portions, and we might be throwing away all that type-inference work every time we add a new function to the module.
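One way to avoid re-inferring unchanged functions would be to cache inference results per function and only re-run inference when a function's body changes. A minimal sketch of that idea, with a made-up `IncrementalTypeChecker` (the hashing scheme and the "type" it computes are placeholders, not Relay's actual type inference):

```python
class IncrementalTypeChecker:
    """Caches per-function inference results; re-infers only changed bodies."""

    def __init__(self):
        self.cache = {}    # name -> (body_hash, inferred_type)
        self.inferred = 0  # how many real inference runs happened

    def infer(self, name, body):
        key = hash(body)
        hit = self.cache.get(name)
        if hit is not None and hit[0] == key:
            return hit[1]                    # cache hit: no re-inference
        self.inferred += 1
        ty = f"fn({len(body)} exprs)"        # stand-in for real inference
        self.cache[name] = (key, ty)
        return ty

checker = IncrementalTypeChecker()
module = {"main": ("a", "b"), "helper": ("c",)}

# First full pass over the module: every function is inferred once.
for fn, body in module.items():
    checker.infer(fn, body)

# Adding one function and re-checking the whole module only costs
# one extra inference; the unchanged functions hit the cache.
module["new_fn"] = ("d", "e")
for fn, body in module.items():
    checker.infer(fn, body)

print(checker.inferred)  # 3, not 5
```

With the current non-incremental behavior, the second loop would re-infer all three functions, which is where the blow-up in InferType invocations in the tables above comes from.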

I have updated the numbers and the text in the original post. The passes are now measured once per module (earlier they were measured for each function in the module, which inflated the total invocation counts).

I pushed a PR that cuts down the compilation time significantly, but it needs discussion on whether there are other higher-level issues: https://github.com/apache/incubator-tvm/pull/6195

Thanks all. This major slowdown is addressed in the above PR.

Overall, the situation has improved considerably, but compilation is still slow from a TVM user's perspective. For example, Mask R-CNN and Faster R-CNN take over 20 minutes. This time the bottlenecks are pretty clear: the top two contributors are the TF parser and the ManifestAlloc pass (possibly because the latter is implemented in Python). Printing the stats for mask_rcnn:

#############
Top 10 passes
#############
Pass	Invocations	Total (s)	Avg (s)
Inline	2	1.070	0.5350
EliminateCommonSubexpr	1	1.276	1.2764
tir.MakePackedAPI	265	1.747	0.0066
FuseOps	7894	3.503	0.0004
FoldConstant	5	15.118	3.0235
EtaExpand	7890	18.173	0.0023
ToANormalForm	7890	19.813	0.0025
SimplifyExpr	2	35.768	17.8840
InferType	18823	175.716	0.0093
ManifestAlloc	3	249.875	83.2917
#############
Parser	442.874
Total (including parser)	1257.197