Running time of the same workload (same op) varies greatly across different networks


I reproduced the code from the blog to test BERT model performance on CPU, and the results are very impressive.

Then I profiled the generated lib with tvm.contrib.debugger.debug_runtime to inspect the run time of each final operator. Part of the result is listed below:

Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs
--- | --- | --- | --- | --- | --- | ---
fused_nn_dense_add_25 | fused_nn_dense_add_2 | 390.445 | 5.959 | (32, 3072) | 3 | 1

fused_nn_dense_add_2 is a fused op (matmul + bias-add).

Then I wrote a simple model containing only one matmul + bias-add with the same workload. Its profiling result:

Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs
--- | --- | --- | --- | --- | --- | ---
fused_nn_dense_add | fused_nn_dense_add | 145.383 | 100.0 | (32, 3072) | 3 | 1

Both tasks ran on the same machine with the same settings (OMP_NUM_THREADS=20 numactl -m0 -N0).


Why does the run time of the same op with the same workload vary so much (390 us vs 145 us)? Where does the overhead come from?
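To put numbers on the gap, here is a minimal sketch in plain Python (no TVM needed) that parses the two profiler rows quoted above and computes the slowdown and the total network time implied by the Time(%) column. The helper name `parse_row` and the variable names are my own, not part of TVM, and the implied total is only approximate because Time(%) is a rounded figure.

```python
# Rows copied verbatim from the two profiling tables in this post.
HEADER = "Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs"
BERT_ROW = "fused_nn_dense_add_25 | fused_nn_dense_add_2 | 390.445 | 5.959 | (32, 3072) | 3 | 1"
SINGLE_ROW = "fused_nn_dense_add | fused_nn_dense_add | 145.383 | 100.0 | (32, 3072) | 3 | 1"


def parse_row(header, row):
    """Split a pipe-separated profiler row into a {column: value} dict.

    Hypothetical helper for this post only; it is not a TVM API.
    """
    keys = [k.strip() for k in header.split("|")]
    values = [v.strip() for v in row.split("|")]
    return dict(zip(keys, values))


bert = parse_row(HEADER, BERT_ROW)
single = parse_row(HEADER, SINGLE_ROW)

bert_us = float(bert["Time(us)"])
single_us = float(single["Time(us)"])

# How much slower the op runs inside the full network than on its own.
slowdown = bert_us / single_us

# Total run time of the full network, inferred from this op's share of it.
# Approximate, since the Time(%) column is rounded.
bert_total_us = bert_us / (float(bert["Time(%)"]) / 100.0)

print(f"same output shape: {bert['Shape'] == single['Shape']}")
print(f"slowdown: {slowdown:.2f}x")
print(f"extra time per call: {bert_us - single_us:.3f} us")
print(f"implied total BERT run: {bert_total_us / 1000.0:.2f} ms")
```

So the op is about 2.7x slower inside the BERT graph, roughly 245 us extra per call, while the whole network takes on the order of 6.5 ms per run.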


Is there a way to get more detailed information?