Running time of the same workload (same op) varies greatly in different networks

Background

I reproduced the code from the blog post to test BERT model performance on CPU; the results are very impressive.

Then I profiled the generated lib with tvm.contrib.debugger.debug_runtime to inspect the detailed run time of the final operators. Part of the result is listed below:

Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs
--- | --- | --- | --- | --- | --- | ---
fused_nn_dense_add_25 | fused_nn_dense_add_2 | 390.445 | 5.959 | (32, 3072) | 3 | 1

fused_nn_dense_add_2 is a fused op (matmul + bias-add).
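For context, the profiling setup was roughly the following (a minimal sketch, not the exact script: mod/params stand for the converted BERT Relay module and its weights, and the target string and input name are placeholders for my machine/model):

```python
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime

# mod, params: the BERT Relay module and weights from the blog's conversion step (placeholders here)
target = "llvm -mcpu=skylake-avx512"  # placeholder for the actual CPU target I used

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

ctx = tvm.cpu(0)
# debug_runtime is a drop-in replacement for the graph runtime that times every node
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.set_input("input_ids", input_ids)  # placeholder input name / tensor
m.run()  # prints the per-node time table excerpted above
```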

Then I wrote a simple model containing only one matmul + bias-add with the same workload. The profiling result:

Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs
--- | --- | --- | --- | --- | --- | ---
fused_nn_dense_add | fused_nn_dense_add | 145.383 | 100.0 | (32, 3072) | 3 | 1
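The single-op model was built roughly as below (a sketch; only the output shape (32, 3072) is taken from the table, and the reduction dimension 768 is an assumption matching BERT-base):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime

# Same output shape as fused_nn_dense_add_2: (32, 3072).
# The reduction dimension 768 is an assumption (BERT-base hidden size).
data = relay.var("data", shape=(32, 768), dtype="float32")
weight = relay.var("weight", shape=(3072, 768), dtype="float32")
bias = relay.var("bias", shape=(3072,), dtype="float32")
out = relay.nn.bias_add(relay.nn.dense(data, weight), bias)
mod = tvm.IRModule.from_expr(relay.Function([data, weight, bias], out))

with tvm.transform.PassContext(opt_level=3):
    graph, lib, _ = relay.build(mod, target="llvm -mcpu=skylake-avx512")  # same placeholder target as above

# Profile it with the same debug runtime so the numbers are comparable
m = debug_runtime.create(graph, lib, tvm.cpu(0))
m.set_input("data", np.random.uniform(size=(32, 768)).astype("float32"))
m.set_input("weight", np.random.uniform(size=(3072, 768)).astype("float32"))
m.set_input("bias", np.random.uniform(size=(3072,)).astype("float32"))
m.run()  # reports the single fused_nn_dense_add node timed in the table above
```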

Both tasks run on the same machine with the same parameters (OMP_NUM_THREADS=20 numactl -m0 -N0).

Questions

Why does the run time of the same op with the same workload vary so greatly (390 us vs. 145 us)? Where does the overhead come from?

Is there some way to get more information?

Thanks.