Background
I reproduced the code from the blog post to test BERT model performance on CPU, and the results are very impressive.
Then I profiled the generated lib with tvm.contrib.debugger.debug_runtime to inspect the run time of each final operator. Part of the result is listed below:
Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs
fused_nn_dense_add_25 | fused_nn_dense_add_2 | 390.445 | 5.959 | (32, 3072) | 3 | 1
fused_nn_dense_add_2 is a fused op (matmul + bias-add).
Then I wrote a simple model that contains only one matmul + bias-add with the same workload. Profiling result:
Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs
fused_nn_dense_add | fused_nn_dense_add | 145.383 | 100.0 | (32, 3072) | 3 | 1
Both tasks ran on the same machine with the same settings (OMP_NUM_THREADS=20 numactl -m0 -N0).
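To put numbers on the gap, here is a minimal sketch in plain Python (no TVM needed) that parses the two profiler rows above and computes the slowdown of the op inside BERT relative to the standalone op. It also estimates the total profiled time of each run, assuming the Time(%) column is the op's share of the whole run, which is my reading of the table and not something I have verified. The helper names (parse_row, slowdown, implied_total_us) are only for illustration.

```python
# Rows copied verbatim from the two debug_runtime tables above.
HEADER = "Node Name | Ops | Time(us) | Time(%) | Shape | Inputs | Outputs"
BERT_ROW = "fused_nn_dense_add_25 | fused_nn_dense_add_2 | 390.445 | 5.959 | (32, 3072) | 3 | 1"
SINGLE_ROW = "fused_nn_dense_add | fused_nn_dense_add | 145.383 | 100.0 | (32, 3072) | 3 | 1"


def parse_row(header, row):
    """Split one pipe-delimited profiler row into a dict keyed by column name."""
    keys = [k.strip() for k in header.split("|")]
    values = [v.strip() for v in row.split("|")]
    return dict(zip(keys, values))


def slowdown(in_model, standalone):
    """How many times slower the op is inside the full model."""
    return float(in_model["Time(us)"]) / float(standalone["Time(us)"])


def implied_total_us(row):
    """Total run time implied by this row.

    Assumption: Time(%) is the op's share of the total profiled time.
    """
    return float(row["Time(us)"]) / (float(row["Time(%)"]) / 100.0)


bert = parse_row(HEADER, BERT_ROW)
single = parse_row(HEADER, SINGLE_ROW)

print(f"slowdown inside BERT: {slowdown(bert, single):.2f}x")
print(f"implied total time of the BERT run: {implied_total_us(bert):.0f} us")
```

So the same (32, 3072) workload is roughly 2.7x slower inside the full graph, and under the assumption above the whole BERT run takes on the order of 6.5 ms.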
Questions
Why does the run time of the same op with the same workload vary so much (390 us vs 145 us)? Where does the overhead come from?
Thanks.