TVM and BLAS libraries

Hi,

I’m trying to benchmark TVM built with different BLAS libraries. I have a test model that I compiled and ran after building TVM with each of the BLAS configurations offered in config.cmake. The problem is that inference time is the same no matter which BLAS library TVM was built with. Is this normal? I was expecting to see different performance for each BLAS library.

Did you specify the library in the target (e.g., cuda -libs=cudnn,cublas)?

@comaniac I did not; I didn’t find any documentation about needing to specify the library in the target. Let me try.

Try to follow this tutorial. For CPU, you should be able to use llvm -libs=cblas if I remember correctly.
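In code, the library is requested through the target string you pass when compiling. A minimal sketch (the toy model and exact API names are only illustrative and differ a bit between TVM versions):

```python
import tvm
from tvm import relay

# Toy one-layer model just to show where the target string goes;
# in practice mod/params would come from a frontend importer.
x = relay.var("x", shape=(1, 512), dtype="float32")
w = relay.var("w", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

# The external library is requested inside the target string itself.
target = "llvm -libs=cblas"

# Older TVM returns a (graph, lib, params) tuple here,
# newer versions return a single factory module.
lib = relay.build(mod, target=target)
```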

@comaniac Thanks for the reference to the tutorial! If I specify -libs=cblas in the target, inference time gets ~25% worse. What does the -libs=cblas flag actually imply?

Which BLAS library do you use? If you use MKL, it might be due to the conflict between TVM threadpool and OpenMP threadpool. You can probably try to enable OpenMP threadpool in the cmake config file.

@haichen Actually I’m trying to test your PR with the MKL-DNN backend. After building TVM with OpenMP support, inference time went back to the original value (the same as the target without any external lib).

But even when building TVM with standard MKL, the performance is exactly the same.

If I build TVM with OpenBLAS or ATLAS and specify -libs=cblas, inference is ~3% slower than the baseline, while I don’t see any change when building with MKL or MKL-DNN.

You might want to experiment with environment variables that impact MKL. For example, OMP_NUM_THREADS, KMP_BLOCKTIME, OMP_NESTED, etc.
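These are plain process environment variables, so they need to be in place before MKL/OpenMP initializes. A sketch with illustrative values only:

```python
import os

# Must be set before TVM / MKL are first imported, otherwise the
# OpenMP runtime may already have picked up its defaults.
os.environ["OMP_NUM_THREADS"] = "4"    # threads available to MKL's OpenMP pool
os.environ["KMP_BLOCKTIME"] = "1"      # ms an Intel OpenMP worker spins before sleeping
os.environ["OMP_NESTED"] = "FALSE"     # avoid nested parallel regions
os.environ["TVM_NUM_THREADS"] = "4"    # size of TVM's own thread pool

import tvm  # imported only after the environment is configured
```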

@jonso My inference is running single-threaded, so I don’t think those env variables would help, right?

@haichen is it normal to get exactly the same performance w/ and w/o specifying -libs=cblas when TVM is built with MKL support? Maybe MKL isn’t used at all?

@comaniac if no external library is specified in the target, which backend does TVM use?

I don’t think you will see much difference with a single thread. Libraries like MKL are built to exploit parallelism.

Personally, I see a significant difference with and without -libs=cblas when using multiple threads.


It depends. What’s your workload? Is it just a dense op or an entire network? Dense may not be the performance bottleneck when you profile an entire network, so the impact of using CBLAS would be diluted. I ran an experiment months ago using a 512x512 matrix to perform dense with and without CBLAS; the version with CBLAS was ~1.25x faster.

If no external library is specified (e.g., just llvm), TVM generates LLVM IR and lowers it to machine code directly. In this case, it’s suggested to specify at least the CPU model so that TVM can use AVX instructions, for example llvm -mcpu=skylake-avx512.
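Something along these lines works for a quick comparison of the two targets (a sketch only; runtime module names shift a bit across TVM versions, e.g. graph_runtime later became graph_executor):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime  # tvm.contrib.graph_executor in newer TVM

# Single 512x512 dense op, as in the experiment mentioned above.
data = relay.var("data", shape=(512, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(
    relay.Function([data, weight], relay.nn.dense(data, weight)))

d = np.random.uniform(size=(512, 512)).astype("float32")
w = np.random.uniform(size=(512, 512)).astype("float32")

for target in ["llvm -mcpu=skylake-avx512",
               "llvm -mcpu=skylake-avx512 -libs=cblas"]:
    with tvm.transform.PassContext(opt_level=3):
        graph, lib, params = relay.build(mod, target=target)
    ctx = tvm.cpu(0)
    m = graph_runtime.create(graph, lib, ctx)
    m.set_input("data", d)
    m.set_input("weight", w)
    timer = m.module.time_evaluator("run", ctx, number=100, repeat=3)
    print(target, "-> mean %.3f ms" % (timer().mean * 1e3))
```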

@comaniac The workload is a complete network, ResNet-like. There’s only one dense layer at the very top. When building TVM with OpenBLAS/ATLAS and enabling -libs=cblas, performance gets ~3% worse. When building TVM with MKL, there’s no difference in performance with or without -libs=cblas. This behaviour is kind of unexpected, imho.

By the way, I’m already specifying -mcpu in the target.

For most neural-network cases, we would expect TVM to work better than the BLAS cases if the workload is not typical, as the generated code can benefit from things like operator fusion and shape-specific tuning.


@tqchen Got it, thank you! In my case I’d still expect to see different performance when using BLAS compared to the TVM baseline, but this doesn’t happen with MKL or MKL-DNN, almost as if those BLAS libraries aren’t used or the computation falls back to the TVM one.

Any other comment on this?

@haichen can you share what kind of performance improvement you obtained by using the MKL-DNN backend?

From my experience with the BERT base model on an EC2 C5.4xlarge instance, I could reduce the latency from 93ms to 52ms just by switching dense from the TOPI implementation (after tuning with AutoTVM) to MKL-DNN with the OpenMP thread pool. In particular, the total latency of all dense ops in BERT drops from 65.7ms to 29.2ms.
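For reference, per-op numbers like these can be collected with the debug graph runtime, which prints a per-node time breakdown after each run. A sketch (the module moved to tvm.contrib.debugger.debug_executor in later releases):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime  # debug_executor in newer TVM

# Stand-in single dense op; the same approach applies to a full network.
x = relay.var("x", shape=(1, 768), dtype="float32")
w = relay.var("w", shape=(768, 768), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build(mod, target="llvm -libs=cblas")

m = debug_runtime.create(graph, lib, tvm.cpu(0))
m.set_input("x", np.random.uniform(size=(1, 768)).astype("float32"))
m.set_input("w", np.random.uniform(size=(768, 768)).astype("float32"))
m.run()  # prints the time spent in every node, so the dense ops' share is visible
```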


Just by enabling the MKL-DNN backend when building TVM and adding -libs=cblas to the target?