How to optimize dequantize_gemv on Hexagon DSP

wen · March 1, 2024, 8:27am

Hello everyone! I’ve been running some experiments on this and have hit a bottleneck:

How can I optimize a dequantize GEMV on Hexagon DSP to make it more efficient?

wen · March 1, 2024, 8:35am

I modified some test scripts; the link is as follows: hexagon_dequant_gemv

Current results:

[1, 2048] * [2048, 2048] = 92 us

dequantize_gemv = 2 ms

Regarding dequantization, part of the schedule may need to be optimized. I’d be grateful for any advice. Thanks!
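To put the two timings in perspective, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes the 92 us figure is the fp16 GEMV and that every fp16 weight (2 bytes) is read exactly once. The helper name `effective_gbps` is made up for illustration.

```python
# Timings reported in this post.
FP16_GEMV_US = 92.0        # [1, 2048] * [2048, 2048]
DEQUANT_GEMV_US = 2000.0   # dequantize_gemv, 2 ms

K = N = 2048
FP16_BYTES = 2             # assumption: weights are float16


def effective_gbps(num_bytes, time_us):
    """Bytes streamed per second, in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (time_us * 1e-6) / 1e9


# A GEMV touches every weight once, so this is the minimum weight traffic.
weight_bytes = K * N * FP16_BYTES
slowdown = DEQUANT_GEMV_US / FP16_GEMV_US
fp16_gbps = effective_gbps(weight_bytes, FP16_GEMV_US)

print(f"fp16 weight size      : {weight_bytes / 2**20:.1f} MiB")
print(f"dequant vs fp16       : {slowdown:.1f}x slower")
print(f"fp16 weight throughput: {fp16_gbps:.1f} GB/s")
```

Under these assumptions the fp16 kernel streams 8 MiB of weights at roughly 91 GB/s, and the dequantize version is about 21.7x slower, so it helps to know where the weights live when each number is measured.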

wen · March 1, 2024, 8:39am

@kparzysz Could you give me some ideas to learn from? Thank you.

Hzfengsy · March 1, 2024, 11:42am

Allocating the weight matrix in VTCM is not possible for e2e models, because we somehow need to move the data from global memory (DDR) to VTCM (SRAM). That copy can be a separate step or part of the computation, but either way its time must be counted during evaluation.

wen · March 1, 2024, 12:01pm

Hi Hzfengsy,

dequantize = T.alloc_buffer((2048, 2048), "float16", scope="global.vtcm")

I tried to have the dequantized variable generated inside the function, but I don’t know much about how to control the lifetime of the variable inside the function.

For example, if the lifetime of the variable can be controlled inside the function, I could use temporary buffers to manage what lives in VTCM. This may make the use of VTCM more flexible.

Hzfengsy · March 1, 2024, 12:16pm

You are right. There are several points:

1. Your fp16 result is unfair, because the weight is neither generated by the kernel nor can it be allocated in VTCM.
2. In the dequantization case, the dequantized weight could live in VTCM if we can control the storage scope, but the quantized weight cannot.
3. Lifetime control can be applied in various places, either in Relax or in the fused TIR. The easiest way should be in the fused function, with a mechanism similar to the one used on GPU.

wen · March 1, 2024, 12:36pm

Thanks for your reply. The modified code is here: update. It should look fair now. The weights are loaded in DDR and passed to the kernel, which creates a VTCM buffer for later calculations. I will learn how to use sch later; the current scheduling optimization is still very simple.

Noahschnapp · April 7, 2024, 8:04pm

Is there a method to allocate a weight matrix in VTCM for e2e models, considering the need to transfer data from DDR to SRAM? Can this be a separate step or integrated into computation, and how do we account for time during evaluation?

wen · April 17, 2024, 7:22am

Directly allocating in VTCM does not seem to work, because VTCM is only 8 MB and it takes a long time to copy from global DDR to VTCM. At present, I have tried using l2fetch to improve the cache hit rate. However, a dequantized 2048*2048 GEMV still takes about 200 us, which is not very fast.
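A quick footprint check shows why the full dequantized matrix is a poor fit. This is a rough sketch under stated assumptions: the 8 MB VTCM size is taken from this post and treated as 8 MiB, while the 4-bit quantized format and the 128-row tile are hypothetical values chosen only for illustration.

```python
VTCM_BYTES = 8 * 2**20   # VTCM size from this post, treated as 8 MiB
K = N = 2048


def buffer_bytes(rows, cols, bits_per_element):
    """Size in bytes of a dense rows x cols buffer."""
    return rows * cols * bits_per_element // 8


full_fp16 = buffer_bytes(K, N, 16)    # fully dequantized weight
full_int4 = buffer_bytes(K, N, 4)     # hypothetical 4-bit quantized weight
tile_fp16 = buffer_bytes(128, N, 16)  # hypothetical 128-row dequantized tile

for name, size in [("full fp16", full_fp16),
                   ("full int4", full_int4),
                   ("fp16 tile", tile_fp16)]:
    print(f"{name}: {size / 2**20:.2f} MiB, "
          f"{100 * size / VTCM_BYTES:.1f}% of VTCM")
```

The fully dequantized fp16 weight alone would occupy all of an 8 MiB VTCM, whereas a tile of dequantized rows uses only a small fraction of it.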