[VTA][Chisel] Redundancy of having 32-bit Accumulator matrix in GEMM

There seems to be no need for the GEMM module to have an accumulator matrix with 32-bit integer values.

When I looked into the TensorGemm.scala code, I observed that when values are stored from the accumulator matrix to the output matrix, they are truncated from int32 to int8. This seems redundant, because the same output could be achieved even if the accumulator values were of type int8.

There also seem to be some inconsistencies in the matrix multiplication tutorial. The tutorial states that the accumulator width is set to 32 bits in order to avoid overflow during accumulation. However, in the same tutorial, due to the memory-store restrictions of the VTA architecture, the output matrix can only be stored to DRAM when its datatype matches the input (int8). Hence, the values in the accumulator matrix are truncated from int32 to int8, which reintroduces the aforementioned overflow issue and makes the claim that a 32-bit accumulator avoids overflow redundant.
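To illustrate the store-path truncation in plain Scala (the names below are made up for illustration and are not from the VTA code base):

```scala
// A value that accumulated overflow-free in 32 bits...
val acc: Int = 30000
// ...is narrowed to int8 at the store, keeping only the low 8 bits.
val stored: Byte = acc.toByte
// stored is 48, not 30000: the overflow avoided during accumulation
// reappears at the narrowing store.
```

So the wide accumulator only delays the wraparound; it does not prevent it from reaching DRAM.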

This redundancy can be avoided by setting the bit width of the accumulator matrices to int8, which would give the same results as the current GEMM module. This would reduce the space needed to store the accumulator by 4x, and would improve the performance of both the multiplication and the subsequent pipelined addition.

More generally, the datatypes of all matrices could be set to a single value (equal to the output datatype) for the sake of simplicity, without changing functionality.

Just to double-check: do you mean the design should use INT8 as the accumulator?

AFAIK, the accumulator matrices are only shifted down by a number of bits using a SHR instruction.
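A minimal Scala sketch of why the shift matters (identifiers are illustrative, not VTA code): once a right shift is applied before the narrowing store, the high accumulator bits become visible in the result, so an 8-bit accumulator would diverge from a 32-bit one.

```scala
// Shift the wide accumulator down, then narrow to int8 on store.
def shrThenStore(acc: Int, shift: Int): Byte = (acc >> shift).toByte

val full32: Int    = 100 * 100 + 100 * 100 // 20000, needs more than 8 bits
val wrapped8: Byte = (full32 & 0xFF).toByte // what 8-bit accumulation keeps

// shrThenStore(full32, 8) recovers 78 from the high bits,
// while shifting the wrapped 8-bit value gives 0.
```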

Yes.

When two int8 inputs are multiplied or added, preserving the overflow carried into the high bits is irrelevant if the output is eventually going to be stored as an int8 value: the low 8 bits of the result are the same either way. Thus, an accumulator of type int8 in the GEMM module would still give the same results as one of type int32.
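This low-bits argument can be checked in plain Scala (the helper names are hypothetical, not from TensorGemm.scala): because truncation keeps only the low 8 bits, accumulating in 32 bits and truncating at the end gives the same byte as wrapping to 8 bits at every step.

```scala
// 32-bit accumulation of int8 products, truncated once at the end.
def mac32ThenTruncate(a: Seq[Byte], b: Seq[Byte]): Byte =
  a.zip(b).map { case (x, y) => x.toInt * y.toInt }.sum.toByte

// 8-bit wraparound accumulation at every step.
def mac8(a: Seq[Byte], b: Seq[Byte]): Byte =
  a.zip(b).foldLeft(0.toByte) { case (acc, (x, y)) => (acc + x * y).toByte }
```

By modular arithmetic the two always agree, provided no intermediate shift inspects the high bits.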

I am referring to the accumulator matrices used in the GEMM module for storing intermediate results. In the Chisel implementation, these are the acc_i and acc_o matrices in the MatrixVectorMultiplication module in TensorGemm.scala.

@BharathKinnal Thanks for reporting. Now I understand what the problem is.

I think we need to adjust the data width according to the hyper-parameters.
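Something like the following sizing sketch could derive the width from the hyper-parameters (this helper is hypothetical and not part of the VTA code base; it uses the standard rule that a full-precision product needs inpBits + wgtBits bits and summing blockIn such products needs ceil(log2(blockIn)) extra carry bits):

```scala
// Hypothetical hyper-parameter bundle for sizing the GEMM accumulator.
case class GemmParams(inpBits: Int, wgtBits: Int, blockIn: Int) {
  // Width needed for overflow-free accumulation of a blockIn-wide dot product.
  def accBits: Int = {
    val carry =
      if (blockIn <= 1) 0
      else 32 - Integer.numberOfLeadingZeros(blockIn - 1) // ceil(log2(blockIn))
    inpBits + wgtBits + carry
  }
}
```

For int8 inputs and weights with a 16-wide reduction this gives 20 bits, i.e. the hyper-parameters would imply a width between the current fixed 8 and 32.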