Currently, float16 support for CUDA is incomplete, both functionally and performance-wise. There are a few posts suggesting ways to deal with the functional aspect, but those are not merged in yet. This post is about the second part: performance.
I was reading this paper - https://www.comp.nus.edu.sg/~wongwf/papers/hpec17.pdf
It discusses the half data types. half2 is basically float16x2: two float16 values packed into a single 32-bit register. It seems that we can get a speedup from FP16 on CUDA only when we use the half2 datatype, which signals the hardware to perform two float16 operations simultaneously.
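For concreteness, here is a minimal sketch of what a half2-based kernel could look like (the kernel name and setup are my own, not from the paper; it uses the `__hadd2` intrinsic from `cuda_fp16.h`, which needs compute capability 5.3 or higher):

```cuda
#include <cuda_fp16.h>

// Hypothetical element-wise add over packed half2 data.
// Each thread processes one half2, i.e. two float16 values at once.
__global__ void add_half2(const __half2* a, const __half2* b,
                          __half2* out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // __hadd2 adds both packed half values in one instruction,
        // which is where the 2x FP16 throughput comes from.
        out[i] = __hadd2(a[i], b[i]);
    }
}
```

The key point is that the data has to be laid out (or reinterpreted) as pairs of float16 values so the packed intrinsics can be used; scalar `__half` arithmetic alone does not get the throughput benefit.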
Has anybody prototyped this before, or does anyone have an idea of how to make this happen?