Float16 for CUDA - Performance

Currently, float16 support for CUDA is incomplete, both functionally and performance-wise. There are a few posts that suggest ways to deal with the functional aspect, but these are not merged in yet. This post deals with the second part: performance.

I was reading this paper - https://www.comp.nus.edu.sg/~wongwf/papers/hpec17.pdf

This one talks about the half2 vs. half data types. half2 is basically float16x2. It seems that we can get a speedup from FP16 on CUDA only when we use the half2 datatype, which signals the hardware to perform two float16 operations simultaneously.
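
For concreteness, here is a minimal hand-written CUDA sketch (my own illustration, not TVM output; the kernel name and signature are made up) of a packed fp16 add using the half2 type and the __hadd2 intrinsic from cuda_fp16.h:

#include <cuda_fp16.h>

// Each thread adds one pair of fp16 values with a single packed operation.
// n2 is the number of half2 elements, i.e. half the number of fp16 values.
__global__ void add_fp16_packed(const half2* __restrict__ a,
                                const half2* __restrict__ b,
                                half2* __restrict__ c, int n2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n2) {
    // __hadd2 performs two float16 additions at once on fp16-capable GPUs.
    c[i] = __hadd2(a[i], b[i]);
  }
}

The same kernel written with scalar half would issue one fp16 operation per element, which is where the packed half2 path gets its advantage.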

Has anybody prototyped this before? Or does anyone have an idea of how to make this happen?

@vinx13 @ibeltagy @hhhh @xyzhou @ydy @tqchen @comaniac

Generally there are two things:

  • Overriding codegen for the half and half2 types for arithmetic operators *, +, …
  • Supporting the vectorized type half2: in the CUDA codegen, map float16x2 to half2

So, schedules have to be changed, right?
Or can we reuse the same schedule and, for example, convert float16x8 into, say, (float16x2)x4?

Not necessarily. It depends on how the vectorization length is set (hard-coded or chosen by AutoTVM).
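
To illustrate the (float16x2)x4 idea from the question above, here is a hand-written CUDA sketch (my own illustration, not TVM output; the kernel name is made up) of lowering one float16x8 access per thread into four half2 operations:

#include <cuda_fp16.h>

// Each thread processes 8 fp16 values as four half2 pairs.
__global__ void add_fp16x8(const half* __restrict__ a,
                           const half* __restrict__ b,
                           half* __restrict__ c) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
  #pragma unroll
  for (int k = 0; k < 4; ++k) {
    // One packed fp16x2 addition per iteration.
    half2 x = ((const half2*)(a + base))[k];
    half2 y = ((const half2*)(b + base))[k];
    ((half2*)(c + base))[k] = __hadd2(x, y);
  }
}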

Just to echo @janimesh's note about performance, I ran some PyTorch code and the equivalent TVM-generated code and compared their float32 vs. float16 performance.
Switching to float16 gave the following speedups:
PyTorch: 1.88x
TVM-generated code: 1.17x (significantly lower than PyTorch)

Thanks @ibeltagy for sharing the observation. We need to dive deep into the TVM CUDA schedules to understand this. Currently, I am not aware of what needs to go in to get a speedup.

Unfortunately, I am busy with other portions of the TVM project. Also, I am not familiar enough with CUDA schedules and codegen to quickly come up with a list of tasks.

It might be very helpful if we can get someone who has worked on CUDA schedules to come up with a rough plan, and then we can parallelize the efforts.

@janimesh, any updates about this?

@vinx13, can you give more details on how we might implement the two things you suggested earlier?

Sorry @ibeltagy, I am busy with some other portions and have not gotten any time to think about this.

The first thing (codegen) is already supported. For the second part, the current TOPI operators support fp16 but do not guarantee good performance. One thing we need to do is support the vectorized type half2: we need to update the CUDA codegen to generate the half2 type in the emitted code.

Can you give a concrete example of how to override one operation? Then I can follow your example to override the rest.

We can start with some simple examples to get them working.
Here is an example using half2:

import tvm

n = 16
a = tvm.placeholder((n,), dtype='float16')  # fp16 input
b = tvm.placeholder((n,), dtype='float16')  # fp16 input
c = tvm.compute((n,), lambda i: a[i] + b[i])  # elementwise fp16 add
s = tvm.create_schedule(c.op)
# split so each thread handles 2 fp16 elements, then vectorize the inner pair
co, ci = s[c].split(c.op.axis[0], nparts=n // 2)
s[c].bind(co, tvm.thread_axis('threadIdx.x'))
s[c].vectorize(ci)  # produces a float16x2 (vector) access in the IR
f = tvm.build(s, [a, b, c], target='cuda')

This yields the following CUDA code:

extern "C" __global__ void default_function_kernel0( half* __restrict__ compute,  half* __restrict__ placeholder,  half* __restrict__ placeholder1) {
    float1 _1 = (( float1*)(placeholder + (((int)threadIdx.x) * 2)))[0];
    float1 _2 = (( float1*)(placeholder1 + (((int)threadIdx.x) * 2)))[0];
    float1 _3;
    _3.x = (_1.x+_2.x);
    _3.y = (_1.y+_2.y);
  (( float1*)(compute + (((int)threadIdx.x) * 2)))[0] = _3;
}

which is not correct: the vectorized float16x2 access is emitted as float1 (which has no .y member and the wrong element type) instead of half2. We need to update the rule in https://github.com/apache/incubator-tvm/blob/master/src/codegen/codegen_cuda.cc#L190
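
For comparison, here is a hand-edited sketch (not actual TVM output) of what this kernel could look like once float16x2 is mapped to half2 and the addition is emitted through __hadd2:

extern "C" __global__ void default_function_kernel0( half* __restrict__ compute,  half* __restrict__ placeholder,  half* __restrict__ placeholder1) {
    // load the vectorized float16x2 accesses as half2
    half2 _1 = ((half2*)(placeholder + (((int)threadIdx.x) * 2)))[0];
    half2 _2 = ((half2*)(placeholder1 + (((int)threadIdx.x) * 2)))[0];
    // one packed instruction performs both fp16 additions
    half2 _3 = __hadd2(_1, _2);
    ((half2*)(compute + (((int)threadIdx.x) * 2)))[0] = _3;
}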

Cool. Will check this in detail later today. Thanks!