[TOPI] Winograd convolution performance is too slow

Hello!

I am currently comparing the performance of direct conv2d and Winograd conv2d using TOPI. However, in my experiments, conv2d with the Winograd algorithm performs much worse than the direct version. Below is the code I experimented with.

import numpy as np
import tvm
from tvm import topi

## target and device
target = "cuda"
ctx = tvm.context(target, 0)

## data shapes (NCHW data, OIHW weight)
data_shape = (1, 3, 224, 224)
w_shape = (64, 3, 3, 3)

## random data
sample_data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
sample_p1 = np.random.uniform(-1, 1, size=w_shape).astype("float32")

## placeholders
input_data = tvm.te.placeholder(shape=data_shape, dtype="float32", name="Input")
p1 = tvm.te.placeholder(shape=w_shape, dtype="float32", name="p1")

## Winograd conv2d: strides (1,1), padding (0,0), dilation (1,1)
with tvm.target.create(target):
    conv = topi.cuda.conv2d_nchw_winograd(input_data, p1, (1, 1), (0, 0), (1, 1), "float32")
    sch = topi.cuda.schedule_conv2d_nchw_winograd([conv])
    winoMod = tvm.build(sch, [input_data, p1, conv], target, name="wino")

## Direct conv2d with the same strides/padding/dilation
with tvm.target.create(target):
    conv = topi.cuda.conv2d_nchw(input_data, p1, [1, 1], [0, 0], [1, 1])
    sch = topi.cuda.schedule_conv2d_nchw([conv])
    simpleMod = tvm.build(sch, [input_data, p1, conv], target, name="direct")

## device arrays (both modules take input, weight, and output buffers)
out_shape = (1, 64, 222, 222)  ## 3x3 conv, stride 1, no padding: 224 - 3 + 1 = 222
tvm_input = tvm.nd.array(sample_data, ctx)
tvm_p1 = tvm.nd.array(sample_p1, ctx)
tvm_output = tvm.nd.array(np.zeros(out_shape, dtype="float32"), ctx)

## performance testing
ev_wino = winoMod.time_evaluator(winoMod.entry_name, ctx, number=1, repeat=100)
ev_conv = simpleMod.time_evaluator(simpleMod.entry_name, ctx, number=1, repeat=100)

timer = ev_conv(tvm_input, tvm_p1, tvm_output).mean * 1e3
print("Conv with Direct algo -> ", timer)
timer = ev_wino(tvm_input, tvm_p1, tvm_output).mean * 1e3
print("Conv with Winograd algo -> ", timer)

The execution result is as follows.

Conv with Direct algo ->  0.11522044
Conv with Winograd algo ->  4.70840109

The performance gap is too big. According to the Fast Algorithms for Convolutional Neural Networks paper, I expected Winograd performance to be better than or similar to direct conv2d. Is there something I misunderstood?
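
For reference, the saving the paper derives: computing an m x m output tile with an r x r filter directly takes m^2 * r^2 multiplications, while Winograd's F(m x m, r x r) takes only (m + r - 1)^2. For F(2x2, 3x3) that is 16 vs. 36 multiplications, a 2.25x reduction, and this does not yet count the overhead of the input, filter, and output transforms.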

Try a bigger number of channels. Winograd is slow for small channel counts, since the transform overhead is not amortized over enough channels.

I tried a bigger channel count for the image and weights, as below.

img_shape = (1, 512, 224, 224), w_shape = (256, 512, 3, 3)

The shape format is NCHW.

The result is:

direct => 50.641 ms
winograd => 604.84 ms

The performance was still not better than direct conv2d… Should I use even more channels?

I conducted additional experiments.

When using the conv1.2 layer of the VGG-16 network, Winograd should outperform direct conv2d according to the paper.

But the result is that direct conv2d is still better:

input_img shape = (1,64,224,224)  ## NCHW format
weight shape = (64,64,3,3)
bias shape = (64,)
Conv with Direct algo ->  0.563ms
Conv with Winograd algo ->  22.261ms

Winograd performance is far worse than direct conv2d. Is this normal?

Have you used AutoTVM to tune the winograd template? The default schedule could be slow.
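
For reference, a rough sketch of what tuning that template could look like (this assumes a TVM 0.7-era AutoTVM API; the task name "conv2d_nchw_winograd.cuda", the trial count, and the log file name are illustrative, and XGBTuner requires xgboost to be installed):

import tvm
from tvm import autotvm, topi

## create a tuning task for the registered winograd template; the args
## mirror the VGG conv1.2 shapes and conv attributes from the posts above
task = autotvm.task.create(
    "conv2d_nchw_winograd.cuda",
    args=(("TENSOR", (1, 64, 224, 224), "float32"),
          ("TENSOR", (64, 64, 3, 3), "float32"),
          (1, 1), (0, 0), (1, 1), "float32"),
    target="cuda")
print(task.config_space)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3, timeout=4))

## search the schedule space and log the measured configs
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=1000,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("wino.log")])

## compile under apply_history_best so the tuned config replaces the fallback
input_data = tvm.te.placeholder((1, 64, 224, 224), dtype="float32", name="Input")
p1 = tvm.te.placeholder((64, 64, 3, 3), dtype="float32", name="p1")
with autotvm.apply_history_best("wino.log"):
    with tvm.target.create("cuda"):
        conv = topi.cuda.conv2d_nchw_winograd(
            input_data, p1, (1, 1), (0, 0), (1, 1), "float32")
        sch = topi.cuda.schedule_conv2d_nchw_winograd([conv])
        winoMod = tvm.build(sch, [input_data, p1, conv], "cuda", name="wino")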

Thank you for the reply, haichen! I will try it.

Note that both convolutions are using the fallback configuration, not tuned by AutoTVM. Even so, Winograd is slower by roughly 40x in my experiments. In general, does Winograd perform poorly before tuning?

I haven’t compared these two implementations using the fallback config before, but based on your observation, it does seem that the Winograd fallback config performs poorly.

Thanks for the advice, haichen!

I think I should try optimizing Winograd using AutoTVM. Thank you.

Hi, I am now trying to use AutoTVM to tune the templates for different operators, but I do not know where the templates that TVM already implements are located. Could you help me with this? Also, are there any examples or tutorials on how to use these templates?

Also, from the Tuning High Performance Convolution on NVIDIA GPUs tutorial, I see that the search space is defined with tiling and unrolling knobs. Are there templates with more kinds of knobs?

Thank you in advance.

The implementations for CUDA are defined in topi/python/topi/cuda. The strategy that selects implementations for the conv2d op is defined here.
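
If it helps to find them: the templates register themselves when topi is imported, so you can list the registered task names (TASK_TABLE is an internal table, so treat this as an exploration aid for a TVM 0.7-era build):

import tvm
from tvm import topi  ## importing topi runs the template-registration decorators
from tvm.autotvm.task.task import TASK_TABLE

## print every registered AutoTVM task template name
for name in sorted(TASK_TABLE):
    print(name)  ## e.g. "conv2d_nchw_winograd.cuda"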

I don’t understand your second question very well. Do you want to know how to write an AutoTVM template?

Thank you for your reply.

Which function in AutoTVM will use this strategy?

I want to ask how to use one of the templates implemented by TVM to auto-tune an operator. Someone already helped me with this, but I still have a question, and maybe you can help.

By reading the documentation and some code, I found that there are four functions that can define the config space: define_split, define_reorder, define_annotate, and define_knob. I also found that define_split and define_knob are the most used. Is that because loop tiling and loop unrolling are the most important or most common techniques for automatically optimizing an operator?

Thank you in advance.
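
For illustration, here is a minimal sketch of a standalone AutoTVM template that uses define_split and define_knob, modeled on the official "tune simple template" tutorial (the task name "example/matmul" is made up; define_reorder and define_annotate are declared and applied through cfg in the same way):

import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul")  ## hypothetical task name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M),
                   lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
                   name="C")
    s = te.create_schedule(C.op)
    cfg = autotvm.get_config()

    y, x = s[C].op.axis
    ## define_split: enumerate the ways to tile the y and x loops
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    ## define_knob: enumerate explicit candidate values, here an unroll limit
    cfg.define_knob("auto_unroll_max_step", [0, 16, 64])

    ## apply the chosen config entities to the schedule
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    s[C].pragma(yo, "auto_unroll_max_step", cfg["auto_unroll_max_step"].val)
    return s, [A, B, C]

A task for it can then be created with autotvm.task.create("example/matmul", args=(512, 512, 512, "float32"), target="llvm") and tuned with the same tuner/measure_option flow as in the winograd sketch above. Tiling knobs tend to dominate these spaces because they control data locality, which is usually the biggest performance lever on most backends.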