I have written an INT4 GPU Conv2d, and the single operator runs correctly. I then tried to run ResNet-50 with this newly developed Conv2d, but when running the model, some other quantization operators get fused into the Conv2d (e.g. “fused_nn_conv2d_add_right_shift_clip_cast_17”), and the run fails with the error
"ValueError: Direct host side access to device memory is detected. Did you forget to bind?"
The generated code also has some problems:
# Output of GPU Conv2d
produce Out {
// attr [iter_var(blockIdx.z, , blockIdx.z)] thread_extent = 784
// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 16
for (n.inner.o.inner.fused.outer.outer.outer.outer, 0, 2) {
// attr [iter_var(threadIdx.z, , threadIdx.z)] thread_extent = 1
// attr [iter_var(threadIdx.y, , threadIdx.y)] thread_extent = 1
// attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
Out[(((((n.inner.o.inner.fused.outer.outer.outer.outer*401408) + (floordiv(threadIdx.x, 8)*100352)) + (blockIdx.y*6272)) + (floormod(threadIdx.x, 8)*784)) + blockIdx.z)] = Conv[((((blockIdx.z*1024) + (blockIdx.y*64)) + (n.inner.o.inner.fused.outer.outer.outer.outer*32)) + threadIdx.x)]
}
}
# Cast on CPU accesses "Out" on GPU
produce T_cast {
for (ax0, 0, 8) {
for (ax1, 0, 128) {
for (ax2, 0, 28) {
for (ax3, 0, 28) {
T_cast[((((ax0*100352) + (ax1*784)) + (ax2*28)) + ax3)] = int4(max(min(shift_right((Out[((((ax0*100352) + (ax1*784)) + (ax2*28)) + ax3)] + 16), 5), 7), -7))
}
}
}
}
}
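For reference, the fused add/shift_right/clip/cast stage inside `produce T_cast` is just a scalar requantize step applied elementwise. A plain-Python sketch of that expression (my own illustration, not TVM code; the names are hypothetical):

```python
def requantize(x: int) -> int:
    """Scalar equivalent of the fused stage in T_cast:
    int4(max(min(shift_right(x + 16, 5), 7), -7)).
    The +16 is a rounding bias for the 5-bit arithmetic right shift,
    and the result is clamped to the symmetric int4 range [-7, 7]."""
    return max(min((x + 16) >> 5, 7), -7)

print(requantize(0))      # small values round to 0
print(requantize(1000))   # saturates at the int4 max, 7
print(requantize(-1000))  # saturates at the int4 min, -7
```

This is purely elementwise, which is presumably why the fusion pass considered it fusable into the Conv2d output in the first place.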
In `produce T_cast`, CPU code accesses the GPU buffer “Out” directly, so my guess is that the fusion pass fused CPU operators into the GPU operator, and the resulting CPU stage reads the output of the GPU kernel without a copy back to the host. I have no idea what to do about it; I would appreciate it if anyone could take a look and give me some suggestions.