[RFC][VTA] Support for Cloud Devices (OpenCL-compatible)

Motivation

Cloud devices are more powerful than edge devices, offering higher computation capabilities for deep learning workloads. For the VTA core, for example, cloud devices provide enough resources to support larger GEMM cores (e.g., 32*32 or even 64*64) and larger device buffers, making it possible to boost performance to a great extent. It is therefore worthwhile to provide a generic framework to support cloud devices under the TVM/VTA architecture.

However, extending VTA to cloud devices is non-trivial: the original Xilinx HLS VTA core only works on Xilinx edge FPGA devices, while cloud devices expose a different communication model (shared memory between the ARM cores and the FPGA on edge devices, versus PCIe between the host and the FPGA in the cloud) and a different programming model. In this work, we propose a unified framework that allows any OpenCL-compatible hardware accelerator (e.g., FPGAs or ASICs) to work seamlessly with the TVM-VTA architecture. We also provide an example OpenCL-based VTA implementation that has been tested on Intel’s high-end FPGAs.

Proposal

We would like to extend VTA to OpenCL-compatible devices (e.g., the Intel Programmable Acceleration Card). In particular, we provide a framework into which any OpenCL-compatible device can easily be integrated. The reasons we choose OpenCL-compatible devices are:

  • OpenCL is generic enough to cover a broad group of devices. For example, both Xilinx and Intel are now transitioning towards OpenCL-based HLS approaches.
  • Vendor-specific optimizations are built into the respective OpenCL SDKs (e.g., packing two 8-bit multiply-add units into one DSP slice), while the framework we provide is not limited to any specific SDK.

In addition to the generic OpenCL framework, as a first attempt at the hardware implementation, we would like to target Intel cloud FPGAs (e.g., the Intel Programmable Acceleration Card) using the Intel® FPGA SDK for OpenCL, which has proven portability and scalability for both Intel® Programmable Acceleration (PAC) cards and other custom Intel-FPGA-based acceleration cards. The overall framework remains generic, however, meaning that any OpenCL-compatible device can be plugged in with only a little extra hardware-specific implementation.

Major works

  • Efficient communication between the host and PCIe devices, since PCIe transmission is costly compared to an on-board memory copy

    • To avoid frequent PCIe copies, we propose to run all middle layers of a computation graph entirely on the FPGA device, without interleaved CPU layers. For example, the residual blocks in ResNet originally run on the CPU (ARM cores), which may cause frequent copies in and out of device memory. The extra VTA instructions added in this work are intended to move such residual blocks onto the FPGA.
    • Copy uops and instructions in batches. In particular, only synchronize after all on-device layers have been queued, or when the queues overflow.
  • Support automatic copies between layers running on different devices. We propose to add a few more IR passes to:

    • annotate device types for the computation graph
    • tag and propagate device types among layers
    • automatically insert copy operations (device_copy) when adjacent layers are not on the same device (a conceptual sketch is given after this list)
  • Driver development for OpenCL-compatible devices

    • The original pynq driver cannot be used, as we do not have direct access to hardware registers
    • We implemented a middle-layer driver for OpenCL-compatible devices
    • This layer sits on top of the device’s native driver stack, which implements an interrupt-based device driver
  • OpenCL hardware implementation

    • Addition of extra Load/ALU instructions, such as loading int8 values into the ACC buffer (to support ALU-only nodes) and ALU multiply and left-shift, so that more consecutive computations can stay on the FPGA
    • Refactoring of the hardware implementation code to conform to the Intel® FPGA SDK for OpenCL, as a sample hardware implementation
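
To make the device_copy insertion concrete, here is a minimal, purely illustrative sketch (plain Python with hypothetical names; the real implementation would be Relay-level IR passes over the expression graph, not this list-based toy):

```python
# Illustrative sketch only: hypothetical names, not the actual TVM/Relay passes.
# A "graph" is modelled as an ordered list of layers; each layer is tagged with
# the device it should run on ("cpu" or "fpga").

FPGA_UNSUPPORTED = {"max_pool2d", "avg_pool2d", "softmax"}  # example set

def annotate_devices(layers):
    """Pass 1: tag every layer with a device type."""
    for layer in layers:
        layer["device"] = "cpu" if layer["op"] in FPGA_UNSUPPORTED else "fpga"
    return layers

def insert_device_copies(layers):
    """Pass 2: insert an explicit device_copy wherever two adjacent layers
    were annotated with different devices."""
    result = []
    for i, layer in enumerate(layers):
        if i > 0 and layers[i - 1]["device"] != layer["device"]:
            result.append({"op": "device_copy",
                           "src": layers[i - 1]["device"],
                           "dst": layer["device"],
                           "device": layer["device"]})
        result.append(layer)
    return result

graph = [{"op": "max_pool2d"}, {"op": "conv2d"}, {"op": "conv2d"}]
print(insert_device_copies(annotate_devices(graph)))
# -> max_pool2d (cpu), device_copy (cpu -> fpga), conv2d (fpga), conv2d (fpga)
```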

Major changes to the existing TVM/VTA framework

  • To run a workload on a cloud FPGA, there is no need to launch an additional service (e.g., an RPC server) on the device side. All the driver and runtime programs run on the host side.

  • Change the VTA runtime to support batched queue synchronization. When running a layer, we intend to only queue its instructions/uops and return immediately, without synchronizing with the device. We only synchronize and run the device when the queues overflow or the next layer is not on-device (a sketch of the intended behaviour follows).
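
The following is a minimal sketch of the batching logic, using hypothetical names (the actual change lives in the C++ VTA runtime; this is only meant to show when synchronization happens):

```python
# Illustrative sketch only: hypothetical names, not the actual VTA runtime code.
QUEUE_CAPACITY = 4096  # assumed maximum number of queued instructions

class BatchedQueue:
    def __init__(self, device):
        self.device = device
        self.insns = []   # queued VTA instructions
        self.uops = []    # queued micro-ops

    def push_layer(self, layer_insns, layer_uops, next_layer_on_device):
        """Queue one layer's instructions/uops and return immediately.
        Synchronize only when the queue would overflow or when the next
        layer does not run on the device."""
        if len(self.insns) + len(layer_insns) > QUEUE_CAPACITY:
            self.synchronize()              # flush before the queue overflows
        self.insns.extend(layer_insns)
        self.uops.extend(layer_uops)
        if not next_layer_on_device:
            self.synchronize()              # next layer runs on the CPU

    def synchronize(self):
        if not self.insns:
            return
        # One batched PCIe copy of all queued uops/instructions and one device
        # run, instead of a copy + run + wait per layer.
        self.device.copy_to_device(self.uops, self.insns)
        self.device.run_and_wait(len(self.insns))
        self.insns.clear()
        self.uops.clear()
```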

  • We have to change the device-propagation behaviour from a post-DFS traversal to a recursive method. Originally, device types are propagated over the post-DFS-traversed graph, which may not be consistent if the argument order changes. In addition, it may handle some cases incorrectly, e.g., the first residual block in ResNet50. The first few layers of ResNet50 are depicted in the following figure (top to bottom is DFS order). Basically, we want all layers to run on the FPGA device, except the first and last few layers. In the original device-propagation algorithm, based on the post-DFS order, the grey conv2d layers would be assigned the CPU device type: copy2 is encountered first, and the three grey conv2d nodes are then marked with the source device type of copy2 (i.e., CPU), which is not correct.

Limitations

  • Virtual threads are not yet supported for intelfocl devices, so all instructions run sequentially.
  • In the first version, we require all middle layers to run on the FPGA. As a result, networks whose middle layers contain operations not supported by the hardware may not be supported, because that would interleave CPU and FPGA operations, which is hard to annotate with the correct device types automatically. This restriction also guarantees that there are no frequent device copies between layers. We may relax this restriction in future versions.

What do you think about this feature? @tqchen @thierry


also cc @vegaluis @liangfu

ping @thierry

also cc @hjiang

Thanks @zhanghaohit for your proposal. It’s quite interesting to bring the VTA framework to cloud devices. This RFC covers a rather large topic. I’ve read through the proposed changes, and I’m still unclear about the following:

  • OpenCL requires multi-core parallelism, and we don’t have multi-core support in VTA for now. (The topic of bringing scalability to VTA has been discussed in [VTA] Scalability for data center FPGAs.)
  • How can we reuse the current VTA hardware to communicate with the TVM runtime through the PCI-e interface?

As a side note, Xilinx HLS is quite different from Intel FPGA OpenCL in my observation. I think an easier (and more efficient) workaround is to reuse the Chisel VTA for PCI-e based FPGAs and implement a PCI-e based driver for DMA. @vegaluis would have more experience on this.


Hi liangfu,

Thanks for your reply!

OpenCL requires multi-core parallelism, and we don’t have multi-core support in VTA for now.

Could you explain a bit more about this multi-core parallelism requirement of OpenCL? We intend to implement the proposed VTA core using single-work-item OpenCL kernels.

How can we reuse the current VTA hardware to communicate with the TVM runtime through the PCI-e interface?

A thin middle-layer driver is proposed to facilitate the communication between the TVM runtime and the accelerator through PCIe.
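
To give a rough idea of what we mean, here is a purely illustrative sketch of the middle layer as seen from the runtime (all names are hypothetical; the real driver is written against the vendor’s OpenCL host API and the VTA driver interface, not this Python pseudo-interface):

```python
# Illustrative sketch only: hypothetical names, not the actual intelfocl driver.
# The VTA runtime keeps using the same driver-level primitives (allocate device
# memory, copy buffers, run, wait); the middle layer implements them on top of
# the vendor's OpenCL host runtime instead of memory-mapped registers.

class OpenCLMiddleLayerDriver:
    def __init__(self, vendor_rt):
        self.rt = vendor_rt      # wraps the OpenCL context/queue/kernels
        self.buffers = {}        # opaque handle -> OpenCL buffer object
        self.next_handle = 1

    def mem_alloc(self, nbytes):
        """Allocate a buffer in FPGA DRAM and return an opaque handle."""
        handle, self.next_handle = self.next_handle, self.next_handle + 1
        self.buffers[handle] = self.rt.create_buffer(nbytes)
        return handle

    def copy_to_device(self, handle, host_data):
        """Host -> FPGA DRAM transfer over PCIe (DMA under the hood)."""
        self.rt.write_buffer(self.buffers[handle], host_data)

    def copy_from_device(self, handle, nbytes):
        """FPGA DRAM -> host transfer over PCIe."""
        return self.rt.read_buffer(self.buffers[handle], nbytes)

    def device_run(self, insn_handle, insn_count):
        """Launch the VTA kernel over a batch of queued instructions and
        block until the device signals completion (interrupt-based)."""
        self.rt.launch_kernel(self.buffers[insn_handle], insn_count)
        self.rt.wait_for_completion()
```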

Thanks!

Hi zhanghaohit,

Thanks for this proposal; it is a very interesting topic. This proposal looks like a very big change, but there are some parts of it that I don’t quite understand and need your help to clarify.

First, about the motivation part, this topic mentioned:

#1 About "cloud devices may use PCIe instead of shared memory": that makes sense, but it seems like a new driver with PCIe support could fix that, without the need for such a big change.

#2 About "different programming models": could you give more detailed information about this part? Do we have any plan to address the scalability issue for cloud FPGA performance concerns?

Based on the current motivation section, it is a little unclear why we need to make such a big change.

For "Proposal": this part mentions a "framework where any OpenCL-compatible devices can be easily integrated" and says "Vendor-specific optimizations are built-in … SDKs" but "does not limit to specific SDKs". It seems the goal is to create a cross-platform framework, which is a really awesome idea, but the two parts, "any OpenCL-compatible devices" and "vendor-specific optimization", conflict with each other. Could you give more detail about the plan to balance these two parts and how to reduce the related complexity to minimize developer effort?

For "Major works", about "To avoid frequent PCIe copies … we propose to run all middle layers of a computation graph entirely on the FPGA device": I have a couple of questions. First, does this mean the proposal would put all parameter data (input data, weights, bias) into FPGA SRAM at one time? In that case, if the model parameter size is bigger than the FPGA capacity, how would such an issue be handled?

Second, the data transfer may cause large latency. Do we have any solution for hiding the memory latency?

Third, even with a PCIe device, DMA should still work. Could I get some detail about why "PCIe transmission is costly"?

#4 About "auto-copy between layers": it seems like this is talking about inter-operator parallelism. As far as I know, TVM does not analyze or perform inter-operator parallelism yet. Does this proposal plan to add such support to TVM?

For "Major changes", about "there is no need to launch an additional service (e.g., an RPC server)": this is an existing feature related to deployment. After building the network model, VTA can run locally with any language (C++/Python, etc.). Here is a deploy example: https://github.com/apache/incubator-tvm-vta/pull/5

About "Change the VTA runtime to support batched queue synchronization": this seems to be the current VTA logic. Could I get some detail about the difference between the existing logic and this new synchronization logic?

About "DFS traversal": as far as I know, TVM seems to compute the network nodes sequentially rather than via DFS. Could I know what this "DFS traversal" refers to?

About "except the first and last few layers": currently a lot of solutions, including VTA (first conv), do this, but there are also solutions that offload all convs, including the first layer, onto the FPGA. Could I know what the concern is with putting the first/last layers on the CPU in this proposal?

For "Limitations", about "all instructions run sequentially": this may cause a big performance problem, because memory latency is normally hidden by pipelining (TLPP).

Regards

Hua

Thanks @hjiang for the comments.

#1 A new driver with PCIe support is not enough, as there is no mechanism to deal with a mix of CPU and FPGA ops. We have to insert a device_copy op if two adjacent layers reside on different devices. The current VTA allocates all memory on the FPGA, and both the CPU (ARM) ops and the FPGA ops access the same memory area.

#2 "Different programming models" mainly refers to the differences in the hardware implementation (e.g., OpenCL vs. Xilinx HLS). What do you mean by the scalability issue? Could you give more details?

@remotego Could you help elaborate on this part a bit?

It is not necessary to put all parameter data into FPGA SRAM at one time. Actually, we do not change the original behaviour. That is, all the parameter data are put into FPGA DRAM during initialisation, and we run the graph layer by layer. The only thing we do is ensure that all the ops in the middle layers can run on the FPGA (i.e., implement VTA compute and schedules for all middle layers).

I think we do not change anything for this part compared with the original VTA. Since weights/biases are only transferred once per model, I think the cost should be OK?

Yes, DMA is used for PCIe transmission, but the setup cost of a DMA transfer is non-negligible. Compared with the DRAM bus, PCIe DMA is costly.

The "auto-copy" here does not deal with inter-operator parallelism. It is used to make data accessible to the corresponding devices. Here is an example:

MaxPool (on CPU) -> Conv2D (on FPGA) -> xxx

In order for this to work, we have to insert a device_copy between the CPU op and the FPGA op. After the insertion, it becomes:

MaxPool (on CPU) -> device_copy -> Conv2D (on FPGA) -> xxx

Thanks for the information. By this major change, we actually mean from the end user's perspective, i.e., how end users run an inference on the FPGA. There are not many code changes. Currently, we reuse the simulation code (i.e., LocalSession). We'll take a look at the new feature and see what we can borrow. Thanks.

The current VTA does the synchronization for every layer. We propose to provide an option to do it once per inference (i.e., across multiple layers).

For this, I mean the device annotation code here.

Actually, we do put the first/last layers on the CPU. Currently, I think VTA does not support channel sizes < BLOCK, and max/avg pools are not supported either, so we just let these layers run on the CPU. Did I get your point?

Yes, this is a potential performance issue. @remotego Could you elaborate more on this?