Empowering FPGAs For Massively Parallel Applications
Abstract
The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient, high-performance execution of massively parallel applications. FPGAs, with their customizable data-paths, deep pipelining and power efficiency, are among the most viable devices to program and integrate into heterogeneous platforms. At the same time, OpenCL for FPGAs raises many challenges that require in-depth understanding before these capabilities can be fully exploited. Because OpenCL has mainly been used on GPU devices, further research is needed to study the efficiency of OpenCL code on FPGAs and to develop a framework that helps categorize and fully exploit OpenCL's parallelism potential. The aim of this work is to identify, analyze and categorize the semantic differences between the OpenCL parallelism model and the execution model on FPGAs. As a result, we propose a generic taxonomy for classifying FPGAs based on the support available from the OpenCL-HLS tool-chain. New design challenges also emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path customization. We introduce a novel approach for hiding memory stalls on FPGAs when running massively parallel applications. The approach uses sub-kernel parallelism to decouple the actual computation from memory access (memory reads and writes), overlapping the computation of current threads with the memory accesses of future threads (memory prefetching at large scale). This work also proposes an LLVM-based static analyzer that detects the prefetchable data of OpenCL kernels and can be integrated into commercial OpenCL-HLS tools. The sub-kernel parallelism itself is realized through the OpenCL pipe semantic.
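The decoupling idea can be illustrated with a minimal host-side sketch in Python. This is not the thesis implementation and not OpenCL code: a bounded `queue.Queue` stands in for an OpenCL pipe, two threads stand in for the memory-access and compute sub-kernels, and all names (`memory_read_stage`, `compute_stage`, `run_decoupled`, `pipe_depth`) as well as the placeholder computation are assumptions made for illustration only.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the work-item stream (illustrative)


def memory_read_stage(global_mem, pipe):
    """Stand-in for the memory-access sub-kernel: streams operands into the pipe."""
    for value in global_mem:
        pipe.put(value)  # blocks while the FIFO is full, like a blocking pipe write
    pipe.put(_DONE)


def compute_stage(pipe, results):
    """Stand-in for the compute sub-kernel: consumes operands as they arrive."""
    while True:
        value = pipe.get()  # blocks while the FIFO is empty
        if value is _DONE:
            break
        results.append(value * value + 1)  # placeholder computation


def run_decoupled(global_mem, pipe_depth=8):
    """Run both stages concurrently, connected by a bounded FIFO."""
    pipe = queue.Queue(maxsize=pipe_depth)
    results = []
    reader = threading.Thread(target=memory_read_stage, args=(global_mem, pipe))
    worker = threading.Thread(target=compute_stage, args=(pipe, results))
    reader.start()
    worker.start()
    reader.join()
    worker.join()
    return results
```

Because the reader can run up to `pipe_depth` items ahead of the consumer, data for future work-items is fetched while earlier ones are still being computed, which is the overlap the approach relies on. The FIFO preserves order, so results come out in the same order as the input.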
Experimental results for Rodinia benchmarks on an Intel Stratix-V FPGA demonstrate significant performance and energy improvements over the baseline implementation using the Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup with only a 3% increase in resource utilization and a 7% increase in power consumption, which reduces overall energy consumption by more than 40%. To overcome the bottlenecks observed in the commercial OpenCL-HLS tool, we propose an integrated tool-chain for OpenCL-HLS. The new tool-chain combines existing tool-chains for CPUs and GPUs, with LLVM acting as the intermediate machine-level representation in the translation from OpenCL to RTL. This tool-chain is a proposed future extension of our work, and we will release it as open source as a contribution of this thesis.
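As a rough consistency check on the reported energy figure, assume that energy is average power multiplied by execution time, and take the reported bounds at face value (exactly 2x speedup, 7% higher power). These simplifications are ours and are not the measurement method of the thesis.

\[
\frac{E_{\text{new}}}{E_{\text{base}}}
= \frac{P_{\text{new}}}{P_{\text{base}}} \cdot \frac{T_{\text{new}}}{T_{\text{base}}}
= 1.07 \times \frac{1}{2}
= 0.535
\]

Under these assumptions the energy reduction is roughly 46%, in line with the reported reduction of more than 40%. A speedup above 2x would lower the ratio further.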