ThunderGP: HLS-based FPGA graph processing framework
“The entire Internet e-commerce world is driven by graph analytics,” since graph structures naturally represent datasets in many important application domains, such as social networks, network security, and machine learning. The volume of data in these applications creates an urgent need for high-performance graph processing.
Much research has gone into building efficient FPGA-based graph processing accelerators. However, there is still a gap between high-level graph applications and the underlying CPU-FPGA platform: developers must understand hardware details and invest heavy programming effort (for example, programming in hardware description languages, pipeline tuning, and memory optimization). This gap has largely kept data center application developers from adopting FPGAs.
What is unique about ThunderGP?
ThunderGP bridges this gap by bringing both performance and programmability to FPGA-accelerated graph processing. The work has been accepted at FPGA’21.
ThunderGP is an open-source, HLS-based graph processing framework for FPGAs. It supports the Vitis and SDAccel development environments and targets Xilinx Alveo platforms such as the U50, U200, U250, and VCU1525. With ThunderGP, developers only need to write high-level functions using explicit, hardware-independent APIs based on a high-level language (C++). ThunderGP then automatically generates high-performance accelerators and manages their deployment on state-of-the-art FPGA platforms with multiple Super Logic Regions (SLRs).
Built-in accelerator templates. ThunderGP adopts the Gather-Apply-Scatter (GAS) model as an abstraction of various graph algorithms, and implements the model through built-in, highly parallel, and memory-efficient accelerator templates.
Automatic accelerator generation. Automatic accelerator generation produces synthesizable accelerators that unlock the full potential of the underlying FPGA platform. In addition to the built-in accelerator templates, it takes two inputs from developers: the user-defined functions (UDFs) for the scatter, gather, and apply stages of the graph algorithm (from the GAS model), and the FPGA platform model (e.g., U50).
Graph partitioning and scheduling. ThunderGP uses a vertical partitioning approach based on destination vertices, which enables vertex buffering in on-chip RAM without introducing heavy preprocessing operations such as edge sorting.
High-level APIs. ThunderGP provides two sets of C++-based APIs: the Accelerator API (Acc-API) for customizing graph algorithm accelerators and the Host-API for accelerator deployment and execution.
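The partitioning idea above can be illustrated in plain C++. This is a minimal software sketch, not ThunderGP's implementation: the names Edge, partition_by_destination, and verts_per_part are hypothetical, and the sketch assumes each partition covers a fixed-size, contiguous range of destination vertex IDs. Under that assumption, every edge is assigned to a partition in a single pass, with no sorting, and all updates produced by one partition fall into a vertex range small enough to buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical edge record; ThunderGP's own data layout may differ.
struct Edge {
    std::uint32_t src;
    std::uint32_t dst;
};

// Assign each edge to the partition that owns its destination vertex.
// Partition p owns destination IDs in
// [p * verts_per_part, (p + 1) * verts_per_part).
std::vector<std::vector<Edge>> partition_by_destination(
    const std::vector<Edge>& edges,
    std::uint32_t num_vertices,
    std::uint32_t verts_per_part) {
    assert(verts_per_part > 0);
    // Round up so the last, possibly smaller, range gets its own partition.
    const std::uint32_t num_parts =
        (num_vertices + verts_per_part - 1) / verts_per_part;
    std::vector<std::vector<Edge>> parts(num_parts);
    for (const Edge& e : edges) {
        assert(e.dst < num_vertices);
        // One pass over the unsorted edge list; input order is preserved
        // inside each partition.
        parts[e.dst / verts_per_part].push_back(e);
    }
    return parts;
}
```

For example, with 6 vertices and verts_per_part set to 2, an edge whose destination is vertex 5 lands in partition 2, so an accelerator working on that partition would only need to buffer vertices 4 and 5.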
How easy is ThunderGP to use?
We conducted a case study, COVID-19 spread prediction on an Alveo U50 board using Vitis 2020.1, to show how easily ThunderGP can be applied to a real-life graph processing problem.
Timely forecasting of infection prevalence at the population level is important for deploying appropriate containment measures, such as quarantine or social distancing, to mitigate the spread of the virus. Current spread prediction models generally combine a spatial cellular automaton (CA) with a temporal susceptible-infected-removed (SIR) model. A cell represents a residential area (such as a county) and maintains its state (such as its infection rate) through the SIR model, which is updated according to transmissions between neighboring cells. The spread can therefore be formulated as a graph processing problem, where counties and their connections are represented by a graph, and the SIR state is updated through propagation in the graph.
We implemented three propagation models using ThunderGP: the CA-SIR, CA-SEIR, and CA-SAIR models. The dataset comes from the COVID-19 Impact Analysis Platform and contains 3.1K counties and 2.3M connections.
Listing 1 shows an example of implementing an accelerator for the CA-SAIR model. In the scatter phase, each county (a cell) computes the infection rate to push to its neighboring counties from its own infection rate and its connection strength, which quantifies the volume and frequency of inter-county flows. In the gather phase, each county accumulates all infection rates pushed to it. In the apply phase, the accumulated infection rate is used to compute the county's new infection rate. Note that the apply phase involves many user-defined parameters.
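To make the three phases concrete, here is a small software sketch in plain C++. It is not the authors' Listing 1 and does not use the ThunderGP Acc-API: all names (WeightedEdge, scatter_udf, gather_udf, apply_udf, gas_iteration, beta) are hypothetical, and the apply rule is a deliberately simple placeholder, not the CA-SAIR equations, which depend on user-defined parameters not given here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical edge between two counties.
struct WeightedEdge {
    std::size_t src;
    std::size_t dst;
    float strength;  // connection strength between the two counties
};

// Scatter: the value a county pushes along one edge.
inline float scatter_udf(float src_rate, float strength) {
    return src_rate * strength;
}

// Gather: accumulate the values pushed to the same county.
inline float gather_udf(float accumulated, float update) {
    return accumulated + update;
}

// Apply: placeholder rule (an assumption, NOT the CA-SAIR update).
// beta is an illustrative parameter; the result is clamped to 1.0.
inline float apply_udf(float old_rate, float gathered, float beta) {
    return std::min(1.0f, old_rate + beta * gathered);
}

// One sequential iteration over the whole graph, for illustration only.
std::vector<float> gas_iteration(const std::vector<WeightedEdge>& edges,
                                 const std::vector<float>& rate,
                                 float beta) {
    std::vector<float> acc(rate.size(), 0.0f);
    for (const WeightedEdge& e : edges) {
        assert(e.src < rate.size() && e.dst < rate.size());
        acc[e.dst] = gather_udf(acc[e.dst], scatter_udf(rate[e.src], e.strength));
    }
    std::vector<float> next(rate.size());
    for (std::size_t v = 0; v < rate.size(); ++v) {
        next[v] = apply_udf(rate[v], acc[v], beta);
    }
    return next;
}
```

The point of the split is that only the three small functions change between propagation models, while the loop structure stays fixed. In a framework like ThunderGP that fixed part is what the accelerator templates would provide in hardware.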
Table 1 in the figure below quantifies the development effort involved in using ThunderGP for this task and compares its performance with a Python-based CPU implementation. According to the results, the benefits of using ThunderGP for this problem are twofold. First, ThunderGP achieves up to 419x speedup over the CPU-based solution. Being able to predict the spread within a short period of time enables a quick and timely response to the outbreak. Second, the CA-SIR model has evolved rapidly as understanding of the virus has deepened. With ThunderGP, developers need to write only dozens of lines of code to accelerate the predictions, usually within a day, which minimizes the development effort. This initial result is promising, and with the system open-sourced, we believe that more case studies can be performed to further evaluate the programmability improvements.
Remark: Code to format datasets into standard graph formats is not counted. Compile time for the FPGA image is not included in the development time.
How efficient is ThunderGP?
As mentioned earlier, there has been a large amount of research on FPGA-based graph processing accelerators. In this section, we demonstrate the efficiency of ThunderGP through a fair comparison with state-of-the-art designs. Please refer to the ThunderGP technical report for details of the datasets and graph applications.
We first compare ThunderGP with the state-of-the-art RTL-based work, HitGraph, as shown in Table 2. The performance metric is million traversed edges per second (MTEPS). All implementations are based on four SLRs. The difference is that HitGraph does not account for the overhead of using multiple SLRs, since its performance is based on simulation and is simply scaled to the memory bandwidth of multiple SLRs. Even so, ThunderGP achieves up to 2.9x speedup. More importantly, the ThunderGP designs are executed on real hardware.
We then compare ThunderGP with HLS-based frameworks. Since their experiments were not performed with multiple SLRs and thus have less memory bandwidth, for a fair comparison we use bandwidth efficiency (MTEPS/(GB/s)) as a metric. As shown in Table 3, ThunderGP achieves up to 29.2x absolute speedup and 12.3x bandwidth efficiency improvement over GraphOps, and 5.2x absolute speedup and 2.4x bandwidth efficiency improvement over Chen et al.
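The two metrics used in these comparisons are simple ratios, sketched below in C++ for clarity. The function names and the sample numbers in the usage note are illustrative assumptions, not measurements from Tables 2 or 3.

```cpp
#include <cassert>
#include <cstdint>

// Million traversed edges per second (MTEPS).
inline double mteps(std::uint64_t edges_traversed, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(edges_traversed) / seconds / 1e6;
}

// Bandwidth efficiency: MTEPS per GB/s of available memory bandwidth.
// Dividing by bandwidth lets designs measured on platforms with different
// memory systems be compared on a more equal footing.
inline double bandwidth_efficiency(double mteps_value,
                                   double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return mteps_value / bandwidth_gb_per_s;
}
```

For example, a hypothetical design that traverses 2,000,000,000 edges in 0.5 seconds reaches 4000 MTEPS. If it has 64 GB/s of memory bandwidth available, its bandwidth efficiency is 62.5 MTEPS/(GB/s).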