Search Results

Now showing 1 - 10 of 14
  • Hybrid dot-product calculation for convolutional neural networks in FPGA
    Publication . Véstias, Mário; Duarte, Rui Policarpo; De Sousa, Jose; Cláudio de Campos Neto, Horácio
    Convolutional Neural Networks (CNNs) are widely used in edge devices for security, surveillance, and many other applications. Running CNNs on embedded devices is a design challenge since these models require high computing power and large memory storage. Data quantization is an optimization technique applied to CNNs to reduce their computing and memory requirements. It reduces the number of bits used to represent weights and activations, which in turn reduces the size of the operands and of the memory. The technique is more effective when hybrid quantization is used, in which data in different layers may have different bit widths. This article proposes a new hardware module to calculate the dot-products of CNNs with hybrid quantization. The module improves the implementation of CNNs in low-density FPGAs, where the same module runs dot-products of different layers with different data quantizations. We show implementation results on a ZYNQ7020 and compare them with state-of-the-art works. The proposed module achieves improvements in both area and performance.
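    A minimal software sketch of the arithmetic involved (not the paper's hardware module): a dot-product over weights and activations quantized to layer-specific bit widths, with integer accumulation and a final rescale. The function names and the symmetric quantization scheme are illustrative assumptions.

    ```python
    import numpy as np

    def quantize(x, bits):
        # Symmetric uniform quantization to signed integers of the given width;
        # returns the integer codes and the scale needed to dequantize.
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / qmax if np.any(x) else 1.0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    def hybrid_dot(weights, activations, w_bits, a_bits):
        # Dot-product with layer-specific bit widths for weights and activations.
        # In hardware one module is reused across layers; here we only emulate
        # the integer arithmetic and rescale once at the end.
        qw, sw = quantize(weights, w_bits)
        qa, sa = quantize(activations, a_bits)
        acc = np.dot(qw.astype(np.int64), qa.astype(np.int64))  # wide accumulator
        return acc * sw * sa

    rng = np.random.default_rng(0)
    w, a = rng.standard_normal(64), rng.standard_normal(64)
    print(np.dot(w, a), hybrid_dot(w, a, w_bits=4, a_bits=8))
    ```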
  • Onboard processing of synthetic aperture radar backprojection algorithm in FPGA
    Publication . Mota, David; Cruz, Helena; Miranda, Pedro R.; Duarte, Rui Policarpo; De Sousa, Jose; Cláudio de Campos Neto, Horácio; Véstias, Mário
    Synthetic aperture radar (SAR) is a microwave technique for extracting image information of a target. Electromagnetic waves reflected from the target are acquired by aircraft or satellite receivers and traditionally sent to a ground station to be processed, since the image-formation algorithms are computationally demanding. However, novel applications require real-time analysis and decisions, and so onboard processing is necessary. Running computationally demanding algorithms on onboard embedded systems with limited energy and computational capacity is a challenge. This article proposes a configurable hardware core for the execution of the backprojection algorithm with high performance and energy efficiency. The original backprojection algorithm is restructured to expose computational parallelism and then optimized by replacing floating-point with fixed-point arithmetic. The backprojection core was integrated into a system-on-chip architecture and implemented in a field-programmable gate array. The proposed solution runs the optimized backprojection algorithm over images of sizes 512 x 512 and 1024 x 1024 in 0.14 s (0.41 J) and 1.11 s (3.24 J), respectively. The architecture is 2.6x faster and consumes 13x less energy than an embedded Jetson TX2 GPU. The solution is scalable, offering a tradeoff between performance and resource utilization.
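    A simplified floating-point reference of time-domain backprojection, assuming a flat ground plane and range-compressed input, to show the per-pixel range and phase computation that the paper restructures for parallelism and converts to fixed point. All names and the geometry are illustrative, not the paper's optimized core.

    ```python
    import numpy as np

    def backprojection(pulses, positions, range_bins, grid_x, grid_y, wavelength):
        # pulses:     (n_pulses, n_bins) complex range-compressed echoes
        # positions:  (n_pulses, 3) platform position for each pulse
        # range_bins: (n_bins,) sorted slant range of each sample
        # grid_x/y:   image pixel coordinates on the ground plane (z = 0)
        image = np.zeros((len(grid_y), len(grid_x)), dtype=np.complex128)
        px, py = np.meshgrid(grid_x, grid_y)
        for echo, pos in zip(pulses, positions):
            # distance from the antenna to every pixel for this pulse
            r = np.sqrt((px - pos[0])**2 + (py - pos[1])**2 + pos[2]**2)
            # nearest range bin holding the echo for that distance
            idx = np.clip(np.searchsorted(range_bins, r), 0, len(range_bins) - 1)
            # phase-match and accumulate; this inner loop is what gets
            # parallelized and moved to fixed point in hardware
            image += echo[idx] * np.exp(4j * np.pi * r / wavelength)
        return image
    ```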
  • Lite-CNN: a high-performance architecture to execute CNNs in low density FPGAs
    Publication . Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio
    Due to the computational complexity of Convolutional Neural Networks (CNNs), high-performance platforms are generally considered for their execution. However, CNNs are very useful in embedded systems, and executing them right next to the source of data has many advantages, such as avoiding the need for data communication. In this paper, we propose an architecture for CNN inference (Lite-CNN) that can achieve high performance in low-density FPGAs. Lite-CNN adopts a fixed-point representation for both neurons and weights, which has already been shown to be sufficient for most CNNs. Also, with a simple and known dot-product reorganization, the number of multiplications is reduced to half. We show implementation results for 8-bit fixed-point on a ZYNQ7020 and extrapolate to other, larger FPGAs. Lite-CNN achieves 410 GOPs on a ZYNQ7020.
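    The abstract does not spell out the reorganization; a well-known identity that halves the multiplications in a dot product pairs its terms, and may be what is meant here. A sketch under that assumption: the per-pair weight products are precomputed offline, and the per-pair activation products are computed once and shared by every filter that consumes the same activations.

    ```python
    import numpy as np

    def halved_dot(w, a, w_pair_products):
        # Uses the pairwise identity
        #   w0*a0 + w1*a1 == (w0 + a1) * (w1 + a0) - w0*w1 - a0*a1,
        # which needs one multiplication per *pair* of terms plus corrections.
        # w_pair_products (w0*w1 per pair) is precomputed offline; the
        # activation pair products are reused across filters, so the effective
        # multiplication count per dot-product is roughly halved.
        w0, w1 = w[0::2], w[1::2]
        a0, a1 = a[0::2], a[1::2]
        a_pair_products = a0 * a1  # computed once, shared across filters
        return np.sum((w0 + a1) * (w1 + a0) - w_pair_products - a_pair_products)

    rng = np.random.default_rng(1)
    w, a = rng.integers(-128, 128, 16), rng.integers(-128, 128, 16)
    precomputed = w[0::2] * w[1::2]  # offline, since weights are constant
    print(np.dot(w, a), halved_dot(w, a, precomputed))  # identical results
    ```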
  • Implementing CNNs using a linear array of full mesh CGRAs
    Publication . Mário, Valter; Lopes, João D.; Véstias, Mário; De Sousa, Jose
    This paper presents an implementation of a Convolutional Neural Network (CNN) algorithm using a linear array of full-mesh, dynamically and partially reconfigurable Coarse-Grained Reconfigurable Arrays (CGRAs). Accelerating CNNs using GPUs and FPGAs is more common, and few works address CNN acceleration using CGRAs, even though CGRAs can bring size and power advantages compared to GPUs and FPGAs. The contribution of this paper is to study the performance of full-mesh, dynamically and partially reconfigurable CGRAs for CNN acceleration. The CGRA used is an improved version of the previously published Versat CGRA, adding multi-core support and pre-silicon configurability. The results show that the proposed CGRA is as easy to program as the original full-mesh Versat CGRA, and that its performance and power consumption scale linearly with the number of instances.
  • Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning
    Publication . Véstias, Mário; Duarte, Rui Policarpo; De Sousa, Jose; Cláudio de Campos Neto, Horácio
    Edge devices are becoming smarter with the integration of machine learning methods, such as deep learning, and are therefore used in many application domains where decisions have to be made without human intervention. Deep learning and, in particular, convolutional neural networks (CNNs) are more efficient than previous algorithms for several computer vision applications, such as security and surveillance, where image and video analysis are required. This better efficiency comes at the cost of high computation and memory requirements. Hence, running CNNs on embedded computing devices is a challenge for both algorithm and hardware designers. New processing devices, dedicated system architectures and optimizations of the networks have been researched to deal with these computation requirements. In this paper, we improve the inference execution times of CNNs in low-density FPGAs (Field-Programmable Gate Arrays) using fixed-point arithmetic, zero-skipping and weight pruning. The developed architecture supports the execution of large CNNs in FPGA devices with reduced on-chip memory and computing resources. With the proposed architecture, it is possible to infer an image in AlexNet in 2.9 ms on a ZYNQ7020 and 1.0 ms on a ZYNQ7045 with less than 1% accuracy degradation. These results improve on previous state-of-the-art architectures for CNN inference.
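    A small sketch of how weight pruning and zero-skipping interact, assuming a simple index-value encoding of the surviving weights (the paper's actual encoding and hardware scheduling are not described in the abstract): only nonzero weights issue a multiply-accumulate, so work scales with the pruned size rather than the layer size.

    ```python
    import numpy as np

    def prune(weights, threshold):
        # Weight pruning: zero out small weights, then keep only the nonzero
        # ones together with their positions (index-value encoding).
        w = np.where(np.abs(weights) < threshold, 0, weights)
        idx = np.flatnonzero(w)
        return idx, w[idx]

    def sparse_dot(idx, vals, activations):
        # Zero-skipping dot-product: a MACC is issued only for surviving weights.
        return np.dot(vals, activations[idx])

    rng = np.random.default_rng(2)
    w, a = rng.standard_normal(1000), rng.standard_normal(1000)
    idx, vals = prune(w, threshold=1.0)
    dense = np.dot(np.where(np.abs(w) < 1.0, 0, w), a)
    print(f"{len(idx)} of {len(w)} MACCs issued;",
          "dense:", round(dense, 4), "sparse:", round(sparse_dot(idx, vals, a), 4))
    ```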
  • Parallel dot-products for deep learning on FPGA
    Publication . Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio
    Deep neural networks have recently shown great results in a vast set of image applications. The associated deep learning models are computationally very demanding and, therefore, several hardware solutions have been proposed to accelerate their computation. FPGAs have recently shown very good performance for this kind of application and are therefore considered a promising platform to accelerate the execution of deep learning algorithms. A common operation in these algorithms is the multiply-accumulate (MACC) used to calculate dot-products. Since many dot-products can be calculated in parallel, as long as memory bandwidth is available, it is very important to implement this operation very efficiently to increase the density of MACC units in an FPGA. In this paper, we propose an implementation of parallel MACC units in FPGA for dot-product operations with very high performance/area ratios, using a mix of DSP blocks and LUTs. We consider 8-bit fixed-point representations, but the method can be applied to other bit widths. The method allows us to achieve TOPs-level performance, even on low-cost FPGAs.
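    One widely known way to raise MACC density, which may differ from the paper's exact method, is to pack two narrow products that share a multiplicand into a single wide multiplication, so one DSP block yields two results per cycle. A sketch of the unsigned 8-bit case (real hardware adds sign corrections):

    ```python
    def packed_mult(a1, a2, b, guard_bits=18):
        # Two 8-bit x 8-bit products from one wide multiplication: a1 and a2
        # share the multiplier b. guard_bits must exceed the low product's
        # width (16 bits here) so the two results cannot overlap.
        packed = (a1 << guard_bits) + a2         # both operands in one word
        product = packed * b                     # one wide multiply
        low = product & ((1 << guard_bits) - 1)  # a2 * b
        high = product >> guard_bits             # a1 * b
        return high, low

    for a1, a2, b in [(17, 200, 91), (255, 255, 255)]:
        high, low = packed_mult(a1, a2, b)
        assert (high, low) == (a1 * b, a2 * b)
        print(a1 * b, a2 * b, "==", high, low)
    ```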
  • Coarse-grained reconfigurable computing with the versat architecture
    Publication . D. Lopes, João; Véstias, Mário; Duarte, Rui Policarpo; Neto, Horácio C; De Sousa, Jose
    Reconfigurable computing architectures allow the adaptation of the underlying datapath to the algorithm. The granularity of the datapath elements and the data width determine the granularity of the architecture and its programming flexibility. Coarse-grained architectures have shown the right balance between programmability and performance. This paper provides an overview of coarse-grained reconfigurable architectures and describes Versat, a Coarse-Grained Reconfigurable Array (CGRA) with self-generated partial reconfiguration, presented as a case study for better understanding these architectures. Unlike most existing approaches, which mainly use pre-compiled configurations, a Versat program can generate and apply myriad on-the-fly configurations. Partial reconfiguration plays a central role in this approach, as it speeds up the generation of incrementally different configurations. The reconfigurable array has a complete graph topology, which yields unprecedented programmability, including assembly programming. Besides being useful for optimising programs, assembly programming is invaluable for working around post-silicon hardware, software, or compiler issues. Results on core area, frequency, power, and performance running different codes are presented and compared to other implementations.
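    A minimal sketch of the idea behind quickly applying incrementally different configurations: write only the configuration registers that changed relative to the configuration currently loaded. The dict-based model and register names are illustrative, not the Versat register map.

    ```python
    def apply_partial_reconfiguration(current, target, write_register):
        # Configurations are modelled as dicts {register_name: value};
        # write_register stands in for the memory-mapped store. Only
        # registers whose value differs are rewritten.
        writes = 0
        for reg, value in target.items():
            if current.get(reg) != value:
                write_register(reg, value)
                current[reg] = value
                writes += 1
        return writes

    cfg = {"alu0.op": "add", "mem0.start": 0, "mem0.incr": 1}
    nxt = {"alu0.op": "add", "mem0.start": 64, "mem0.incr": 1}
    n = apply_partial_reconfiguration(cfg, nxt, lambda r, v: print("write", r, v))
    print(n, "register(s) rewritten")  # only mem0.start changed
    ```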
  • K-means clustering on CGRA
    Publication . Lopes, João D.; De Sousa, Jose; Neto, Horácio; Véstias, Mário
    In this paper we present a k-means clustering algorithm for the Versat architecture, a small and low-power Coarse-Grained Reconfigurable Array (CGRA). This algorithm targets ultra-low-energy devices where using a GPU or FPGA accelerator is out of the question. The Versat architecture has been enhanced with pointer support, the possibility of using the address generators for general purposes, and cumulative and conditional operations for the ALUs. The implementation is based on two hardware datapaths for the two basic steps of the algorithm: the assignment step and the update step. The program is fully parameterizable in the number of datapoints, centroids and dimensions, and in the memory pointers for reading and writing the data. The execution time scales linearly with the number of datapoints, centroids or dimensions. The results show that the new Versat core is 9.4x smaller than an ARM Cortex-A9 core, runs the algorithm 3.8x faster and consumes 46.3x less energy.
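    A NumPy reference of the two steps that the paper maps to dedicated datapaths, the assignment step and the update step; this only shows the computation each datapath performs, not how Versat schedules it.

    ```python
    import numpy as np

    def kmeans_step(points, centroids):
        # Assignment step: squared Euclidean distance from every point to
        # every centroid, then pick the nearest centroid per point.
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        return labels, new_centroids

    rng = np.random.default_rng(3)
    pts = rng.standard_normal((200, 2))
    cent = pts[:3].copy()
    for _ in range(10):                        # cost scales linearly with points,
        labels, cent = kmeans_step(pts, cent)  # centroids, and dimensions
    print(cent)
    ```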
  • A full featured configurable accelerator for object detection with YOLO
    Publication . Pestana, Daniel; Miranda, Pedro R.; Lopes, João D.; Duarte, Rui; Véstias, Mário; Neto, Horácio C; De Sousa, Jose
    Object detection and classification is an essential task of computer vision. A very efficient algorithm for detection and classification is YOLO (You Only Look Once). We consider hardware architectures to run YOLO in real time on embedded platforms. Designing a new dedicated accelerator for each new version of YOLO is not feasible given the fast delivery of new versions. This work's primary goal is to design a configurable and scalable core for creating specific object detection and classification systems based on YOLO, targeting embedded platforms. The core accelerates the execution of all the algorithm steps, including pre-processing, model inference and post-processing. It considers a fixed-point format, linearised activation functions, batch-normalisation folding, and a hardware structure that exploits most of the available parallelism in CNN processing. The proposed core is configured for real-time execution of YOLOv3-Tiny and YOLOv4-Tiny, integrated into a RISC-V-based system-on-chip architecture and prototyped in an UltraScale XCKU040 FPGA (Field Programmable Gate Array). The solution achieves 32 and 31 frames per second for YOLOv3-Tiny and YOLOv4-Tiny, respectively, with a 16-bit fixed-point format. Compared to previous proposals, it improves the frame rate while achieving higher performance efficiency. The performance, area efficiency and configurability of the proposed core enable the fast development of real-time YOLO-based object detectors on embedded systems.
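    Of the listed optimisations, batch-normalisation folding has a standard closed form: the normalisation is absorbed into the preceding convolution's weights and bias, so inference needs no separate BN step. A sketch assuming an (out_ch, in_ch, kh, kw) weight layout:

    ```python
    import numpy as np

    def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
        # y = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
        # becomes a plain convolution with rescaled weights and a new bias.
        scale = gamma / np.sqrt(var + eps)          # one factor per channel
        w_folded = w * scale[:, None, None, None]   # rescale each output channel
        b_folded = (b - mean) * scale + beta
        return w_folded, b_folded

    rng = np.random.default_rng(4)
    w = rng.standard_normal((16, 3, 3, 3)); b = rng.standard_normal(16)
    gamma, beta = rng.standard_normal(16), rng.standard_normal(16)
    mean, var = rng.standard_normal(16), rng.random(16) + 0.1
    wf, bf = fold_batchnorm(w, b, gamma, beta, mean, var)
    print(wf.shape, bf.shape)
    ```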
  • A fast and scalable architecture to run convolutional neural networks in low density FPGAs
    Publication . Véstias, Mário; Duarte, Rui; De Sousa, Jose; Neto, Horácio C
    Deep learning and, in particular, convolutional neural networks (CNNs) achieve very good results on several computer vision applications, such as security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and are therefore usually implemented in high-performance computing platforms or devices. Running CNNs on embedded platforms or devices with low computational and memory resources requires a careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field-Programmable Gate Arrays (FPGAs) can achieve this efficiency, since the programmable hardware fabric can be tailored for each specific network. In this paper, a very efficient configurable architecture for CNN inference targeting FPGAs of any density is described. The architecture considers fixed-point arithmetic and image batching to reduce computational, memory and memory-bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs in any FPGA device, including those with small on-chip memory and few logic resources. With the proposed architecture, it is possible to infer an image in AlexNet in 4.3 ms on a ZYNQ7020 and 1.2 ms on a ZYNQ7045.
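    A toy illustration of why image batching reduces memory bandwidth: each block of weights is fetched from external memory once and reused for every image in the batch, so per-image weight traffic drops by the batch size. The load_weights callback and the matrix-multiply stand-in for a layer are illustrative assumptions.

    ```python
    import numpy as np

    def batched_layer(images, weights, load_weights):
        # One off-chip weight fetch is amortised across the whole batch.
        fetches = 0
        w = load_weights(weights); fetches += 1  # one fetch for the batch...
        outputs = [img @ w for img in images]    # ...reused for every image
        return np.stack(outputs), fetches

    rng = np.random.default_rng(5)
    batch = rng.standard_normal((8, 32))
    w = rng.standard_normal((32, 16))
    out, fetches = batched_layer(batch, w, lambda w: w)
    print(out.shape, f"{fetches} weight fetch for {len(batch)} images")
    ```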