Search Results (showing 1 - 9 of 9)
- Onboard processing of synthetic aperture radar backprojection algorithm in FPGA
  Publication. Mota, David; Cruz, Helena; Miranda, Pedro R.; Duarte, Rui Policarpo; De Sousa, Jose; Cláudio de Campos Neto, Horácio; Véstias, Mário
  Synthetic aperture radar is a microwave technique for extracting image information about a target. Electromagnetic waves reflected from the target are acquired by aircraft or satellite receivers and sent to a ground station for processing, since the image-formation algorithms are computationally demanding. However, novel applications require real-time analysis and decisions, so onboard processing is necessary. Running computationally demanding algorithms on onboard embedded systems with limited energy and computational capacity is a challenge. This article proposes a configurable hardware core for executing the backprojection algorithm with high performance and energy efficiency. The original backprojection algorithm is restructured to expose computational parallelism and then optimized by replacing floating-point with fixed-point arithmetic. The backprojection core was integrated into a system-on-chip architecture and implemented in a field-programmable gate array (FPGA). The proposed solution runs the optimized backprojection algorithm over images of sizes 512 x 512 and 1024 x 1024 in 0.14 s (0.41 J) and 1.11 s (3.24 J), respectively. The architecture is 2.6x faster and consumes 13x less energy than an embedded Jetson TX2 GPU. The solution is scalable, offering a tradeoff between performance and resource utilization. (A hedged fixed-point backprojection sketch follows this list.)
- Implementing CNNs using a linear array of full mesh CGRAs
  Publication. Mário, Valter; Lopes, João D.; Véstias, Mário; De Sousa, Jose
  This paper presents an implementation of a Convolutional Neural Network (CNN) algorithm using a linear array of full mesh, dynamically and partially reconfigurable Coarse-Grained Reconfigurable Arrays (CGRAs). Accelerating CNNs with GPUs and FPGAs is more common, and few works address CNN acceleration using CGRAs, even though CGRAs can bring size and power advantages over GPUs and FPGAs. The contribution of this paper is to study the performance of full mesh, dynamically and partially reconfigurable CGRAs for CNN acceleration. The CGRA used is an improved version of the previously published Versat CGRA, adding multi-CGRA-core support and pre-silicon configurability. The results show that the proposed CGRA is as easy to program as the original full mesh Versat CGRA, and that its performance and power consumption scale linearly with the number of instances. (A sketch of partitioning a layer across cores follows this list.)
- Coarse-grained reconfigurable computing with the Versat architecture
  Publication. D. Lopes, João; Véstias, Mário; Duarte, Rui Policarpo; Neto, Horácio C; De Sousa, Jose
  Reconfigurable computing architectures allow the adaptation of the underlying datapath to the algorithm. The granularity of the datapath elements and the data width determine the granularity of the architecture and its programming flexibility. Coarse-grained architectures have shown the right balance between programmability and performance. This paper provides an overview of coarse-grained reconfigurable architectures and describes Versat, a Coarse-Grained Reconfigurable Array (CGRA) with self-generated partial reconfiguration, presented as a case study for better understanding these architectures. Unlike most existing approaches, which mainly use pre-compiled configurations, a Versat program can generate and apply myriad on-the-fly configurations. Partial reconfiguration plays a central role in this approach, as it speeds up the generation of incrementally different configurations. The reconfigurable array has a complete-graph topology, which yields unprecedented programmability, including assembly programming. Besides being useful for optimising programs, assembly programming is invaluable for working around post-silicon hardware, software, or compiler issues. Results on core area, frequency, power, and performance running different codes are presented and compared to other implementations. (A software model of partial reconfiguration follows this list.)
- A full featured configurable accelerator for object detection with YOLO
  Publication. Pestana, Daniel; Miranda, Pedro R.; Lopes, João D.; Duarte, Rui; Véstias, Mário; Neto, Horácio C; De Sousa, Jose
  Object detection and classification is an essential task of computer vision. A very efficient algorithm for detection and classification is YOLO (You Only Look Once). We consider hardware architectures to run YOLO in real time on embedded platforms. Designing a new dedicated accelerator for each new version of YOLO is not feasible, given the fast delivery of new versions. This work's primary goal is to design a configurable and scalable core for creating specific object detection and classification systems based on YOLO, targeting embedded platforms. The core accelerates all the algorithm steps, including pre-processing, model inference, and post-processing. It considers a fixed-point format, linearised activation functions, batch-normalisation folding, and a hardware structure that exploits most of the available parallelism in CNN processing. The proposed core is configured for real-time execution of YOLOv3-Tiny and YOLOv4-Tiny, integrated into a RISC-V-based system-on-chip architecture, and prototyped on an UltraScale XCKU040 FPGA (Field-Programmable Gate Array). The solution achieves 32 and 31 frames per second for YOLOv3-Tiny and YOLOv4-Tiny, respectively, with a 16-bit fixed-point format. Compared to previous proposals, it improves the frame rate at higher performance efficiency. The performance, area efficiency, and configurability of the proposed core enable fast development of real-time YOLO-based object detectors on embedded systems. (A batch-normalisation folding sketch follows this list.)
- A fast and scalable architecture to run convolutional neural networks in low density FPGAs
  Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Neto, Horácio C
  Deep learning and, in particular, convolutional neural networks (CNNs) achieve very good results in several computer vision applications, such as security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and are therefore usually implemented on high-performance computing platforms or devices. Running CNNs on embedded platforms or devices with low computational and memory resources requires careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field-Programmable Gate Arrays (FPGAs) can achieve this efficiency, since the programmable hardware fabric can be tailored to each specific network. This paper describes a very efficient configurable architecture for CNN inference targeting FPGAs of any density. The architecture uses fixed-point arithmetic and image batching to reduce computation, memory, and memory-bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs on any FPGA device, including those with small on-chip memory and logic resources. With the proposed architecture, it is possible to infer an image with AlexNet in 4.3 ms on a ZYNQ7020 and 1.2 ms on a ZYNQ7045. (A fixed-point multiply-accumulate sketch follows this list.)
- Moving deep learning to the edge
  Publication. Véstias, Mário; Duarte, Rui Policarpo; De Sousa, Jose; Neto, Horácio C
  Deep learning is now present in a wide range of services and applications, replacing and complementing other machine learning algorithms. Performing training and inference of deep neural networks using the cloud computing model is not viable for applications where low latency is required. Furthermore, the rapid proliferation of the Internet of Things will generate a large volume of data to be processed, which will soon overload the capacity of cloud servers. One solution is to process the data at the edge devices themselves, in order to alleviate cloud-server workloads and improve latency. However, edge devices are less powerful than cloud servers, and many are subject to energy constraints. Hence, new resource- and energy-oriented deep learning models are required, as well as new computing platforms. This paper reviews the main research directions for edge-computing deep learning algorithms.
- Configurable hardware core for IoT object detection
  Publication. Miranda, Pedro R.; Pestana, Daniel; D. Lopes, João; Duarte, Rui Policarpo; Véstias, Mário; Neto, Horácio C; De Sousa, Jose
  Object detection is an important task for many applications, such as transportation, security, and medical applications. Many of these applications must run on edge devices to make local decisions, so it is necessary to provide low-cost, fast solutions for object detection. This work proposes a configurable hardware core on a field-programmable gate array (FPGA) for object detection. The configurability of the core allows its deployment on target devices with diverse hardware resources. The object detection accelerator is based on YOLO, chosen for its good accuracy at moderate computational complexity. The solution was applied to the design of a core to accelerate Tiny-YOLOv3, a CNN developed for constrained environments, but it can be applied to other YOLO versions. The core was integrated into a full system-on-chip solution and tested with the COCO dataset. It achieves from 7 to 14 FPS on a low-cost ZYNQ7020 FPGA, depending on the quantization, with an accuracy reduction of 2.1 to 1.4 points of mAP50. (A configurable-bit-width quantization sketch follows this list.)
- A configurable architecture for running hybrid convolutional neural networks in low-density FPGAs
  Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio
  Convolutional neural networks have become the state of the art in machine learning for a vast set of applications, especially image classification and object detection. Running inference on these models at the edge has several advantages, including real-time performance and data privacy. The high computing and memory requirements of convolutional neural networks have been major obstacles to the broader deployment of CNNs on edge devices. Data quantization is an optimization method that reduces the number of bits used to represent the weights and activations of a network model, minimizing storage requirements and computing complexity. Quantization can be applied at the layer level, by using different bit widths in different layers: this is called hybrid quantization. This article proposes a new, efficient, and configurable architecture for running CNNs with hybrid quantization in low-density Field-Programmable Gate Arrays (FPGAs) targeting edge devices. The architecture was implemented on Xilinx ZYNQ7020 and ZYNQ7045 devices running the AlexNet and VGG16 networks. Running AlexNet, the architecture delivers up to 508 images per second on the ZYNQ7020 and 1639 images per second on the ZYNQ7045. Running VGG16, it delivers up to 43 images per second on the ZYNQ7020 and 81 images per second on the ZYNQ7045. The proposed hybrid architecture achieves up to a 13.7x improvement in performance over state-of-the-art solutions, with small accuracy degradation. (A per-layer quantization configuration sketch follows this list.)
- IOb-Cache: a high-performance configurable open-source cache
  Publication. Roque, João V.; Lopes, João D.; Véstias, Mário; De Sousa, Jose
  The open-source hardware movement has made available all sorts of open-source implementations of peripherals and other system-on-chip modules. Despite the recent advent of open-source hardware, the available open-source caches have low configurability, lack support for single-cycle pipelined memory accesses, and use non-standard hardware interfaces. In this paper, the IObundle cache (IOb-Cache), a high-performance configurable open-source cache, is proposed, developed, and deployed. The cache has front-end and back-end modules for fast integration with processors and memory controllers. The front-end module supports the native interface, and the back-end module supports both the native interface and the standard Advanced eXtensible Interface (AXI). The cache is highly configurable in structure and access policies. The back-end can be configured to read bursts of multiple words per transfer to take advantage of the available memory bandwidth. To the best of our knowledge, IOb-Cache is currently the only configurable cache that supports pipelined Central Processing Unit (CPU) interfaces and an AXI memory bus interface. Additionally, it has a write-through buffer and an independent controller, enabling 1-cycle writes most of the time together with 1-cycle reads, while previous works support only 1-cycle reads. This brings the best-case cycles per instruction (CPI) close to one (1.055). IOb-Cache is integrated into the IOb System-on-Chip (IOb-SoC) GitHub repository, which has 29 stars and is already used in 50 projects (forks). (A software model of the write-through buffer follows this list.)
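
For the backprojection entry above, here is a minimal sketch of the per-pixel accumulation with the complex matched-filter multiply done in fixed point instead of floating point. The function name, the Q1.15 format, and the nearest-bin range interpolation are assumptions for illustration; the paper's actual restructuring and word lengths are not given in the abstract.

```c
#include <stdint.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define QF 15                                        /* Q1.15 fractional bits (assumption) */
#define TO_FIX(x) ((int32_t)lrintf((x) * (float)(1 << QF)))

/* One pixel's backprojection sum over all pulses. 'data_i'/'data_q'
 * hold range-compressed samples in Q1.15; the interface is illustrative. */
void backproject_pixel(const int16_t *data_i, const int16_t *data_q,
                       int n_pulses, int n_range,
                       const float *range_to_pixel,  /* metres, one per pulse */
                       float range_res, float wavelength,
                       int32_t *out_i, int32_t *out_q)
{
    int64_t si = 0, sq = 0;
    for (int p = 0; p < n_pulses; p++) {
        float r = range_to_pixel[p];
        int bin = (int)(r / range_res + 0.5f);       /* nearest range bin */
        if (bin < 0 || bin >= n_range)
            continue;
        /* Matched-filter phase; in hardware this would come from a sine/cosine LUT. */
        float ph = 4.0f * (float)M_PI * r / wavelength;
        int32_t c = TO_FIX(cosf(ph));
        int32_t s = TO_FIX(sinf(ph));
        int32_t di = data_i[p * n_range + bin];
        int32_t dq = data_q[p * n_range + bin];
        /* Complex multiply-accumulate; products rescaled back to Q1.15. */
        si += ((int64_t)di * c - (int64_t)dq * s) >> QF;
        sq += ((int64_t)di * s + (int64_t)dq * c) >> QF;
    }
    *out_i = (int32_t)si;
    *out_q = (int32_t)sq;
}
```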
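
For the linear-array CGRA entry, this sketch shows one way a convolutional layer's output rows could be partitioned across identical accelerator instances so that run time shrinks roughly in proportion to the core count, consistent with the linear scaling the paper reports. `core_run_conv_rows` is a hypothetical stand-in; the real Versat programming interface is not described in the abstract.

```c
#include <stdio.h>

/* Hypothetical per-core dispatch: core 'id' computes output rows
 * [row0, row0 + nrows) of one convolutional layer. A real version would
 * program that CGRA instance; this stub just records the partition. */
static void core_run_conv_rows(int id, int row0, int nrows)
{
    printf("core %d: rows %d..%d\n", id, row0, row0 + nrows - 1);
}

/* Split a layer's output rows evenly across 'ncores' instances. */
void run_layer_on_array(int out_rows, int ncores)
{
    int base  = out_rows / ncores;
    int extra = out_rows % ncores;   /* first 'extra' cores take one more row */
    int row = 0;
    for (int id = 0; id < ncores; id++) {
        int nrows = base + (id < extra ? 1 : 0);
        core_run_conv_rows(id, row, nrows);
        row += nrows;
    }
}
```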
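
For the Versat entry, this is a software model of the self-generated partial reconfiguration idea: a program builds the next configuration at run time and writes only the words that differ from the currently loaded one, which is what makes generating incrementally different configurations fast. The register-file size and the `cfg_write` stand-in are invented for illustration; Versat's real configuration space is not given in the abstract.

```c
#include <stdint.h>

#define CFG_WORDS 64                 /* size of the model's config space (assumption) */
static uint32_t shadow[CFG_WORDS];   /* configuration currently loaded */

/* Stand-in for a memory-mapped configuration-register write. */
static void cfg_write(int addr, uint32_t value) { shadow[addr] = value; }

/* Apply 'next' as a partial reconfiguration: only changed words are written. */
void apply_partial(const uint32_t next[CFG_WORDS])
{
    for (int i = 0; i < CFG_WORDS; i++)
        if (next[i] != shadow[i])
            cfg_write(i, next[i]);
}
```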
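
The YOLO accelerator entry mentions batch-normalisation folding. The standard transformation merges a BN layer (gamma, beta, mean, variance) into the preceding convolution's weights and biases, so inference needs only a multiply-accumulate per output. A minimal float sketch, applied before quantization; the interface is illustrative.

```c
#include <math.h>

/* Fold BN into conv: w' = w * gamma / sqrt(var + eps),
 * b' = gamma * (b - mean) / sqrt(var + eps) + beta. */
void fold_batchnorm(float *w, float *b, int out_ch, int w_per_ch,
                    const float *gamma, const float *beta,
                    const float *mean, const float *var, float eps)
{
    for (int oc = 0; oc < out_ch; oc++) {
        float scale = gamma[oc] / sqrtf(var[oc] + eps);
        for (int k = 0; k < w_per_ch; k++)
            w[oc * w_per_ch + k] *= scale;       /* rescale all weights of channel oc */
        b[oc] = scale * (b[oc] - mean[oc]) + beta[oc];
    }
}
```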
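
For the low-density-FPGA CNN entry, this is what the fixed-point multiply-accumulate at the heart of such an architecture can look like. The Q8.8 format is an assumption (the abstract only says "fixed-point arithmetic"); products are accumulated at 64 bits to avoid overflow, then rescaled and saturated.

```c
#include <stdint.h>

/* Saturate a 64-bit value to the int16_t range. */
static inline int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Dot product with activations and weights in Q8.8 (assumption). */
int16_t dot_q8_8(const int16_t *act, const int16_t *wgt, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)act[i] * (int32_t)wgt[i];   /* Q16.16 partial sums */
    return sat16(acc >> 8);                         /* Q16.16 -> Q8.8 */
}
```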
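
The IoT object-detection entry reports 7 to 14 FPS "depending on the quantization". This sketch shows symmetric uniform quantization to a configurable signed bit width, the kind of knob such a core exposes; scale calibration is out of scope and the function names are illustrative.

```c
#include <math.h>
#include <stdint.h>

/* Quantise one value to a signed 'bits'-wide integer with step 'scale'. */
int32_t quantize(float x, float scale, int bits)
{
    int32_t qmax = (1 << (bits - 1)) - 1;
    int32_t q = (int32_t)lrintf(x / scale);
    if (q > qmax)      q = qmax;        /* clip to representable range */
    if (q < -qmax - 1) q = -qmax - 1;
    return q;
}

/* Dequantise, e.g. to measure the error a given bit width introduces. */
float dequantize(int32_t q, float scale) { return (float)q * scale; }
```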
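
For the hybrid-quantization entry, per-layer bit widths can be captured in a descriptor like the one below, so precision-sensitive layers keep more bits while others shrink. Field names and the example values are invented for illustration; the paper's actual per-layer choices are not in the abstract.

```c
/* Per-layer quantisation descriptor for hybrid quantization. */
typedef struct {
    int   weight_bits;    /* bits for this layer's weights      */
    int   act_bits;       /* bits for this layer's activations  */
    float weight_scale;   /* quantisation step for weights      */
    float act_scale;      /* quantisation step for activations  */
} layer_qconfig;

/* A hybrid profile might use wider words early and narrower words in
 * later layers (placeholder values, not measured). */
static const layer_qconfig qprofile[] = {
    { 8, 8, 0.5f / 128.0f,  4.0f / 128.0f },   /* conv1 */
    { 8, 6, 0.25f / 128.0f, 4.0f / 32.0f  },   /* conv2 */
    { 4, 6, 0.25f / 8.0f,   4.0f / 32.0f  },   /* conv3 */
};
```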
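
For the IOb-Cache entry, this is a software model of the write-through buffer idea: a CPU-side write completes in one "cycle" by entering a FIFO, and an independent controller drains the FIFO to back-end memory. It is a behavioral sketch, not IOb-Cache's Verilog; depth and names are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 16                       /* buffer depth (assumption) */
typedef struct { uint32_t addr, data; } wb_entry;
typedef struct {
    wb_entry fifo[WB_DEPTH];
    int head, tail, count;
} write_buffer;

/* Front end: accept a write in one cycle; false means the CPU must stall. */
bool wb_push(write_buffer *wb, uint32_t addr, uint32_t data)
{
    if (wb->count == WB_DEPTH) return false;
    wb->fifo[wb->tail] = (wb_entry){ addr, data };
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return true;
}

/* Back end: the independent controller drains one entry per memory transaction. */
bool wb_drain(write_buffer *wb, uint32_t *addr, uint32_t *data)
{
    if (wb->count == 0) return false;
    *addr = wb->fifo[wb->head].addr;
    *data = wb->fifo[wb->head].data;
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
    return true;
}
```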