Search Results
- Hyperspectral compressive sensing - a low power consumption approach. Publication. Nascimento, Jose; Véstias, Mário; Duarte, Rui. Hyperspectral imaging instruments allow data collection in hundreds of spectral bands for the same area on the surface of the Earth. The resulting multidimensional data cube typically comprises several GBs per flight. Due to the extremely large volumes of data collected by imaging spectrometers, hyperspectral data compression, dimensionality reduction and Compressive Sensing (CS) techniques have received considerable interest in recent years. These data are usually acquired by a satellite or an airborne instrument and sent to a ground station on Earth for subsequent processing. The bandwidth of the connection between the satellite/airborne platform and the ground station is usually limited, which restricts the amount of data that can be transmitted. As a result, there is a clear need for (either lossless or lossy) hyperspectral data compression techniques that can be applied on board the imaging instrument. This paper presents a study of the power consumption, execution time and accuracy of a parallel implementation of a spectral compressive acquisition method on a Jetson TX2 platform, which is well suited to vector operations such as dot products. The implementation exploits the architecture at a low level, using shared memory and coalesced memory accesses. The experiments demonstrate the applicability of these methods for onboard processing in terms of accuracy, execution time and power consumption. The results show that with this low-power GPU it is possible to obtain real-time performance with a very limited power requirement.
- A review of synthetic-aperture radar image formation algorithms and implementations: a computational perspective. Publication. Cruz, Helena; Véstias, Mário; Monteiro, J; Cláudio de Campos Neto, Horácio; Duarte, Rui. Designing synthetic-aperture radar (SAR) image formation systems can be challenging due to the numerous options of algorithms and devices that can be used. There are many SAR image formation algorithms, such as backprojection, matched-filter, polar format, Range–Doppler and chirp scaling algorithms. Each algorithm presents its own advantages and disadvantages in terms of efficiency and image quality; thus, we aim to introduce some of the most common SAR image formation algorithms and compare them based on these two aspects. Depending on the requirements of each individual system and implementation, there are many device options to choose from, for instance, FPGAs, GPUs, CPUs, many-core CPUs, and microcontrollers. We present a review of the state of the art of SAR imaging system implementations. We also compare such implementations in terms of power consumption, execution time, and image quality for the different algorithms used.
- Lite-CNN: a high-performance architecture to execute CNNs in low density FPGAs. Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio. Due to the computational complexity of Convolutional Neural Networks (CNNs), high performance platforms are generally considered for their execution. However, CNNs are very useful in embedded systems, and their execution right next to the source of data has many advantages, like avoiding the need for data communication. In this paper, we propose an architecture for CNN inference (Lite-CNN) that can achieve high performance in low density FPGAs. Lite-CNN adopts a fixed-point representation for both neurons and weights, which has already been shown to be sufficient for most CNNs. Also, with a simple and known dot product reorganization, the number of multiplications is halved. We show implementation results for 8-bit fixed-point in a ZYNQ7020 and extrapolate to other, larger FPGAs. Lite-CNN achieves 410 GOPs in a ZYNQ7020.
- Parallel dot-products for deep learning on FPGA. Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio. Deep neural networks have recently shown great results in a vast set of image applications. The associated deep learning models are computationally very demanding and, therefore, several hardware solutions have been proposed to accelerate their computation. FPGAs have recently shown very good performance for this kind of application and so are considered a promising platform to accelerate the execution of deep learning algorithms. A common operation in these algorithms is multiply-accumulate (MACC), which is used to calculate dot-products. Since many dot products can be calculated in parallel, as long as memory bandwidth is available, it is very important to implement this operation efficiently to increase the density of MACC units in an FPGA. In this paper, we propose an implementation of parallel MACC units in FPGA for dot-product operations with very high performance/area ratios using a mix of DSP blocks and LUTs. We consider 8-bit fixed-point representations, but the method can be applied to other bit widths. The method allows us to achieve TOPs performance, even for low cost FPGAs.
- A full featured configurable accelerator for object detection with YOLO. Publication. Pestana, Daniel; Miranda, Pedro R.; Lopes, João D.; Duarte, Rui; Véstias, Mário; Neto, Horácio C; De Sousa, Jose. Object detection and classification is an essential task of computer vision. A very efficient algorithm for detection and classification is YOLO (You Only Look Once). We consider hardware architectures to run YOLO in real-time on embedded platforms. Designing a new dedicated accelerator for each new version of YOLO is not feasible given the fast delivery of new versions. This work's primary goal is to design a configurable and scalable core for creating specific object detection and classification systems based on YOLO, targeting embedded platforms. The core accelerates the execution of all the algorithm steps, including pre-processing, model inference and post-processing. It considers a fixed-point format, linearised activation functions, batch-normalisation, folding, and a hardware structure that exploits most of the available parallelism in CNN processing. The proposed core is configured for real-time execution of YOLOv3-Tiny and YOLOv4-Tiny, integrated into a RISC-V-based system-on-chip architecture and prototyped in an UltraScale XCKU040 FPGA (Field Programmable Gate Array). The solution achieves a performance of 32 and 31 frames per second for YOLOv3-Tiny and YOLOv4-Tiny, respectively, with a 16-bit fixed-point format. Compared to previous proposals, it improves the frame rate at a higher performance efficiency. The performance, area efficiency and configurability of the proposed core enable the fast development of real-time YOLO-based object detectors on embedded systems.
- A fast and scalable architecture to run convolutional neural networks in low density FPGAs. Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Neto, Horácio C. Deep learning and, in particular, convolutional neural networks (CNNs) achieve very good results in several computer vision applications, like security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and are therefore usually implemented in high-performance computing platforms or devices. Running CNNs in embedded platforms or devices with low computational and memory resources requires a careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field Programmable Gate Arrays (FPGAs) can achieve this efficiency since the programmable hardware fabric can be tailored for each specific network. In this paper, a very efficient configurable architecture for CNN inference targeting FPGAs of any density is described. The architecture uses fixed-point arithmetic and image batching to reduce computational, memory and memory bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs in any FPGA device, including those with small on-chip memory and few logic resources. With the proposed architecture, AlexNet inference on one image takes 4.3 ms in a ZYNQ7020 and 1.2 ms in a ZYNQ7045.
- A configurable architecture for running hybrid convolutional neural networks in low-density FPGAs. Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio. Convolutional neural networks have become the state of the art of machine learning for a vast set of applications, especially for image classification and object detection. There are several advantages to running inference on these models at the edge, including real-time performance and data privacy. The high computing and memory requirements of convolutional neural networks have been major obstacles to the broader deployment of CNNs on edge devices. Data quantization is an optimization method that reduces the number of bits used to represent weights and activations of a network model, minimizing storage requirements and computing complexity. Quantization can be applied at the layer level, by using different bit widths in different layers: this is called hybrid quantization. This article proposes a new efficient and configurable architecture for running CNNs with hybrid quantization in low-density Field-Programmable Gate Arrays (FPGAs) targeting edge devices. The architecture has been implemented on the Xilinx ZYNQ7020/45 devices and is running the AlexNet and VGG16 networks. Running AlexNet, the architecture has a throughput of up to 508 images per second on the ZYNQ7020 device, and 1639 images per second on the ZYNQ7045 device. Considering VGG16, the architecture delivers up to 43 images per second on the ZYNQ7020 device, and 81 images per second on the ZYNQ7045 device. The proposed hybrid architecture achieves up to a 13.7x improvement in performance compared to state-of-the-art solutions, with small accuracy degradation.
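
The last abstract describes hybrid quantization as using different bit widths in different layers. The following is a minimal sketch of that idea only, not the method from the paper: it assumes symmetric uniform quantization with one scale per layer, and the layer names, weights and bit widths (`conv1`, `fc1`, 8 and 4 bits) are hypothetical values chosen for illustration.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed integer codes (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0   # real value represented by one code step
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def hybrid_quantize(layers, bit_widths):
    """Quantize each layer with its own bit width (the 'hybrid' part)."""
    return {name: quantize(w, bit_widths[name]) for name, w in layers.items()}


# Hypothetical weights and per-layer bit widths, not taken from the paper.
layers = {
    "conv1": [0.50, -0.25, 0.125, -1.0],
    "fc1": [0.9, -0.3, 0.05, 0.6],
}
bit_widths = {"conv1": 8, "fc1": 4}

quantized = hybrid_quantize(layers, bit_widths)
for name, (codes, scale) in quantized.items():
    restored = dequantize(codes, scale)
    err = max(abs(a - b) for a, b in zip(layers[name], restored))
    print(f"{name}: {bit_widths[name]} bits, codes={codes}, max error={err:.4f}")
```

Under these assumptions, a layer given fewer bits stores smaller codes but has a coarser step size, so its worst-case rounding error (half a step) is larger. That is the storage versus accuracy trade-off the abstract refers to.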