Loading...
31 results
Search Results
Now showing 1 - 10 of 31
- Hyperspectral compressive sensing - a low power consumption approachPublication . Nascimento, Jose; Véstias, Mário; Duarte, RuiHyperspectral imaging instruments allow data collection in hundreds of spectral bands for the same area on the surface of the Earth. The resulting multidimensional data cube typically comprises several GBs per light. Due to the extremely large volumes of data collected by imaging spectrometers, hyperspectral data compression, dimensionality reduction and Compressive Sensing (CS) techniques has received considerable interest in recent years. These data are usually acquired by a satellite or an airbone instrument and sent to a ground station on Earth for subsequent processing. Usually the bandwidth connection between the satellite/airborne platform and the ground station is reduced, which limits the amount of data that can be transmitted. As a result, there is a clear need for (either lossless or lossy) hyperspectral data compression techniques that can be applied on-board the imaging instrument. This paper, presents a study of the power and time consumption and accuracy of a parallel implementation for a spectral compressive acquisition method on a Jetson TX2 platform, which is well suited to perform vector operations such as dot products. This implementation exploits the architecture at low level, using shared memory and coalesced accesses to memory. The conducted experiments have been performed to demonstrate the applicability, in terms of accuracy, time consuming and power consumption of these methods for onboard processing. The results show that by using this low power consumption GPU is it possible to obtain real-time performance with a very limited power requirement.
- Low energy heterogeneous computing with multiple RISC-V and CGRA coresPublication . Fiolhais, Luís; Gonçalves, Fernando; Duarte, Rui P.; Véstias, Mário; Sousa, Jose T. DeThe idea of combining multiple CPU and CGRA cores is not in itself original but detailed characterizations of such architectures and measurements on compelling applications are difficult to find in the literature. Although commercial CPUs, GPUs and FPGAs are widely available, there are no commercial CGRAs, which may be attributed to the lack of metrics on performance, energy and cost. In this paper, we introduce a heterogeneous computing platform consisting of several RISC-V CPU and Versat CGRA cores. Implementation results for several instances of the architecture are presented. The CPU of choice is the promising open source RISC-V architecture, which has never been featured in CPU/CGRA architectures. This paper presents independent implementations of two RISC-V cores: a minimal one, useful as a simple controller, and a more performant 5-stage pipeline implementation. The RISC-V cores have been designed using the recent Chisel HDL, useful for automating tasks pertaining to the writing of RTL. The selected CGRA is the published Versat architecture, for which 4 different instances have been created. Implementation results for 2 FPGA families and ASIC technology nodes are presented: area, frequency and power. Applications cover digital audio and machine learning, demonstrating the versatility of the proposed platform at competitive area, frequency and energy footprints.
- XtokaxtikoX: a stochastic computing-based autonomous cyber-physical systemPublication . Duarte, Rui Policarpo; Neto, Horácio; Véstias, MárioThis paper presents XtokaxtikoX, a fully autonomous cyber-physical system employing only stochastic arithmetic to perform computations on its data-path. Traditional implementations of stochastic computing systems benefit from fast and compact implementation of arithmetic operators, and high tolerance to errors, but depend heavily on the conversion between stochastic bitstreams and binary to implement many parts of the system. Furthermore, if a system requires any interaction with analog electronic components it must have additional ADC/DAC conversion circuitry, which further increases the complexity of the system. Conversely, the proposed work is able to directly translate analog signals into stochastic bitstreams, process the stochastic bitstreams and finally control analog actuators relying only on the information on the stochastic bitstreams. Details on the architectures to accomplish such functionality are presented as well as other stochastic arithmetic units. This paper also presents a small stochastic computing-based autonomous cyber-physical system implemented on a Cyclone IV FPGA to carry out a proof-of-concept.
- Hybrid dot-product calculation for convolutional neural networks in FPGAPublication . Véstias, Mário; Duarte, Rui Policarpo; De Sousa, Jose; Cláudio de Campos Neto, HorácioConvolutional Neural Networks (CNN) are quite useful in edge devices for security, surveillance, and many others. Running CNNs in embedded devices is a design challenge since these models require high computing power and large memory storage. Data quantization is an optimization technique applied to CNN to reduce the computing and memory requirements. The method reduces the number of bits used to represent weights and activations, which consequently reduces the size of operands and of the memory. The method is more effective if hybrid quantization is considered in which data in different layers may have different bit widths. This article proposes a new hardware module to calculate dot-products of CNNs with hybrid quantization. The module improves the implementation of CNNs in low density FPGAs, where the same module runs dot-products of different layers with different data quantizations. We show implementation results in ZYNQ7020 and compare with state-of-the-art works. Improvements in area and performance are achieved with the new proposed module.
- Stochastic theater: stochastic datapath generation framework for fault-tolerant IoT sensorsPublication . Duarte, Rui Policarpo; Véstias, Mário; Carvalho, Carlos; Casaleiro, JoãoStochastic Computing has emerged as a competitive computing paradigm that produces fast and simple implementations of arithmetic operations, while offering high levels of parallelism, and graceful degradation of the results when in the presence of errors. IoT devices are often operate under limited power and area constraints and subjected to harsh environments, for which, traditional computing paradigms struggle to provide high availability and fault-tolerance. Stochastic Computing is based on the computation of pseudo-random sequences of bits, hence requiring only a single bit per signal, rather than a data-bus. Notwithstanding, we haven’t witnessed its inclusion in custom computing systems. In this direction, this work presents Stochastic Theater, a framework to specify, simulate, and test Stochastic Datapaths to perform computations using stochastic bitstreams targeting IoT systems. In virtue of the granularity of the bitstreams, the bit-level specification of circuits, high-performance characteristics and reconfigurable capabilities, FPGAs were adopted to implement and test such systems. The proposed framework creates Stochastic Machines from a set of user defined arithmetic expressions, and then tests them with the corresponding input values and specific fault injection patterns. Besides the support to create autonomous Stochastic Computing systems, the presented framework also provides generation of stochastic units, being able to produce estimates on performance, resources and power. A demonstration is presented targeting KLT, typical method for data compression in IoT applications.
- Efficient Implementation Of A Single-Precision Floating-Point Arithmetic Unit on FPGAPublication . José, Wilson; Silva, Ana Rita; Neto, Horácio; Véstias, MárioThis paper presents a single precision floating point arithmetic unit with support for multiplication, addition, fused multiply-add, reciprocal, square-root and inverse squareroot with high-performance and low resource usage. The design uses a piecewise 2nd order polynomial approximation to implement reciprocal, square-root and inverse square-root. The unit can be configured with any number of operations and is capable to calculate any function with a throughput of one operation per cycle. The floatingpoint multiplier of the unit is also used to implement the polynomial approximation and the fused multiply-add operation. We have compared our implementation with other state-of-the-art proposals, including the Xilinx Core-Gen operators, and conclude that the approach has a high relative performance/area efficiency. © 2014 Technical University of Munich (TUM).
- System-on-chip field-programmable gate array design for onboard real-time hyperspectral unmixingPublication . Nascimento, Jose; Véstias, MárioHyperspectral instruments have been incorporated in satellite missions, providing large amounts of data of high spectral resolution of the Earth surface. This data can be used in remote sensing applications that often require a real-time or near-real-time response. To avoid delays between hyperspectral image acquisition and its interpretation, the last usually done on a ground station, onboard systems have emerged to process data, reducing the volume of information to transfer from the satellite to the ground station. For this purpose, compact reconfigurable hardware modules, such as field-programmable gate arrays (FPGAs), are widely used. This paper proposes an FPGA-based architecture for hyperspectral unmixing. This method based on the vertex component analysis (VCA) and it works without a dimensionality reduction preprocessing step. The architecture has been designed for a low-cost Xilinx Zynq board with a Zynq-7020 system-on-chip FPGA-based on the Artix-7 FPGA programmable logic and tested using real hyperspectral data. Experimental results indicate that the proposed implementation can achieve real-time processing, while maintaining the methods accuracy, which indicate the potential of the proposed platform to implement high-performance, low-cost embedded systems, opening perspectives for onboard hyperspectral image processing.
- Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplicationPublication . José, Wilson M.; Silva, Ana Rita; Véstias, Mário; Neto, HorácioRecent integrated circuit technologies have opened the possibility to design parallel architectures with hundreds of cores on a single chip. The design space of these parallel architectures is huge with many architectural options. Exploring the design space gets even more difficult if, beyond performance and area, we also consider extra metrics like performance and area efficiency, where the designer tries to design the architecture with the best performance per chip area and the best sustainable performance. In this paper we present an algorithm-oriented approach to design a many-core architecture. Instead of doing the design space exploration of the many core architecture based on the experimental execution results of a particular benchmark of algorithms, our approach is to make a formal analysis of the algorithms considering the main architectural aspects and to determine how each particular architectural aspect is related to the performance of the architecture when running an algorithm or set of algorithms. The architectural aspects considered include the number of cores, the local memory available in each core, the communication bandwidth between the many-core architecture and the external memory and the memory hierarchy. To exemplify the approach we did a theoretical analysis of a dense matrix multiplication algorithm and determined an equation that relates the number of execution cycles with the architectural parameters. Based on this equation a many-core architecture has been designed. The results obtained indicate that a 100 mm(2) integrated circuit design of the proposed architecture, using a 65 nm technology, is able to achieve 464 GFLOPs (double precision floating-point) for a memory bandwidth of 16 GB/s. This corresponds to a performance efficiency of 71 %. Considering a 45 nm technology, a 100 mm(2) chip attains 833 GFLOPs which corresponds to 84 % of peak performance These figures are better than those obtained by previous many-core architectures, except for the area efficiency which is limited by the lower memory bandwidth considered. The results achieved are also better than those of previous state-of-the-art many-cores architectures designed specifically to achieve high performance for matrix multiplication.
- Lite-CNN: a high-performance architecture to execute CNNs in low density FPGAsPublication . Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, HorácioDue to the computational complexity of Convolutional Neural Networks (CNNs), high performance platforms are generally considered for their execution. However, CNNs are very useful in embedded systems and its execution right next to the source of data has many advantages, like avoiding the need for data communication. In this paper, we propose an architecture for CNN inference (Lite-CNN) that can achieve high performance in low density FPGAs. Lite-CNN adopts a fixed-point representation for both neurons and weights, which was already shown to be sufficient for most CNNs. Also, with a simple and known dot product reorganization, the number of multiplications is reduced to half. We show implementation results for 8 bit fixed-point in a ZYNQ7020 and extrapolate for other larger FPGAs. Lite-CNN achieves 410 GOPs in a ZYNQ7020.
- A many-core co-processor for embedded parallel computing on FPGAPublication . José, Wilson; Neto, Horácio; Véstias, MárioSingle processor architectures are unable to provide the required performance of high performance embedded systems. Parallel processing based on general-purpose processors can achieve these performances with a considerable increase of required resources. However, in many cases, simplified optimized parallel cores can be used instead of general-purpose processors achieving better performance at lower resource utilization. In this paper, we propose a configurable many-core architecture to serve as a co-processor for high-performance embedded computing on Field-Programmable Gate Arrays. The architecture consists of an array of configurable simple cores with support for floating-point operations interconnected with a configurable interconnection network. For each core it is possible to configure the size of the internal memory, the supported operations and number of interfacing ports. The architecture was tested in a ZYNQ-7020 FPGA in the execution of several parallel algorithms. The results show that the proposed many-core architecture achieves better performance than that achieved with a parallel generalpurpose processor and that up to 32 floating-point cores can be implemented in a ZYNQ-7020 SoC FPGA.