Strategic Project - LA 21 - 2013-2014

Financiador

Organização

Publicações

Efficient Implementation Of A Single-Precision Floating-Point Arithmetic Unit on FPGA

Publication . José, Wilson; Silva, Ana Rita; Neto, Horácio; Véstias, Mário

This paper presents a single precision floating point arithmetic unit with support for multiplication, addition, fused multiply-add, reciprocal, square-root and inverse squareroot with high-performance and low resource usage. The design uses a piecewise 2nd order polynomial approximation to implement reciprocal, square-root and inverse square-root. The unit can be configured with any number of operations and is capable to calculate any function with a throughput of one operation per cycle. The floatingpoint multiplier of the unit is also used to implement the polynomial approximation and the fused multiply-add operation. We have compared our implementation with other state-of-the-art proposals, including the Xilinx Core-Gen operators, and conclude that the approach has a high relative performance/area efficiency. © 2014 Technical University of Munich (TUM).

2014-10Documento de conferência

Acesso restrito

Ver mais

Unified transform architecture for AVC, AVS, VC-1 and HEVC high-performance codecs

Publication . Dias, Tiago; Roma, Nuno; Sousa, Leonel

A unified architecture for fast and efficient computation of the set of two-dimensional (2-D) transforms adopted by the most recent state-of-the-art digital video standards is presented in this paper. Contrasting to other designs with similar functionality, the presented architecture is supported on a scalable, modular and completely configurable processing structure. This flexible structure not only allows to easily reconfigure the architecture to support different transform kernels, but it also permits its resizing to efficiently support transforms of different orders (e. g. order-4, order-8, order-16 and order-32). Consequently, not only is it highly suitable to realize high-performance multi-standard transform cores, but it also offers highly efficient implementations of specialized processing structures addressing only a reduced subset of transforms that are used by a specific video standard. The experimental results that were obtained by prototyping several configurations of this processing structure in a Xilinx Virtex-7 FPGA show the superior performance and hardware efficiency levels provided by the proposed unified architecture for the implementation of transform cores for the Advanced Video Coding (AVC), Audio Video coding Standard (AVS), VC-1 and High Efficiency Video Coding (HEVC) standards. In addition, such results also demonstrate the ability of this processing structure to realize multi-standard transform cores supporting all the standards mentioned above and that are capable of processing the 8k Ultra High Definition Television (UHDTV) video format (7,680 x 4,320 at 30 fps) in real time.

2014-07Artigo científico

Acesso restrito

Ver mais

ROM-less RNS-to-binary converter moduli {22N − 1, 22N + 1, 2N − 3, 2N + 3}

Publication . Miguens Matutino, Pedro; Chaves, Ricardo; Sousa, Leonel

In this paper, a novel ROM-less RNS-to-binary converter is proposed, using a new balanced moduli set {22n-1, 22n + 1, 2n-3, 2n + 3} for n even. The proposed converter is implemented with a two stage ROM-less approach, which computes the value of X based only in arithmetic operations, without using lookup tables. Experimental results for 24 to 120 bits of Dynamic Range, show that the proposed converter structure allows a balanced system with 20% faster arithmetic channels regarding the related state of the art, while requiring similar area resources. This improvement in the channel's performance is enough to offset the higher conversion costs of the proposed converter. Furthermore, up to 20% better Power-Delay-Product efficiency metric can be achieved for the full RNS architecture using the proposed moduli set. © 2014 IEEE.

2014Documento de conferência

Acesso restrito

Ver mais

Trends Of CPU, GPU and FPGA for high-performance computing

Publication . Véstias, Mário; Neto, Horácio

Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGA). General-Purpose Graphics Processing Units (GPGPU) and recent many-core CPUs have also taken advantage of the recent technological innovations in integrated circuit (IC) design and had also dramatically improved their peak performances. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performances, for particular applications, show that FPGAs are increasing the gap to GPUs and many-core CPUs moving them away from high-performance computing with intensive floating-point calculations. FPGAs become competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and parallel map-reduce problems. © 2014 Technical University of Munich (TUM).

2014-10Documento de conferência

Acesso restrito

Ver mais

Entidade financiadora

Fundação para a Ciência e a Tecnologia

Programa de financiamento

6817 - DCRRNI ID

Número da atribuição

PEst-OE/EEI/LA0021/2013