Browsing by Author "Sousa, Leonel"
- An efficient scalable RNS architecture for large dynamic ranges
  Publication. Miguens Matutino, Pedro; Chaves, Ricardo; Sousa, Leonel
  This paper proposes an efficient scalable Residue Number System (RNS) architecture supporting moduli sets with an arbitrary number of channels, enabling a larger dynamic range and a higher level of parallelism. The proposed architecture supports forward and reverse RNS conversion by reusing the arithmetic channel units. The arithmetic operations supported at the channel level include addition, subtraction, and multiplication with accumulation capability. For the reverse conversion, two algorithms are considered, one based on the Chinese Remainder Theorem and the other on Mixed-Radix Conversion, leading to implementations optimized for delay and required circuit area. With the proposed architecture, a complete and compact RNS platform is achieved. Experimental results suggest gains of 17% in the delay of the arithmetic operations, with an area reduction of 23% relative to the RNS state of the art. Compared with a binary system, the proposed architecture performs the same computation 20 times faster while using only 10% of the circuit area.
- Arithmetic-based binary-to-RNS converter modulo {2^n ± k} for jn-bit dynamic range
  Publication. Miguens Matutino, Pedro; Chaves, Ricardo; Sousa, Leonel
  In this brief, a read-only-memoryless structure for binary-to-residue number system (RNS) conversion modulo {2^n ± k} is proposed. This structure is based only on adders and constant multipliers. This brief is motivated by the existing {2^n ± k} binary-to-RNS converters, which are particularly inefficient for larger values of n. The experimental results obtained for 4n and 8n bits of dynamic range suggest that the proposed conversion structures are able to significantly improve the forward conversion efficiency, with an AT metric improvement above 100% relative to the related state of the art. Delay improvements of 2.17 times with only a 5% area increase can be achieved if a proper selection of the {2^n ± k} moduli is performed.
- A Compact and Scalable RNS Architecture
  Publication. Miguens Matutino, Pedro; Chaves, Ricardo; Sousa, Leonel
  This paper proposes a unified architecture for designing Residue Number System (RNS) based processors for moduli sets with an arbitrary number of channels. Recently, new RNS moduli sets have been proposed in order to increase the dynamic range and reduce the width of the channels. The proposed architecture allows designing forward and reverse RNS converters, as well as the arithmetic operators of each modulo channel. The forward and reverse conversions are implemented using channel arithmetic units, resulting in a very compact architecture. Moreover, the arithmetic operations supported at the channel level include addition, subtraction, and multiplication with accumulation capability. The presented results suggest that the proposed RNS architecture leads to compact and scalable implementations, with competitive, or even better, performance when compared with the related state of the art, considering fixed moduli sets. Experimental results suggest gains of 17% in the delay of arithmetic operations, with an area reduction of 23% regarding the state of the art.
- Fully parameterizable VLSI architecture for sub-pixel motion estimation with low memory bandwidth requirements
  Publication. Dias, Tiago; Roma, Nuno; Sousa, Leonel
  This paper proposes a new scalable and efficient VLSI type-II architecture for real-time motion estimation optimized for subpel refinement algorithms. Based on the proposed architecture, which provides minimum latency, maximum throughput, and full utilization of the hardware resources, the implementation of a dedicated motion estimation coprocessor is also presented in this paper. This circuit is characterized by low memory bandwidth requirements, a modular and highly flexible structure and is capable of estimating motion vectors with half-pixel accuracy using the bilinear interpolation algorithm. Experimental results for implementations on ASIC and FPGA devices show that by using the proposed architecture it is possible to estimate motion vectors up to the 16CIF image format in real-time, with any given sub-pixel accuracy.
- Hardware/software co-design of H.264/AVC encoders for multi-core embedded systems
  Publication. Dias, Tiago; Roma, Nuno; Sousa, Leonel
  This paper presents a multi-core H.264/AVC encoder suitable for implementations in small and medium complexity embedded systems. The proposed structure results from an efficient hardware/software co-design methodology, where the encoder software application is highly optimized and structured in a very modular and efficient manner, so as to allow its most complex and time consuming operations to be offloaded to dedicated hardware accelerators. The considered methodology adopts a simple and efficient core interconnection mechanism to easily allow the inclusion and the removal of such optimized processing cores. Experimental results obtained with the implementation in a Virtex4 FPGA of an H.264/AVC encoder using an ASIP IP core as a ME hardware accelerator have proven the advantages of this methodology. For the considered system, speedup factors greater than 15 were obtained with a very modest increase of the involved hardware resources.
- High performance IP core for HEVC quantization
  Publication. Dias, Tiago; Roma, Nuno; Sousa, Leonel
  A new class of quantization architectures suitable for the realization of high performance and hardware efficient forward, inverse and unified quantizers for HEVC is presented. The proposed structures are based on a highly flexible and optimized integer datapath that can be configured to provide several pipelined and non-pipelined implementations, offering distinct trade-offs between performance and hardware cost, which makes them highly suitable for most video coding application domains. The experimental results obtained using a 90 nm CMOS process show that the proposed class of quantization architectures is able to process 4k UHDTV video sequences in real-time (3840 x 2160 @ 30fps), with a power consumption as low as 3.9 mW when the unified architecture is operated at 374 MHz.
- High Performance Multi-Standard Architecture for DCT Computation in H.264/AVC High Profile and HEVC Codecs
  Publication. Dias, Tiago; Roma, Nuno; Sousa, Leonel
  A new high performance architecture for the computation of all the DCT operations adopted in the H.264/AVC and HEVC standards is proposed in this paper. In contrast to other dedicated transform cores, the presented multi-standard transform architecture is based on a completely configurable, scalable and unified structure that is able to compute not only the forward and the inverse 8×8 and 4×4 integer DCTs and the 4×4 and 2×2 Hadamard transforms defined in the H.264/AVC standard, but also the 4×4, 8×8, 16×16 and 32×32 integer transforms adopted in HEVC. Experimental results obtained using a Xilinx Virtex-7 FPGA demonstrated the superior performance and hardware efficiency levels provided by the proposed structure, which outperforms its more prominent related designs by at least 1.8 times. When integrated in a multi-core embedded system, this architecture allows the computation, in real-time, of all the transforms mentioned above for resolutions as high as the 8k Ultra High Definition Television (UHDTV) (7680×4320 @ 30fps).
- High throughput and scalable architecture for unified transform coding in embedded H.264/AVC video coding systems
  Publication. Dias, Tiago; Lopez, Sebastian; Roma, Nuno; Sousa, Leonel
  An innovative high throughput and scalable multi-transform architecture for H.264/AVC is presented in this paper. This structure can be used as a hardware accelerator in modern embedded systems to efficiently compute the 4×4 forward/inverse integer DCT, as well as the 2-D 4×4 / 2×2 Hadamard transforms. Moreover, its highly flexible design and hardware efficiency allow it to be easily scaled in terms of performance and hardware cost to meet the specific requirements of any given video coding application. Experimental results obtained using a Xilinx Virtex-4 FPGA demonstrate the superior performance and hardware efficiency levels provided by the proposed structure, which presents a throughput per unit of area at least 1.8× higher than other similar recently published designs. Furthermore, these results also show that this architecture can compute, in real time, all the above-mentioned H.264/AVC transforms for video sequences with resolutions up to UHDV.
- Pipelined FPGA coprocessor for elliptic curve cryptography based on residue number system
  Publication. Miguens Matutino, Pedro; Araújo, Juvenal; Sousa, Leonel; Chaves, Ricardo
  In this paper, a novel pipelined FPGA coprocessor for ECC is proposed, exploiting the parallelism capabilities of RNS for the computation of large operand algorithms. This intrinsic characteristic of representing large integer numbers as a set of smaller and independent values allows for the parallelization of the computationally heavy large operand multiplications required in asymmetrical cryptographic algorithms. Towards a compact and performance efficient design, the RNS coprocessor supports a single highly pipelined multi-modulo arithmetic unit. Implementation results on FPGA for this RNS based ECC coprocessor suggest one of the smallest programmable designs, with a proportional performance when compared with the related state of the art. Additionally, the resulting architecture allows for the computation of varying key sizes without changing the design or its implementation.
- ROM-less RNS-to-binary converter moduli {2^(2n) − 1, 2^(2n) + 1, 2^n − 3, 2^n + 3}
  Publication. Miguens Matutino, Pedro; Chaves, Ricardo; Sousa, Leonel
  In this paper, a novel ROM-less RNS-to-binary converter is proposed, using a new balanced moduli set {2^(2n) − 1, 2^(2n) + 1, 2^n − 3, 2^n + 3} for n even. The proposed converter is implemented with a two stage ROM-less approach, which computes the value of X based only on arithmetic operations, without using lookup tables. Experimental results for 24 to 120 bits of dynamic range show that the proposed converter structure allows a balanced system with 20% faster arithmetic channels relative to the related state of the art, while requiring similar area resources. This improvement in the channels' performance is enough to offset the higher conversion costs of the proposed converter. Furthermore, up to 20% better Power-Delay-Product efficiency can be achieved for the full RNS architecture using the proposed moduli set. © 2014 IEEE.
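
Several of the abstracts above rely on the same three RNS steps: forward conversion (binary to residues), independent per-channel arithmetic, and reverse conversion by either the Chinese Remainder Theorem or Mixed-Radix Conversion. The following is a minimal software sketch of those steps, not the hardware architectures described in the papers. It assumes Python 3.8+ and uses the moduli set from the last entry with n = 4; all function names are hypothetical and chosen only for illustration.

```python
from math import gcd, prod

def moduli_set(n):
    """Moduli {2^(2n)-1, 2^(2n)+1, 2^n-3, 2^n+3}; the abstract states n is even."""
    return [2 ** (2 * n) - 1, 2 ** (2 * n) + 1, 2 ** n - 3, 2 ** n + 3]

def to_rns(x, moduli):
    """Forward conversion: one residue per channel."""
    return [x % m for m in moduli]

def rns_mul(a, b, moduli):
    """Channel-wise multiplication; each channel is independent of the others."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]

def from_rns_crt(residues, moduli):
    """Reverse conversion using the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M

def from_rns_mrc(residues, moduli):
    """Reverse conversion using Mixed-Radix Conversion."""
    digits = []
    for i, (r, m) in enumerate(zip(residues, moduli)):
        v = r
        # Peel off the already-known mixed-radix digits, one modulus at a time.
        for j in range(i):
            v = ((v - digits[j]) * pow(moduli[j], -1, m)) % m
        digits.append(v)
    # x = d0 + d1*m0 + d2*m0*m1 + ...
    x, weight = 0, 1
    for d, m in zip(digits, moduli):
        x += d * weight
        weight *= m
    return x

moduli = moduli_set(4)  # [255, 257, 13, 19]
# Both reverse conversions require pairwise coprime moduli.
assert all(gcd(a, b) == 1 for i, a in enumerate(moduli) for b in moduli[i + 1:])

x, y = 123456, 97       # x * y must stay below the dynamic range prod(moduli)
product = rns_mul(to_rns(x, moduli), to_rns(y, moduli), moduli)
print(from_rns_crt(product, moduli) == x * y)
print(from_rns_mrc(product, moduli) == x * y)
```

The CRT form needs one wide reduction modulo the full dynamic range, whereas the mixed-radix form uses only channel-sized modular operations but is inherently sequential; this is one way to read the delay versus area trade-off mentioned in the first abstract.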