Loading...
4 results
Search Results
Now showing 1 - 4 of 4
- Implementing CNNs using a linear array of full mesh CGRAsPublication . Mário, Valter; Lopes, João D.; Véstias, Mário; De Sousa, JoseThis paper presents an implementation of a Convolutional Neural Network (CNN) algorithm using a linear array of full mesh dynamically and partially reconfigurable Coarse Grained Reconfigurable Arrays (CGRAs). Accelerating CNNs using GPUs and FPGAs is more common and there are few works that address the topic of CNN acceleration using CGRAs. Using CGRAs can bring size and power advantages compared to GPUs and FPGAs. The contribution of this paper is to study the performance of full mesh dynamically and partially reconfigurable CGRAs for CNN acceleration. The CGRA used is an improved version of the previously published Versat CGRA, adding multi CGRA core support and pre-silicon configurability. The results show that the proposed CGRA is as easy to program as the original full mesh Versat CGRA, and that its performance and power consumption scale linearly with the number of instances.
- K-means clustering on CGRAPublication . Lopes, João D.; De Sousa, Jose; Neto, Horácio; Véstias, MárioIn this paper we present a k-means clustering algorithm for the Versat architecture, a small and low power Coarse Grained Reconfigurable Array (CGRA). This algorithm targets ultra low energy devices where using a GPU or FPGA accelerator is out of the question. The Versat architecture has been enhanced with pointer support, the possibility of using the address generators for general purposes, and cumulative and conditional operations for the ALUs. The algorithm is based on two hardware datapaths for the two basic steps of the algorithm: the assignment and the update steps. The program is fully parameterizable with the number of datapoints, centroids, coordinates, and memory pointers for reading and writing the data. The execution time scales linearly with the number of datapoints, centers or dimensions. The results show that the new Versat core is 9.4x smaller than an ARM Cortex A9 core, runs the algorithm 3.8x faster and consumes 46.3x less energy.
- A full featured configurable accelerator for object detection with YOLOPublication . Pestana, Daniel; Miranda, Pedro R.; Lopes, João D.; Duarte, Rui; Véstias, Mário; Neto, Horácio C; De Sousa, JoseObject detection and classification is an essential task of computer vision. A very efficient algorithm for detection and classification is YOLO (You Look Only Once). We consider hardware architectures to run YOLO in real-time on embedded platforms. Designing a new dedicated accelerator for each new version of YOLO is not feasible given the fast delivery of new versions. This work's primary goal is to design a configurable and scalable core for creating specific object detection and classification systems based on YOLO, targeting embedded platforms. The core accelerates the execution of all the algorithm steps, including pre-processing, model inference and post-processing. It considers a fixed-point format, linearised activation functions, batch-normalisation, folding, and a hardware structure that exploits most of the available parallelism in CNN processing. The proposed core is configured for real-time execution of YOLOv3-Tiny and YOLOv4-Tiny, integrated into a RISC-V-based system-on-chip architecture and prototyped in an UltraScale XCKU040 FPGA (Field Programmable Gate Array). The solution achieves a performance of 32 and 31 frames per second for YOLOv3-Tiny and YOLOv4-Tiny, respectively, with a 16-bit fixed-point format. Compared to previous proposals, it improves the frame rate at a higher performance efficiency. The performance, area efficiency and configurability of the proposed core enable the fast development of real-time YOLO-based object detectors on embedded systems.
- IOb-Cache: a high-performance configurable open-source cachePublication . Roque, João, V.; Lopes, João D.; Véstias, Mário; De Sousa, Josesorts of open-source implementations of peripherals and other system-on-chip modules. Despite the recent advent of open-source hardware, the available open-source caches have low configurability, limited lack of support for single-cycle pipelined memory accesses, and use non-standard hardware interfaces. In this paper, the IObundle cache (IOb-Cache), a high-performance configurable open-source cache is proposed, developed and deployed. The cache has front-end and back-end modules for fast integration with processors and memory controllers. The front-end module supports the native interface, and the back-end module supports the native interface and the standard Advanced eXtensible Interface (AXI). The cache is highly configurable in structure and access policies. The back-end can be configured to read bursts of multiple words per transfer to take advantage of the available memory bandwidth. To the best of our knowledge, IOb-Cache is currently the only configurable cache that supports pipelined Central Processing Unit (CPU) interfaces and AXI memory bus interface. Additionally, it has a write-through buffer and an independent controller for fast, most of the time 1-cycle writing together with 1-cycle reading, while previous works only support 1-cycle reading. This allows the best clocks-per-Instruction (CPI) to be close to one (1.055). IOb-Cache is integrated into IOb System-on-Chip (IOb-SoC) Github repository, which has 29 stars and is already being used in 50 projects (forks).