Browsing by Issue Date, starting with "2017-10-05"
- Parallel dot-products for deep learning on FPGA
  Publication. Véstias, Mário; Duarte, Rui; De Sousa, Jose; Cláudio de Campos Neto, Horácio
  Deep neural networks have recently shown great results in a wide range of image applications. The associated deep learning models are computationally very demanding, so several hardware solutions have been proposed to accelerate them. FPGAs have recently shown very good performance for this kind of application and are therefore considered a promising platform for accelerating deep learning algorithms. A common operation in these algorithms is multiply-accumulate (MACC), which is used to calculate dot-products. Since many dot-products can be calculated in parallel, as long as memory bandwidth is available, it is very important to implement this operation efficiently in order to increase the density of MACC units in an FPGA. In this paper, we propose an implementation of parallel MACC units in FPGA for dot-product operations with very high performance/area ratios, using a mix of DSP blocks and LUTs. We consider 8-bit fixed-point representations, but the method can be applied to other bit widths. The method allows us to achieve TOPS-range performance, even on low-cost FPGAs.
- K-means clustering on CGRA
  Publication. Lopes, João D.; De Sousa, Jose; Neto, Horácio; Véstias, Mário
  In this paper we present a k-means clustering algorithm for the Versat architecture, a small, low-power coarse-grained reconfigurable array (CGRA). The algorithm targets ultra-low-energy devices where using a GPU or FPGA accelerator is out of the question. The Versat architecture has been enhanced with pointer support, the possibility of using the address generators for general purposes, and cumulative and conditional operations for the ALUs. The algorithm is based on two hardware datapaths, one for each of its two basic steps: the assignment step and the update step. The program is fully parameterizable in the number of datapoints, centroids, and coordinates, and in the memory pointers for reading and writing the data. The execution time scales linearly with the number of datapoints, centroids, or dimensions. The results show that the new Versat core is 9.4x smaller than an ARM Cortex A9 core, runs the algorithm 3.8x faster, and consumes 46.3x less energy.
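
To make the operation described in the "Parallel dot-products for deep learning on FPGA" abstract concrete, here is a minimal software sketch of an 8-bit fixed-point dot-product built from multiply-accumulate steps. It is not the authors' hardware design: the function names (`to_q8`, `macc_dot`, `fixed_dot`) and the choice of 6 fractional bits are assumptions made for illustration only, and the DSP/LUT mapping from the paper is not modeled.

```python
def to_q8(x, frac_bits=6):
    """Quantize a real number to signed 8-bit fixed point, saturating at the limits.

    frac_bits is a hypothetical choice; the abstract only states an 8-bit width.
    """
    scaled = round(x * (1 << frac_bits))
    return max(-128, min(127, scaled))


def macc_dot(a, b):
    """Dot-product of two integer vectors as a chain of multiply-accumulate steps."""
    acc = 0  # accumulator is wider than 8 bits, as each product needs up to 16 bits
    for x, y in zip(a, b):
        acc += x * y  # one MACC operation
    return acc


def fixed_dot(xs, ws, frac_bits=6):
    """Quantize both vectors, accumulate in integers, then rescale to a real value."""
    qa = [to_q8(x, frac_bits) for x in xs]
    qb = [to_q8(w, frac_bits) for w in ws]
    # Each product carries 2 * frac_bits fractional bits.
    return macc_dot(qa, qb) / (1 << (2 * frac_bits))


if __name__ == "__main__":
    print(fixed_dot([0.5, 0.25], [0.5, 1.0]))
```

Each iteration of the loop in `macc_dot` is independent of the data in other dot-products, which is why many such units can run side by side when memory bandwidth allows.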
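
The two basic steps named in the "K-means clustering on CGRA" abstract, assignment and update, can be sketched in plain software as follows. This is a generic k-means illustration, not the Versat implementation: the function names (`assign`, `update`, `kmeans`), the squared Euclidean distance, the fixed iteration count, and the handling of empty clusters are all assumptions.

```python
def assign(points, centroids):
    """Assignment step: label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        best, best_d = 0, None
        for k, c in enumerate(centroids):
            # Squared Euclidean distance (an assumed metric).
            d = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            if best_d is None or d < best_d:
                best, best_d = k, d
        labels.append(best)
    return labels


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for j in range(dims):
            sums[k][j] += p[j]
    # A centroid with no assigned points is left unchanged (assumed policy).
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, iterations=10):
    """Alternate the two steps for a fixed number of iterations."""
    labels = []
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return centroids, labels


if __name__ == "__main__":
    pts = [[0, 0], [0, 2], [10, 10], [10, 12]]
    print(kmeans(pts, [[0, 0], [10, 10]], iterations=5))
```

The nested loops in `assign` visit every datapoint, centroid, and coordinate once per iteration, which is consistent with the linear scaling in each of those parameters that the abstract reports.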