Publication

Comparison of distributed computing approaches to complexity of n-gram extraction

Abstract(s)

In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process. To deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments were implemented using two kinds of datasets composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform best suited to each kind of dataset with regard to its overall size and the granularity of the input data.

Keywords

Distributed Computing; Text Processing; n-gram Extraction

Citation

AUBAKIROV, Sanzhar; TRIGO, Paulo; AHMED-ZAKI, Darhan – Comparison of distributed computing approaches to complexity of n-gram extraction. In Proceedings of the 5th International Conference on Data Management Technologies and Applications. Lisbon, Portugal: SciTePress, 2016. ISBN 978-989-758-193-9. <URL: http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0005943000250030>. Pp. 25-30

Publisher

SciTePress