Comparison of distributed computing approaches to complexity of n-gram extraction

Aubakirov, Sanzhar; Trigo, Paulo; Ahmed-Zaki, Darhan

http://hdl.handle.net/10400.21/6854

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
Comparison_PTrigo_ADEETC.pdf		322.3 KB	Adobe PDF

Authors

Aubakirov, Sanzhar

Trigo, Paulo

Ahmed-Zaki, Darhan

Abstract

In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process. To deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments were implemented using two kinds of datasets composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform that is best suited for each kind of dataset regarding its overall size and the granularity of the input data.
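
For readers unfamiliar with the task, the following is a minimal sketch of word-level n-gram counting arranged as a map step and a reduce step, the general pattern that platforms such as Hadoop and Spark distribute across machines. It is not the authors' implementation: the function names (`extract_ngrams`, `merge_counts`, `count_ngrams`), the whitespace tokenization, and the use of a local thread pool in place of a cluster are all assumptions made for illustration. The sketch also assumes that n-grams never span two chunks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def extract_ngrams(text, n):
    """Map step (hypothetical helper): count word-level n-grams in one chunk."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def merge_counts(left, right):
    """Reduce step (hypothetical helper): combine two partial n-gram counts."""
    return left + right


def count_ngrams(chunks, n, workers=4):
    """Count n-grams over many chunks; a thread pool stands in for cluster workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(lambda chunk: extract_ngrams(chunk, n), chunks))
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    documents = ["the cat sat on the mat", "the cat ate the fish"]
    # The bigram ("the", "cat") occurs once in each document.
    print(count_ngrams(documents, 2).most_common(1))  # [(('the', 'cat'), 2)]
```

In this arrangement each chunk corresponds to one input file or file split, so the two dataset shapes in the abstract (many small files versus few large files) change how many map tasks there are and how large each one is.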

Keywords

Distributed Computing; Text Processing; n-gram Extraction

URI

http://hdl.handle.net/10400.21/6854

Citation

AUBAKIROV, Sanzhar; TRIGO, Paulo; AHMED-ZAKI, Darhan – Comparison of distributed computing approaches to complexity of n-gram extraction. In Proceedings of the 5th International Conference on Data Management Technologies and Applications. Lisbon, Portugal: ScitePress, 2016. ISBN 978-989-758-193-9. <URL: http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0005943000250030>. Pp. 25-30

Publisher

Scitepress

DOI

10.5220/0005943000250030

Collections

ISEL - Eng. Elect. Tel. Comp. - Comunicações
