Publication

Comparison of distributed computing approaches to complexity of n-gram extraction

File: Comparison_PTrigo_ADEETC.pdf (322.3 KB, Adobe PDF)

Abstract

In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process. To deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments were implemented using two kinds of datasets composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform that is best suited for each kind of dataset regarding its overall size and the granularity of the input data.
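
To make the task concrete, the following is a minimal single-process sketch of word n-gram counting written in a map/reduce style. It is not the authors' implementation and does not use MPJ Express, Hadoop, or Spark; the function names `extract_ngrams`, `merge_counts`, and `count_ngrams` are hypothetical, and whitespace tokenization is an assumption. On a distributed platform, the map step would typically run on separate workers, one per file or file split.

```python
from collections import Counter
from functools import reduce


def extract_ngrams(text, n):
    """Map step (hypothetical helper): count word n-grams in one document."""
    # Assumption: lowercase text split on whitespace is an adequate tokenizer.
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def merge_counts(left, right):
    """Reduce step (hypothetical helper): combine two partial counts."""
    return left + right


def count_ngrams(documents, n):
    """Count n-grams per document, then merge the partial results."""
    partial_counts = (extract_ngrams(doc, n) for doc in documents)
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    docs = ["the cat sat", "the cat ran"]
    for ngram, count in count_ngrams(docs, 2).most_common():
        print(" ".join(ngram), count)
```

Because each document is processed independently and the partial counts are merged with an associative operation, the same structure applies to both dataset shapes described above: many small files yield many cheap map calls, while a few large files yield a few expensive ones.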

Keywords

Distributed Computing; Text Processing; n-gram Extraction

Citation

AUBAKIROV, Sanzhar; TRIGO, Paulo; AHMED-ZAKI, Darhan – Comparison of distributed computing approaches to complexity of n-gram extraction. In Proceedings of the 5th International Conference on Data Management Technologies and Applications. Lisbon, Portugal: SciTePress, 2016. ISBN 978-989-758-193-9. <URL: http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0005943000250030>. Pp. 25-30.

Publisher

SciTePress
