Publication

Comparison of distributed computing approaches to complexity of n-gram extraction

dc.contributor.author: Aubakirov, Sanzhar
dc.contributor.author: Trigo, Paulo
dc.contributor.author: Ahmed-Zaki, Darhan
dc.date.accessioned: 2017-03-14T10:59:43Z
dc.date.available: 2017-03-14T10:59:43Z
dc.date.issued: 2016
dc.description.abstract: In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process. In order to deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments were implemented using two kinds of datasets composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform that is best suited for each kind of data set regarding its overall size and the granularity of the input data. (pt_PT)
dc.description.version: info:eu-repo/semantics/publishedVersion (pt_PT)
dc.identifier.citation: AUBAKIROV, Sanzhar; TRIGO, Paulo; AHMED-ZAKI, Darhan – Comparison of distributed computing approaches to complexity of n-gram extraction. In Proceedings of the 5th International Conference on Data Management Technologies and Applications. Lisbon, Portugal: ScitePress, 2016. ISBN 978-989-758-193-9. <URL: http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0005943000250030>. Pp. 25-30 (pt_PT)
dc.identifier.doi: 10.5220/0005943000250030 (pt_PT)
dc.identifier.uri: http://hdl.handle.net/10400.21/6854
dc.language.iso: eng (pt_PT)
dc.peerreviewed: yes (pt_PT)
dc.publisher: Scitepress (pt_PT)
dc.relation.publisherversion: http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0005943000250030 (pt_PT)
dc.subject: Distributed Computing (pt_PT)
dc.subject: Text Processing (pt_PT)
dc.subject: n-gram Extraction (pt_PT)
dc.title: Comparison of distributed computing approaches to complexity of n-gram extraction (pt_PT)
dc.type: conference object
dspace.entity.type: Publication
oaire.citation.endPage: 30 (pt_PT)
oaire.citation.startPage: 25 (pt_PT)
oaire.citation.title: Scitepress (pt_PT)
rcaap.rights: openAccess (pt_PT)
rcaap.type: conferenceObject (pt_PT)

Files

Original bundle
Name: Comparison_PTrigo_ADEETC.pdf
Size: 322.3 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission