Publication

A parallel algorithm for statistical multiword term extraction from very large corpora

dc.contributor.author: Gonçalves, Carlos
dc.contributor.author: Silva, Joaquim F.
dc.contributor.author: Cunha, Jose Alberto C.
dc.date.accessioned: 2019-02-14T11:38:15Z
dc.date.available: 2019-02-14T11:38:15Z
dc.date.issued: 2015-11-30
dc.description.abstract: Multi-word Relevant Expressions (REs) can be defined as sequences of words (n-grams) with strong semantic meaning, such as "ice melting" and "Ministere des Affaires Etrangeres", useful in Information Retrieval, Document Clustering or Classification, and Indexing of Documents. The need to extract REs in several languages has steered research toward statistical approaches rather than symbolic methods, since the former are language-independent. Based on the assumption that REs have strong cohesion between their consecutive n-grams, the LocalMaxs algorithm is a language-independent approach that extracts REs. Despite its good precision, this extractor is time-consuming and becomes inoperable for Big Data when implemented sequentially. This paper presents the first parallel and distributed version of this algorithm, achieving almost linear speedup and sizeup when processing corpora of up to 1 billion words, using up to 54 virtual machines in a public cloud. This parallel version of the algorithm exploits the statistical knowledge of the n-grams in the corpus to promote locality of reference. [pt_PT]
dc.description.version: info:eu-repo/semantics/publishedVersion [pt_PT]
dc.identifier.citation: A parallel algorithm for statistical multiword term extraction from very large corpora. In 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS). New York, USA: IEEE, 2015. ISBN 978-1-4799-8937-9. Pp. 219-224 [pt_PT]
dc.identifier.doi: 10.1109/HPCC-CSS-ICESS.2015.72 [pt_PT]
dc.identifier.isbn: 978-1-4799-8936-2
dc.identifier.uri: http://hdl.handle.net/10400.21/9500
dc.language.iso: eng [pt_PT]
dc.publisher: Institute of Electrical and Electronics Engineers [pt_PT]
dc.relation.publisherversion: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336167 [pt_PT]
dc.subject: Text mining [pt_PT]
dc.subject: Large corpora [pt_PT]
dc.subject: Multiword terms [pt_PT]
dc.subject: Statistical extraction [pt_PT]
dc.subject: Parallel processing [pt_PT]
dc.subject: Cloud [pt_PT]
dc.title: A parallel algorithm for statistical multiword term extraction from very large corpora [pt_PT]
dc.type: conference object
dspace.entity.type: Publication
oaire.citation.conferencePlace: 24-26 Aug. 2015 - New York, USA [pt_PT]
oaire.citation.endPage: 224 [pt_PT]
oaire.citation.startPage: 219 [pt_PT]
oaire.citation.title: 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC) [pt_PT]
person.familyName: Cunha
person.givenName: Jose Alberto C.
person.identifier.orcid: 0000-0001-6729-8348
person.identifier.scopus-author-id: 7102903739
rcaap.rights: closedAccess [pt_PT]
rcaap.type: conferenceObject [pt_PT]
relation.isAuthorOfPublication: 0a3455dd-21ed-4934-9980-4ce73f77edc5
relation.isAuthorOfPublication.latestForDiscovery: 0a3455dd-21ed-4934-9980-4ce73f77edc5

Files

Original bundle
Name: CGoncalves.pdf
Size: 520.84 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: