A parallel algorithm for statistical multiword term extraction from very large corpora

Gonçalves, Carlos; Silva, Joaquim F.; Cunha, Jose Alberto C.

Publication

2015-11-30, Conference paper

dc.contributor.author	Gonçalves, Carlos
dc.contributor.author	Silva, Joaquim F.
dc.contributor.author	Cunha, Jose Alberto C.
dc.date.accessioned	2019-02-14T11:38:15Z
dc.date.available	2019-02-14T11:38:15Z
dc.date.issued	2015-11-30
dc.description.abstract	Multi-word Relevant Expressions (REs) can be defined as sequences of words (n-grams) with strong semantic meaning, such as "ice melting" and "Ministere des Affaires Etrangeres", which are useful in Information Retrieval, Document Clustering or Classification, and Indexing of Documents. The need to extract REs in several languages has led research toward statistical approaches rather than symbolic methods, since the former allow language independence. Based on the assumption that REs have strong cohesion between their consecutive n-grams, the LocalMaxs algorithm is a language-independent approach to extracting REs. Despite its good precision, this extractor is time-consuming and becomes inoperable for Big Data if implemented sequentially. This paper presents the first parallel and distributed version of this algorithm, achieving almost linear speedup and sizeup when processing corpora of up to 1 billion words, using up to 54 virtual machines in a public cloud. This parallel version of the algorithm exploits the statistical knowledge of the n-grams in the corpus to promote locality of reference.	pt_PT
dc.description.version	info:eu-repo/semantics/publishedVersion	pt_PT
dc.identifier.citation	A parallel algorithm for statistical multiword term extraction from very large corpora. In 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS). New York, USA: IEEE, 2015. ISBN 978-1-4799-8937-9. Pp. 219-224	pt_PT
dc.identifier.doi	10.1109/HPCC-CSS-ICESS.2015.72	pt_PT
dc.identifier.isbn	978-1-4799-8936-2
dc.identifier.uri	http://hdl.handle.net/10400.21/9500
dc.language.iso	eng	pt_PT
dc.publisher	Institute of Electrical and Electronics Engineers	pt_PT
dc.relation.publisherversion	https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336167	pt_PT
dc.subject	Text mining	pt_PT
dc.subject	Large corpora	pt_PT
dc.subject	Multiword terms	pt_PT
dc.subject	Statistical extraction	pt_PT
dc.subject	Parallel processing	pt_PT
dc.subject	Cloud	pt_PT
dc.title	A parallel algorithm for statistical multiword term extraction from very large corpora	pt_PT
dc.type	conference object
dspace.entity.type	Publication
oaire.citation.conferencePlace	24-26 Aug. 2015 - New York, USA	pt_PT
oaire.citation.endPage	224	pt_PT
oaire.citation.startPage	219	pt_PT
oaire.citation.title	2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC)	pt_PT
person.familyName	Cunha
person.givenName	Jose Alberto C.
person.identifier.orcid	0000-0001-6729-8348
person.identifier.scopus-author-id	7102903739
rcaap.rights	closedAccess	pt_PT
rcaap.type	conferenceObject	pt_PT
relation.isAuthorOfPublication	0a3455dd-21ed-4934-9980-4ce73f77edc5
relation.isAuthorOfPublication.latestForDiscovery	0a3455dd-21ed-4934-9980-4ce73f77edc5

Files

Main

Name: CGoncalves.pdf
Size: 520.84 KB
Format: Adobe Portable Document Format

Collections

ISEL - Eng. Elect. Tel. Comp. - Comunicações