An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs
dc.contributor.author | Gonçalves, Carlos | |
dc.contributor.author | Silva, Joaquim F. | |
dc.contributor.author | Cunha, José C. | |
dc.date.accessioned | 2019-03-06T10:17:22Z | |
dc.date.available | 2019-03-06T10:17:22Z | |
dc.date.issued | 2017-03-06 | |
dc.description.abstract | LocalMaxs extracts relevant multiword terms based on their cohesion but is computationally intensive, a critical issue for very large natural language corpora. The corpus properties concerning n-gram distribution determine the algorithm's complexity and were empirically analyzed for corpora of up to 982 million words. A parallel LocalMaxs implementation exhibits almost linear relative efficiency, speedup, and sizeup when executed with up to 48 cloud virtual machines and a distributed key-value store. To reduce remote data communication, we present a novel n-gram cache with cooperative-based warm-up, leading to a reduced miss ratio and time penalty. A cache analytical model is used to estimate the performance of the cohesion calculation of n-gram expressions, based on empirical corpus data. The model estimates agree with the real execution results. | pt_PT |
dc.description.version | info:eu-repo/semantics/publishedVersion | pt_PT |
dc.identifier.citation | GONÇALVES, Carlos; SILVA, Joaquim F.; CUNHA, José C. – An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs. In Proceedings of the 2016 IEEE 12th International Conference on e-Science (e-Science). Baltimore, MD, USA: IEEE, 2016. ISBN 978-1-5090-4273-9. Pp. 120-129 | pt_PT |
dc.identifier.doi | 10.1109/eScience.2016.7870892 | pt_PT |
dc.identifier.isbn | 978-1-5090-4273-9 | |
dc.identifier.isbn | 978-1-5090-4272-2 | |
dc.identifier.isbn | 978-1-5090-4274-6 | |
dc.identifier.issn | 2325-372X | |
dc.identifier.uri | http://hdl.handle.net/10400.21/9637 | |
dc.language.iso | eng | pt_PT |
dc.publisher | Institute of Electrical and Electronics Engineers | pt_PT |
dc.relation.publisherversion | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7870892&tag=1 | pt_PT |
dc.subject | Large corpora | pt_PT |
dc.subject | Statistical extraction | pt_PT |
dc.subject | Multiword terms | pt_PT |
dc.subject | Parallel processing | pt_PT |
dc.subject | n-gram cache | pt_PT |
dc.subject | Performance evaluation | pt_PT |
dc.subject | Cloud computing | pt_PT |
dc.title | An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs | pt_PT |
dc.type | conference object | |
dspace.entity.type | Publication | |
oaire.awardURI | info:eu-repo/grantAgreement/FCT/5876/UID%2FCEC%2F04516%2F2013/PT | |
oaire.citation.conferencePlace | 23-27 Oct. 2016 - Baltimore, MD, USA | pt_PT |
oaire.citation.endPage | 129 | pt_PT |
oaire.citation.startPage | 120 | pt_PT |
oaire.citation.title | 12th International Conference on e-Science (e-Science) | pt_PT |
oaire.fundingStream | 5876 | |
project.funder.identifier | http://doi.org/10.13039/501100001871 | |
project.funder.name | Fundação para a Ciência e a Tecnologia | |
rcaap.rights | closedAccess | pt_PT |
rcaap.type | conferenceObject | pt_PT |
relation.isProjectOfPublication | cbce3bca-c959-4bd5-a02c-0f1de598f8e0 | |
relation.isProjectOfPublication.latestForDiscovery | cbce3bca-c959-4bd5-a02c-0f1de598f8e0 |