An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs

Gonçalves, Carlos; Silva, Joaquim F.; Cunha, José C.

http://hdl.handle.net/10400.21/9637

Use this identifier to cite or link to this record.

Name:	Description:	Size:	Format:
CGoncalves.pdf		882.67 KB	Adobe PDF

Authors

Gonçalves, Carlos

Silva, Joaquim F.

Cunha, José C.

Abstract

LocalMaxs extracts relevant multiword terms based on their cohesion but is computationally intensive, a critical issue for very large natural language corpora. The corpus properties concerning n-gram distribution determine the algorithm complexity and were empirically analyzed for corpora up to 982 million words. A parallel LocalMaxs implementation exhibits almost linear relative efficiency, speedup, and sizeup, when executed with up to 48 cloud virtual machines and a distributed key-value store. To reduce the remote data communication, we present a novel n-gram cache with cooperative-based warm-up, leading to reduced miss ratio and time penalty. A cache analytical model is used to estimate the performance of cohesion calculation of n-gram expressions, based on corpus empirical data. The model estimates agree with the real execution results.
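The abstract does not describe the cache design, so the following is only a generic sketch of the idea it names: a local cache placed in front of a remote key-value store of n-gram frequencies, with a warm-up step and a measured miss ratio. The class `NGramCache`, the LRU eviction policy, the plain preload in `warm_up` (which is not the paper's cooperative scheme), and the function `expected_lookup_time` are all assumptions made for illustration. The timing formula is the standard hit-time plus miss-ratio-times-penalty estimate, not the authors' analytical model.

```python
from collections import OrderedDict


class NGramCache:
    """Hypothetical LRU cache in front of a remote n-gram frequency store."""

    def __init__(self, remote_store, capacity):
        # remote_store is a plain dict standing in for a distributed key-value store.
        self.remote_store = remote_store
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def warm_up(self, ngrams):
        """Preload entries without counting them as hits or misses (assumed policy)."""
        for ngram in ngrams:
            self._store(ngram, self.remote_store.get(ngram, 0))

    def frequency(self, ngram):
        """Return the n-gram frequency, fetching from the remote store on a miss."""
        if ngram in self.entries:
            self.hits += 1
            self.entries.move_to_end(ngram)  # mark as most recently used
            return self.entries[ngram]
        self.misses += 1
        value = self.remote_store.get(ngram, 0)
        self._store(ngram, value)
        return value

    def _store(self, ngram, value):
        self.entries[ngram] = value
        self.entries.move_to_end(ngram)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

    def miss_ratio(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def expected_lookup_time(miss_ratio, hit_time, miss_penalty):
    """Standard estimate: every lookup pays hit_time, misses also pay miss_penalty."""
    return hit_time + miss_ratio * miss_penalty


# Toy data: n-grams are tuples of words, values are made-up frequencies.
remote = {("new", "york"): 120, ("york", "city"): 80, ("new", "york", "city"): 60}
cache = NGramCache(remote, capacity=2)
cache.warm_up([("new", "york")])
for ngram in [("new", "york"), ("york", "city"), ("new", "york")]:
    cache.frequency(ngram)
print(cache.hits, cache.misses)
```

In this toy run the warmed-up bigram is served locally and only the second lookup goes to the remote store, which is the effect a warm-up phase is meant to produce.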

Keywords

Large corpora; Statistical extraction; Multiword terms; Parallel processing; n-gram cache; Performance evaluation; Cloud computing

URI

http://hdl.handle.net/10400.21/9637

Citation

GONÇALVES, Carlos; SILVA, Joaquim F.; CUNHA, José C. – An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs. In Proceedings of the 2016 IEEE 12th International Conference on e-Science (e-Science). Baltimore, MD, USA: IEEE, 2016. ISBN 978-1-5090-4273-9. Pp. 120-129

Publisher

Institute of Electrical and Electronics Engineers

DOI

10.1109/eScience.2016.7870892

Collections

ISEL - Eng. Elect. Tel. Comp. - Comunicações
