Publication

An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs

Abstract

LocalMaxs extracts relevant multiword terms based on their cohesion but is computationally intensive, a critical issue for very large natural language corpora. The corpus properties concerning n-gram distribution determine the algorithm complexity and were empirically analyzed for corpora up to 982 million words. A parallel LocalMaxs implementation exhibits almost linear relative efficiency, speedup, and sizeup, when executed with up to 48 cloud virtual machines and a distributed key-value store. To reduce the remote data communication, we present a novel n-gram cache with cooperative-based warm-up, leading to reduced miss ratio and time penalty. A cache analytical model is used to estimate the performance of cohesion calculation of n-gram expressions, based on corpus empirical data. The model estimates agree with the real execution results.
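The idea of placing a local n-gram cache in front of a remote key-value store, and of estimating its cost from a miss ratio, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `NGramCache`, the function `mean_access_time`, the in-memory dictionary standing in for the distributed store, and the simple "hit time plus miss ratio times miss penalty" formula are all assumptions made for illustration, and the warm-up here is a plain preload rather than the cooperative scheme described in the paper.

```python
from typing import Dict, Iterable, Tuple

NGram = Tuple[str, ...]


class NGramCache:
    """Hypothetical local cache of n-gram counts in front of a remote store."""

    def __init__(self, remote: Dict[NGram, int]) -> None:
        # Assumption: a plain dict stands in for the distributed key-value store.
        self.remote = remote
        self.local: Dict[NGram, int] = {}
        self.hits = 0
        self.misses = 0

    def warm_up(self, ngrams: Iterable[NGram]) -> None:
        """Preload known n-grams before the main computation.

        Warm-up fetches are not counted as hits or misses.
        """
        for ngram in ngrams:
            if ngram in self.remote:
                self.local[ngram] = self.remote[ngram]

    def count(self, ngram: NGram) -> int:
        """Return the n-gram count, fetching from the remote store on a miss."""
        if ngram in self.local:
            self.hits += 1
            return self.local[ngram]
        self.misses += 1
        value = self.remote.get(ngram, 0)  # unseen n-grams count as 0
        self.local[ngram] = value
        return value

    def miss_ratio(self) -> float:
        """Fraction of lookups that had to go to the remote store."""
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def mean_access_time(miss_ratio: float, t_hit: float, t_miss_penalty: float) -> float:
    """Textbook estimate: every access pays t_hit, misses pay an extra penalty."""
    return t_hit + miss_ratio * t_miss_penalty


if __name__ == "__main__":
    store = {("new",): 8, ("york",): 6, ("new", "york"): 5}
    cache = NGramCache(store)
    cache.warm_up([("new",)])
    for gram in [("new",), ("new", "york"), ("new", "york")]:
        cache.count(gram)
    print(cache.miss_ratio())
    print(mean_access_time(cache.miss_ratio(), t_hit=1.0, t_miss_penalty=8.0))
```

In this sketch a warm-up lowers the miss ratio because preloaded n-grams are served locally on first use, which is the effect the abstract attributes to its cache warm-up; the time units passed to `mean_access_time` are arbitrary.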

Keywords

Large corpora; Statistical extraction; Multiword terms; Parallel processing; n-gram cache; Performance evaluation; Cloud computing

Citation

GONÇALVES, Carlos; SILVA, Joaquim F.; CUNHA, José C. – An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs. In Proceedings of the 2016 IEEE 12th International Conference on e-Science (e-Science). Baltimore, MD, USA: IEEE, 2016. ISBN 978-1-5090-4273-9. Pp. 120-129

Publisher

Institute of Electrical and Electronics Engineers
