A theoretical model for n-gram distribution in big data corpora

Silva, Joaquim F.; Gonçalves, Carlos Jorge de Sousa; Cunha, José C.

http://hdl.handle.net/10400.21/6829

Use this identifier to cite or link to this record.

Name:	Description:	Size:	Format:
Theoretical_CGoncalves_ADEETC.pdf		167.66 KB	Adobe PDF

Authors

Silva, Joaquim F.

Gonçalves, Carlos Jorge de Sousa

Cunha, José C.

Abstract

There is a wide diversity of applications relying on the identification of the sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach for determining the statistical distribution of the n-grams but are usually constrained by the corpora sizes, which for practical reasons stay far away from Big Data. However, Big Data sizes imply hidden behaviors to the applications, such as extraction of relevant information from Web scale sources. In this paper we propose a theoretical approach for estimating the number of distinct n-grams in each corpus. It is based on the Zipf-Mandelbrot Law and the Poisson distribution, and it allows an efficient estimation of the number of distinct 1-grams, 2-grams, ..., 6-grams, for any corpus size. The proposed model was validated for English and French corpora. We illustrate a practical application of this approach to the extraction of relevant expressions from natural language corpora, and predict its asymptotic behaviour for increasingly large sizes.
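As a rough illustration of how a Zipf-Mandelbrot rank distribution and a Poisson occurrence model can be combined to estimate a count of distinct n-grams, the sketch below may help. It is not taken from the paper and may differ from the authors' formulation. It assumes a fixed set of possible n-grams of size vocab_size, assigns each rank r a probability proportional to 1 / (r + beta)^alpha, and treats an n-gram of probability p as appearing at least once in a corpus of N n-gram positions with probability 1 - exp(-N * p). The function names and the values of vocab_size, alpha and beta are hypothetical, chosen only for illustration.

```python
import math


def zipf_mandelbrot_probs(vocab_size, alpha, beta):
    """Normalized probabilities p_r proportional to 1 / (r + beta)**alpha, r = 1..vocab_size."""
    weights = [1.0 / (r + beta) ** alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(corpus_size, probs):
    """Expected number of distinct n-grams under a Poisson occurrence model.

    An n-gram with probability p is assumed to occur at least once with
    probability 1 - exp(-corpus_size * p); -expm1(-x) computes 1 - exp(-x)
    accurately for small x.
    """
    return sum(-math.expm1(-corpus_size * p) for p in probs)


# Illustrative (assumed) parameters, not values reported in the paper.
probs = zipf_mandelbrot_probs(vocab_size=50_000, alpha=1.1, beta=2.7)
for size in (10_000, 1_000_000, 100_000_000):
    print(size, round(expected_distinct(size, probs)))
```

Under these assumptions the estimate grows with corpus size and levels off as it approaches vocab_size, which is the kind of asymptotic behaviour the abstract refers to.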

Keywords

n-gram Models; Big Data; Zipf-Mandelbrot Law; Poisson Distribution; Extraction of Relevant Expressions

URI

http://hdl.handle.net/10400.21/6829

Citation

SILVA, Joaquim F.; GONÇALVES, Carlos; CUNHA, José C. – A theoretical model for n-gram distribution in big data corpora. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016. ISBN 978-1-4673-9006-4.

Publisher

Institute of Electrical and Electronics Engineers

DOI

10.1109/BigData.2016.7840598

Collections

ISEL - Eng. Elect. Tel. Comp. - Comunicações
