Model selection in discrete clustering: the EM-MML algorithm

Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário

http://hdl.handle.net/10400.21/7688

Use this identifier to reference this record.

Name:	Description:	Size:	Format:
Model selection in discrete clustering- international conference.pdf		175.31 KB	Adobe PDF

Authors

Silvestre, Cláudia

Cardoso, Margarida

Figueiredo, Mário

Abstract(s)

Finite mixture models are widely used for cluster analysis in many application areas. They are commonly estimated by likelihood maximization (using variants of the expectation-maximization (EM) algorithm), and the number of components (or clusters) is determined using information criteria: the EM algorithm is run several times, producing a set of candidate models, and one of them is then selected (e.g. using the BIC criterion). We propose a new approach, the EM-MML algorithm, that clusters categorical data (quite common in the social sciences) and simultaneously identifies the number of clusters. This approach assumes that the data come from a finite mixture of multinomials and uses a variant of EM to estimate the model parameters and a minimum message length (MML) criterion to estimate the number of clusters. EM-MML thus seamlessly integrates estimation and model selection in a single algorithm. EM-MML is compared with traditional EM approaches that use alternative information criteria. Comparisons rely on synthetic datasets and on a real dataset (data from the European Social Survey). The results illustrate the parsimony of the EM-MML solutions as well as the cohesion-separation and stability of their clusters. EM-MML also has a clear advantage in computation time.
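
For orientation, the conventional baseline mentioned in the abstract (run EM for several candidate numbers of clusters, then pick one model with BIC) can be sketched roughly as follows. This is not the EM-MML algorithm and not the authors' code. It is a minimal illustration that assumes each observation is a row of integer-coded categorical variables, modelled as conditionally independent within a cluster. The function names (`e_step`, `fit_em`, `bic`, `select_k_by_bic`), the random Dirichlet initialization, the smoothing constant and the default settings are all assumptions made for this example.

```python
import numpy as np


def e_step(X, weights, theta):
    """Return responsibilities (n, K) and the log-likelihood of X."""
    # log p(x_i, component k) = log weight_k + sum_j log theta_j[k, x_ij]
    log_joint = np.log(weights) + sum(
        np.log(theta[j][:, X[:, j]]).T for j in range(X.shape[1])
    )
    log_norm = np.logaddexp.reduce(log_joint, axis=1)
    return np.exp(log_joint - log_norm[:, None]), log_norm.sum()


def fit_em(X, n_cats, K, n_iter=300, tol=1e-6, seed=0):
    """EM for a K-component mixture of independent categorical variables.

    X      : (n, d) integer array, X[i, j] in {0, ..., n_cats[j] - 1}
    n_cats : number of categories of each variable (assumed known)
    """
    rng = np.random.default_rng(seed)
    weights = np.full(K, 1.0 / K)
    # theta[j][k, c] = P(variable j takes category c | component k)
    theta = [rng.dirichlet(np.ones(c), size=K) for c in n_cats]
    prev = -np.inf
    for _ in range(n_iter):
        resp, loglik = e_step(X, weights, theta)
        if loglik - prev < tol:  # stop when the likelihood stalls
            break
        prev = loglik
        # M-step: mixing weights, floored to avoid log(0)
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        weights /= weights.sum()
        # M-step: weighted category frequencies, lightly smoothed
        for j, c in enumerate(n_cats):
            counts = resp.T @ np.eye(c)[X[:, j]] + 1e-6
            theta[j] = counts / counts.sum(axis=1, keepdims=True)
    _, loglik = e_step(X, weights, theta)  # likelihood at final parameters
    return loglik, weights, theta


def bic(loglik, n, K, n_cats):
    """BIC = -2 log L + (number of free parameters) * log n."""
    n_params = (K - 1) + K * sum(c - 1 for c in n_cats)
    return -2.0 * loglik + n_params * np.log(n)


def select_k_by_bic(X, n_cats, k_max=4, n_starts=5):
    """Fit every candidate K from several starts; keep the lowest BIC."""
    scores = {}
    for K in range(1, k_max + 1):
        best = max(fit_em(X, n_cats, K, seed=s)[0] for s in range(n_starts))
        scores[K] = bic(best, X.shape[0], K, n_cats)
    return min(scores, key=scores.get), scores
```

Calling `select_k_by_bic(X, n_cats)` returns the selected number of clusters together with the BIC value of every candidate. The nested loop over candidate values of K and restarts is the repeated fitting that, according to the abstract, EM-MML avoids by integrating estimation and model selection in a single run.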

Keywords

Finite mixture models; EM-MML algorithm; Number of clusters

URI

http://hdl.handle.net/10400.21/7688

Citation

SILVESTRE, Cláudia; CARDOSO, Margarida; FIGUEIREDO, Mário - Model selection in discrete clustering: the EM-MML algorithm. In: International Conference of the ERCIM WG on Computational and Methodological Statistics, 9th, Seville, Spain (Universidad de Sevilla), 2016 (9-11 December)