Publication

Model selection in discrete clustering: the EM-MML algorithm

Abstract(s)

Finite mixture models are widely used for cluster analysis in several areas of application. They are commonly estimated through likelihood maximization (using variants of the expectation-maximization, or EM, algorithm), and the number of components (or clusters) is determined by resorting to information criteria: the EM algorithm is run several times, once per candidate number of components, and one of the pre-estimated candidate models is then selected (e.g. using BIC). We propose a new approach, the EM-MML algorithm, that clusters categorical data (quite common in the social sciences) and simultaneously identifies the number of clusters. This approach assumes that the data come from a finite mixture of multinomials; it uses a variant of EM to estimate the model parameters and a minimum message length (MML) criterion to estimate the number of clusters. EM-MML thus integrates estimation and model selection in a single algorithm. EM-MML is compared with traditional EM approaches that use alternative information criteria. The comparisons rely on synthetic datasets and on a real dataset (data from the European Social Survey). The results illustrate the parsimony of the EM-MML solutions, as well as the cohesion-separation and stability of their clusters. EM-MML also shows a clear advantage in computation time.
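For orientation, the traditional two-stage procedure that the abstract contrasts with EM-MML (fit one model per candidate number of clusters with EM, then choose among them with BIC) might be sketched as below. This is a minimal illustration in Python, not the EM-MML algorithm and not the authors' code. The function names (`e_step`, `fit_em`, `bic`, `select_k`), the assumption that categorical variables are independent within each cluster, the random initialisation, and the small smoothing constant are all assumptions made for this sketch.

```python
import math
import random


def e_step(data, weights, theta):
    """Return responsibilities and the log-likelihood under current parameters."""
    resp, loglik = [], 0.0
    for x in data:
        logp = [
            math.log(w) + sum(math.log(comp[j][v]) for j, v in enumerate(x))
            for w, comp in zip(weights, theta)
        ]
        top = max(logp)  # log-sum-exp trick for numerical stability
        lse = top + math.log(sum(math.exp(lp - top) for lp in logp))
        loglik += lse
        resp.append([math.exp(lp - lse) for lp in logp])
    return resp, loglik


def fit_em(data, n_cats, k, n_iter=200, tol=1e-8, seed=0):
    """Fit a k-component mixture of categorical variables by EM (assumed sketch)."""
    rng = random.Random(seed)
    weights = [1.0 / k] * k
    theta = []  # theta[m][j][c]: P(variable j = category c | component m)
    for _ in range(k):
        comp = []
        for c in n_cats:
            raw = [rng.random() + 0.5 for _ in range(c)]
            comp.append([v / sum(raw) for v in raw])
        theta.append(comp)
    prev = -math.inf
    for _ in range(n_iter):
        resp, loglik = e_step(data, weights, theta)
        if loglik - prev < tol:  # stop once the log-likelihood stalls
            break
        prev = loglik
        # M-step: responsibility-weighted frequencies, lightly smoothed
        # (the 1e-6 constant is an assumption that keeps probabilities positive).
        for m in range(k):
            nk = sum(r[m] for r in resp)
            weights[m] = max(nk / len(data), 1e-12)
            for j, c in enumerate(n_cats):
                counts = [1e-6] * c
                for x, r in zip(data, resp):
                    counts[x[j]] += r[m]
                total = sum(counts)
                theta[m][j] = [v / total for v in counts]
    return weights, theta, e_step(data, weights, theta)[1]


def bic(loglik, n, n_cats, k):
    """BIC = -2 log L + (number of free parameters) * log n."""
    n_params = (k - 1) + k * sum(c - 1 for c in n_cats)
    return -2.0 * loglik + n_params * math.log(n)


def select_k(data, n_cats, k_values):
    """Run EM once per candidate k and return the k with the lowest BIC."""
    scores = {}
    for k in k_values:
        _, _, loglik = fit_em(data, n_cats, k)
        scores[k] = bic(loglik, len(data), n_cats, k)
    return min(scores, key=scores.get), scores
```

Here each observation is a tuple of category indices and `n_cats` lists the number of categories per variable, so a call such as `select_k(data, [3, 3, 3, 3], [1, 2, 3])` fits three separate models before choosing one. That repeated fitting is the cost the abstract says EM-MML avoids by estimating the number of clusters within a single run.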

Keywords

Finite mixture models; EM-MML algorithm; Number of clusters

Citation

SILVESTRE, Cláudia; CARDOSO, Margarida; FIGUEIREDO, Mário - Model selection in discrete clustering: the EM-MML algorithm. In: 9th International Conference of the ERCIM WG on Computational and Methodological Statistics, Seville, Spain (Universidad de Sevilla), 9-11 December 2016.

Publisher

CMStatistics