
Search Results

Now showing 1 - 5 of 5
  • An MML embedded approach for estimating the number of clusters
    Publication . Silvestre, Cláudia; Cardoso, Maria Margarida G. M. S.; Figueiredo, Mário
    Assuming that the data originate from a finite mixture of multinomial distributions, we study the performance of an integrated Expectation Maximization (EM) algorithm that uses the Minimum Message Length (MML) criterion to select the number of mixture components. The EM-MML approach, rather than selecting one among a set of pre-estimated candidate models (which requires running EM several times), seamlessly integrates estimation and model selection in a single algorithm. Comparisons are provided with EM combined with well-known information criteria – e.g. the Bayesian Information Criterion. We resort to synthetic data examples and a real application. The EM-MML computation time is a clear advantage of this method; also, the real data solution it provides is more parsimonious, which reduces the risk of model order overestimation and improves interpretability.
  • A clustering view on ESS measures of political interest: an EM-MML approach
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    In this work, we cluster European regions based on their citizens’ political interests and electoral participation, as expressed in data from the two most recent European Social Surveys (ESS) - 2012 (round 6) and 2014 (round 7). We resort to a new clustering approach, named EM-MML, which clusters categorical data and simultaneously determines the number of clusters. Clustering is applied to sets of questions referring to whether the citizens were involved in “different ways of trying to improve things in their country or help prevent things from going wrong” – e.g., signed a petition or worked in a political organisation or association. The results of the EM-MML approach are compared with results from the classical EM approach combined with several information criteria. The EM-MML approach provides more parsimonious and robust solutions than those obtained by standard EM, and it is also faster than the other methods considered, which is especially relevant when dealing with large data sets.
  • An MML embedded approach for estimating the number of clusters
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    Assuming that the data originate from a finite mixture of multinomial distributions, we study the performance of an integrated Expectation Maximization (EM) algorithm that uses the Minimum Message Length (MML) criterion to select the number of mixture components. The EM-MML approach, rather than selecting one among a set of pre-estimated candidate models (which requires running EM several times), seamlessly integrates estimation and model selection in a single algorithm. Comparisons are provided with EM combined with well-known information criteria – e.g. the Bayesian Information Criterion. We resort to synthetic data examples and a real application. The EM-MML computation time is a clear advantage of this method; also, the real data solution it provides is more parsimonious, which reduces the risk of model order overestimation and improves interpretability.
  • The number of clusters on trust
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    In this work we analyse the performance of a new Expectation Maximization (EM) clustering approach. This method is based on the Minimum Message Length (MML) criterion and simultaneously yields a clustering of categorical data and the number of clusters. We group European citizens based on their trust in institutions, using European Social Survey data. The results obtained illustrate the parsimony, cohesion-separation and stability of the EM-MML solutions when compared to traditional EM-based approaches that rely on information criteria.
  • Categorical data clustering using a minimum message length criterion
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on the integration of model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with the use of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The obtained results illustrate the capacity of the proposed algorithm to attain the true number of clusters while outperforming BIC and ICL since it is faster, which is especially relevant when dealing with large data sets.
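
The idea shared by the abstracts above, estimating a multinomial mixture and its number of components in a single EM run, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the penalized weight update of Figueiredo and Jain (2002), in which each component's expected count is reduced by half its number of free parameters and components whose weight reaches zero are dropped. For brevity it updates all components simultaneously rather than one at a time, and it assumes every variable has the same number of categories. The function name `em_mml_categorical` and its parameters are hypothetical.

```python
import numpy as np

def em_mml_categorical(X, n_levels, k_max=10, n_iter=200, seed=0):
    """Illustrative EM with an MML-style penalty on the mixing weights.

    X        : (n, d) integer array, categories coded 0..n_levels-1
    n_levels : number of categories per variable (assumed equal for all)
    k_max    : initial, deliberately too large, number of components
    Returns the surviving mixing weights and category probabilities.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    onehot = np.eye(n_levels)[X]            # (n, d, n_levels) indicator coding
    n_params = d * (n_levels - 1)           # free parameters per component
    alpha = np.full(k_max, 1.0 / k_max)     # mixing weights
    theta = rng.dirichlet(np.ones(n_levels), size=(k_max, d))  # (k, d, n_levels)

    for _ in range(n_iter):
        # E-step: responsibilities from log-likelihoods, stabilised by
        # subtracting the row-wise maximum before exponentiating.
        log_p = np.einsum('idl,kdl->ik', onehot, np.log(theta)) + np.log(alpha)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step for the weights: subtract n_params / 2 from each expected
        # count and clip at zero, so weakly supported components vanish.
        support = np.maximum(resp.sum(axis=0) - n_params / 2.0, 0.0)
        keep = support > 0
        if not keep.any():
            break                           # keep the previous estimates
        alpha = support[keep] / support[keep].sum()

        # M-step for the category probabilities of surviving components
        # (a small constant avoids taking log(0) in the next E-step).
        counts = np.einsum('ik,idl->kdl', resp[:, keep], onehot) + 1e-6
        theta = counts / counts.sum(axis=2, keepdims=True)

    return alpha, theta
```

The number of clusters is read off as `len(alpha)` after the run, so no separate pass over candidate models with BIC or ICL is needed; in this sketch `k_max` only has to be larger than the number of clusters expected in the data.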