Search Results

Now showing 1 - 5 of 5
  • Feature selection for clustering categorical data with an embedded modelling approach
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data concerning official statistics shows its usefulness.
  • A clustering view on ESS measures of political interest: an EM-MML approach
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    In this work, we cluster European regions based on their citizens’ political interests and electoral participation, as expressed in data from the two most recent European Social Surveys (ESS): 2012 (round 6) and 2014 (round 7). We resort to a new clustering approach, named EM-MML, which clusters categorical data and simultaneously determines the number of clusters. Clustering is applied to sets of questions referring to whether the citizens were involved in “different ways of trying to improve things in their country or help prevent things from going wrong” – e.g., signed a petition or worked in a political organisation or association. The results of the EM-MML approach are compared with results from the classical EM approach combined with several information criteria. The EM-MML approach provides more parsimonious and robust solutions than those obtained by standard EM, and it is also faster than the other methods considered, which is especially relevant when dealing with large data sets.
  • Model selection in discrete clustering: the EM-MML algorithm
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    Finite mixture models are widely used for cluster analysis in several areas of application. They are commonly estimated through likelihood maximization (using diverse variants of the expectation-maximization algorithm), and the number of components (or clusters) is determined using information criteria: the EM algorithm is run several times and then one of the pre-estimated candidate models is selected (e.g. using the BIC criterion). We propose a new clustering approach, the EM-MML algorithm, to cluster categorical data (quite common in social sciences) and simultaneously identify the number of clusters. This approach assumes that the data come from a finite mixture of multinomials and uses a variant of EM to estimate the model parameters and a minimum message length (MML) criterion to estimate the number of clusters. EM-MML thus integrates estimation and model selection in a single algorithm. EM-MML is compared with traditional EM approaches, using alternative information criteria. Comparisons rely on synthetic datasets and also on a real dataset (data from the European Social Survey). The results obtained illustrate the parsimony of the EM-MML solutions as well as the cohesion-separation and stability of their clusters. A clear advantage of EM-MML is also the computation time.
  • Clustering and selecting categorical features
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal focus on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data concerning official statistics shows its usefulness.
  • The number of clusters on trust
    Publication . Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
    In this work we analyse the performance of a new Expectation Maximization (EM) clustering approach. This method is based on the Minimum Message Length (MML) criterion and simultaneously yields a clustering of categorical data and the number of clusters. We group European citizens based on their trust in institutions, using European Social Survey data. The results obtained illustrate the parsimony, cohesion-separation and stability of the EM-MML solutions, when compared to traditional EM-based approaches that rely on information criteria.
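
Several of the abstracts above contrast EM-MML with the classical baseline, in which EM is run once per candidate number of clusters and one of the fitted models is then picked with an information criterion such as BIC. The following is a minimal sketch of that baseline only, not of the authors' EM-MML algorithm. It assumes a mixture of independent categorical features with integer-coded categories; the function names (`fit_em`, `bic`, `select_k`), the smoothing constant `eps`, and the random initialisation are illustrative assumptions, not details taken from the publications.

```python
import math
import random


def fit_em(data, n_levels, k, n_iter=100, seed=0, eps=1e-9):
    """Fit a k-component mixture of independent categorical features by EM.

    data: list of rows, each a list of integer category codes (0-based).
    n_levels: number of categories of each feature.
    Returns (weights, theta, log_likelihood).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Random soft initialisation of the responsibilities (an assumption).
    resp = []
    for _ in range(n):
        row = [rng.random() + eps for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    weights, theta, loglik = [], [], float("-inf")
    for _ in range(n_iter):
        # M-step: mixing weights and per-cluster category probabilities.
        weights = [sum(resp[i][c] for i in range(n)) / n for c in range(k)]
        theta = []
        for c in range(k):
            comp = []
            for j in range(d):
                counts = [eps] * n_levels[j]  # eps avoids log(0) later
                for i in range(n):
                    counts[data[i][j]] += resp[i][c]
                total = sum(counts)
                comp.append([v / total for v in counts])
            theta.append(comp)
        # E-step: responsibilities via log-sum-exp, plus the log-likelihood.
        loglik = 0.0
        for i in range(n):
            logs = [
                math.log(weights[c] + eps)
                + sum(math.log(theta[c][j][data[i][j]]) for j in range(d))
                for c in range(k)
            ]
            m = max(logs)
            lse = m + math.log(sum(math.exp(v - m) for v in logs))
            resp[i] = [math.exp(v - lse) for v in logs]
            loglik += lse
    return weights, theta, loglik


def bic(loglik, n, k, n_levels):
    """BIC = -2 log L + (number of free parameters) * log n."""
    n_params = (k - 1) + k * sum(levels - 1 for levels in n_levels)
    return -2.0 * loglik + n_params * math.log(n)


def select_k(data, n_levels, candidates=(1, 2, 3, 4)):
    """Run EM once per candidate k and return the k with the lowest BIC."""
    scores = {}
    for k in candidates:
        _, _, loglik = fit_em(data, n_levels, k)
        scores[k] = bic(loglik, len(data), k, n_levels)
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: two groups of identical rows over three binary features.
    toy = [[0, 0, 0]] * 20 + [[1, 1, 1]] * 20
    best_k, scores = select_k(toy, [2, 2, 2])
    print(best_k, scores)
```

The sketch makes the cost of the baseline visible: every candidate number of clusters requires a full EM run, which is the overhead the abstracts say EM-MML avoids by selecting the number of clusters within a single run.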