Publication

Text classification using compression-based dissimilarity measures

Abstract(s)

Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with feature design and engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, shows that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
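
The idea of a dissimilarity-based representation can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simplified cross-parsing count (the number of phrases needed to parse one text using substrings of another) scaled by log2(n)/n as a rough cross-entropy-style score, and it omits the self-entropy term that a full relative entropy estimate would subtract. The names `cross_parse_count`, `dissimilarity`, `embed`, and `classify_1nn` are hypothetical, and the toy strings are made up for illustration.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by parsing z sequentially with respect to x.

    Each phrase is the longest prefix of the unparsed part of z that occurs
    somewhere in x; if no such prefix exists, a single character is consumed.
    """
    count = 0
    i = 0
    while i < len(z):
        length = 1
        # Grow the candidate phrase while it still occurs as a substring of x.
        while i + length <= len(z) and z[i:i + length] in x:
            length += 1
        # length - 1 is the longest match; always advance by at least one char.
        i += max(length - 1, 1)
        count += 1
    return count


def dissimilarity(z: str, x: str) -> float:
    """Rough cross-entropy-style score of z given x (assumed simplification)."""
    n = len(z)
    if n < 2:
        return 0.0
    return cross_parse_count(z, x) * math.log2(n) / n


def embed(text: str, prototypes: list[str]) -> list[float]:
    """Map a text to its vector of dissimilarities to each prototype text."""
    return [dissimilarity(text, p) for p in prototypes]


def classify_1nn(text: str, prototypes: list[str], labels: list[str]) -> str:
    """Nearest-neighbor classification in the dissimilarity space."""
    query = embed(text, prototypes)
    train = [embed(p, prototypes) for p in prototypes]
    best = min(range(len(train)), key=lambda k: math.dist(query, train[k]))
    return labels[best]


if __name__ == "__main__":
    # Toy prototypes standing in for labeled training texts.
    prototypes = ["aaaaaaaa", "bbbbbbbb"]
    labels = ["A", "B"]
    print(classify_1nn("aaaa", prototypes, labels))
```

No tokenization, stemming, or hand-built features appear anywhere in the sketch: the raw strings go straight into the dissimilarity measure, and any classical classifier could replace the nearest-neighbor step once the texts are embedded.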

Keywords

Text classification; Text similarity measures; Relative entropy; Ziv-Merhav method; Cross-parsing algorithm

Citation

COUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 23 (2015), pp. 1-18

Publisher

World Scientific Publications CO PTE LTD
