Publication

Text classification using compression-based dissimilarity measures

dc.contributor.author: Coutinho, David Pereira
dc.contributor.author: Figueiredo, Mário A. T.
dc.date.accessioned: 2016-05-03T10:47:14Z
dc.date.available: 2016-05-03T10:47:14Z
dc.date.issued: 2015-08
dc.description.abstract: Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering, by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approximate, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering. [pt_PT]
dc.identifier.citation: COUTINHO, David Pereira; FIGUEIREDO, Mário A. T. - Text classification using compression-based dissimilarity measures. International Journal of Pattern Recognition and Artificial Intelligence. ISSN 0218-0014. Vol. 29 (2015), pp. 1-18 [pt_PT]
dc.identifier.doi: 10.1142/S0218001415530043 [pt_PT]
dc.identifier.issn: 0218-0014
dc.identifier.issn: 1793-6381
dc.identifier.uri: http://hdl.handle.net/10400.21/6144
dc.language.iso: eng [pt_PT]
dc.peerreviewed: yes [pt_PT]
dc.publisher: World Scientific Publications CO PTE LTD [pt_PT]
dc.relation.ispartofseries: 1553004
dc.subject: Text classification [pt_PT]
dc.subject: Text similarity measures [pt_PT]
dc.subject: Relative entropy [pt_PT]
dc.subject: Ziv-Merhav method [pt_PT]
dc.subject: Cross-parsing algorithm [pt_PT]
dc.title: Text classification using compression-based dissimilarity measures [pt_PT]
dc.type: journal article
dspace.entity.type: Publication
oaire.citation.endPage: 18
oaire.citation.issue: 5 [pt_PT]
oaire.citation.startPage: 1
oaire.citation.title: International Journal of Pattern Recognition and Artificial Intelligence [pt_PT]
oaire.citation.volume: 29 [pt_PT]
rcaap.rights: closedAccess [pt_PT]
rcaap.type: article [pt_PT]

Files

Original bundle
Name: Text classification using compresion-based dissimilarity measures.pdf
Size: 8.8 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission