Feature transformation and reduction for text classification

J. Ferreira, Artur; Figueiredo, Mario

http://hdl.handle.net/10400.21/17914

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
Feature_AJFerreira.pdf		111.12 KB	Adobe PDF

Authors

J. Ferreira, Artur

Figueiredo, Mario

Abstract

Text classification is an important tool for many applications, in supervised, semi-supervised, and unsupervised scenarios. In order to be processed by machine learning methods, a text (document) is usually represented as a bag-of-words (BoW). A BoW is a large vector of features (usually stored as floating point values), each of which represents the relative frequency of occurrence of a given word/term in the document. Typically, we have a large number of features, many of which may be non-informative for classification tasks, and thus the need for feature transformation, reduction, and selection arises. In this paper, we propose two efficient algorithms for feature transformation and reduction for BoW-like representations. The proposed algorithms rely on simple statistical analysis of the input pattern, exploiting the BoW and its binary version. The algorithms are evaluated with support vector machine (SVM) and AdaBoost classifiers on standard benchmark datasets. The experimental results show the adequacy of the reduced/transformed binary features for text classification problems as well as the improvement on the test set error rate, using the proposed methods.
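
As a rough illustration of the representation described in the abstract, the sketch below builds a BoW vector of relative term frequencies for each document, together with its binary version. It is not the paper's proposed algorithms, which this record does not detail. The function name bow_vectors, the whitespace tokenizer, and the choice to normalize each count by the document's token total are all assumptions made for the example.

```python
from collections import Counter


def bow_vectors(docs):
    """Return (vocabulary, relative-frequency vectors, binary vectors).

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    and "relative frequency" is taken to mean count / tokens in the document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One feature per distinct term, in a fixed (sorted) order.
    vocab = sorted({term for tokens in tokenized for term in tokens})

    freq_vectors, binary_vectors = [], []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        freq_vectors.append([counts[t] / total for t in vocab])
        # Binary version: 1 if the term occurs in the document, else 0.
        binary_vectors.append([1 if counts[t] else 0 for t in vocab])
    return vocab, freq_vectors, binary_vectors


docs = ["the cat sat", "the dog sat on the mat"]
vocab, freq, binary = bow_vectors(docs)
print(vocab)
print(binary)
```

With a real corpus the vocabulary, and hence each vector, becomes very large, which is the setting in which the abstract motivates feature transformation and reduction.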

Keywords

text classification; bag-of-words (BoW)

Citation

Ferreira, A., Figueiredo, M. – Feature Transformation and Reduction for Text Classification. In 10th International Workshop on Pattern Recognition in Information Systems - PRIS 2010, in conjunction with ICEIS 2010. Funchal, Portugal: SciTePress, 2010, ISBN 978-989-8425-14-0. Pp. 72-81

Publisher

SciTePress

Collections

ISEL - Eng. Elect. Tel. Comp. - Comunicações
