Search Results
Now showing 1 - 8 of 8
- On the suitability of suffix arrays for Lempel-Ziv data compression
  Publication · J. Ferreira, Artur; Oliveira, Arlindo L.; Figueiredo, Mario
  Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used nowadays. Regarding time and memory requirements, LZ encoding is much more demanding than decoding. In order to speed up the encoding process, efficient data structures, like suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded systems) where memory is at a premium and high speed is not critical.
- Suffix Arrays: A competitive choice for fast Lempel-Ziv Compression
  Publication · J. Ferreira, Artur; Oliveira, Arlindo L.; Figueiredo, Mario
  Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used in a variety of applications. The LZ encoder and decoder exhibit a high asymmetry regarding time and memory requirements, with the former being much more demanding. Several techniques have been used to speed up the encoding process; among them is the use of suffix trees. In this paper, we explore the use of a simple data structure, named suffix array, to hold the dictionary of the LZ encoder, and propose an algorithm to search the dictionary. A comparison with the suffix tree based LZ encoder is carried out, showing that the compression ratios are roughly the same. The amount of memory required by the suffix array is fixed, being much lower than the variable memory requirements of the suffix tree encoder, which depend on the text to encode. We conclude that, regarding the trade-off between time, memory, and compression ratio, suffix arrays are a very interesting alternative to suffix trees, making them preferable in some compression scenarios.
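  The core operation in the two suffix-array LZ encoders above is finding the longest match of the look-ahead buffer inside the dictionary. Below is a minimal Python sketch of that search, assuming a naively built suffix array and Python 3.10+ (for the `key` argument of `bisect_left`); it illustrates the general idea only and is not the authors' algorithm.

```python
# Minimal sketch: longest match of the look-ahead buffer (LAB) in a dictionary
# held as a suffix array. Naive SA construction; not the papers' algorithm.
from bisect import bisect_left

def build_suffix_array(text):
    # Suffix start positions, sorted lexicographically (naive O(n^2 log n) sort).
    return sorted(range(len(text)), key=lambda i: text[i:])

def longest_match(dictionary, lab, sa):
    """Return (position, length) of the longest prefix of `lab` found in `dictionary`."""
    best_pos, best_len = 0, 0
    lo, hi = 0, len(sa)
    for k in range(1, len(lab) + 1):
        prefix = lab[:k]
        # Narrow the SA interval of suffixes whose first k symbols equal `prefix`.
        # '\uffff' acts as a sentinel larger than any symbol assumed in the dictionary.
        trunc = lambda i: dictionary[i:i + k]
        lo = bisect_left(sa, prefix, lo, hi, key=trunc)             # Python 3.10+
        hi = bisect_left(sa, prefix + "\uffff", lo, hi, key=trunc)
        if lo >= hi:                       # no dictionary suffix starts with this prefix
            break
        best_pos, best_len = sa[lo], k
    return best_pos, best_len

dictionary = "abracadabra"
sa = build_suffix_array(dictionary)
print(longest_match(dictionary, "abrac", sa))   # (0, 5): "abrac" starts at position 0
```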
- On the improvement of feature selection techniques: the fitness filter
  Publication · J. Ferreira, Artur; Figueiredo, Mario
  The need for feature selection (FS) techniques is central to many machine learning and pattern recognition problems. FS is a vast research field, and many FS techniques have been proposed in the literature and applied to quite different problems. Some of these FS techniques follow the relevance-redundancy (RR) framework to select the best subset of features. In this paper, we propose a supervised filter FS technique, named the fitness filter, which follows the RR framework and uses data discretization. This technique can be used directly on low- or medium-dimensional data, or it can be applied as a post-processing step to other FS techniques. Specifically, when used as a post-processing technique, it further reduces the dimensionality of the feature space found by common FS techniques and often improves the classification accuracy.
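  As an illustration of the relevance-redundancy idea mentioned above (not the paper's fitness filter itself), a generic RR filter can discretize the features, score relevance as mutual information with the class label, penalize redundancy against already-selected features, and pick features greedily. The sketch below assumes scikit-learn and uses hypothetical parameter choices (5 uniform bins, mean-MI redundancy).

```python
# Generic relevance-redundancy (RR) filter sketch; NOT the paper's fitness filter.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

def rr_filter(X, y, n_select, n_bins=5):
    # Discretize each feature, then greedily pick features with high relevance
    # (MI with the label) and low redundancy (mean MI with selected features).
    Xd = KBinsDiscretizer(n_bins=n_bins, encode="ordinal",
                          strategy="uniform").fit_transform(X)
    n_features = Xd.shape[1]
    relevance = np.array([mutual_info_score(Xd[:, j], y) for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        scores = []
        for j in remaining:
            redundancy = (np.mean([mutual_info_score(Xd[:, j], Xd[:, s]) for s in selected])
                          if selected else 0.0)
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage on random data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 3] + X[:, 7] > 0).astype(int)
print(rr_filter(X, y, n_select=5))
```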
- Hybrid generative/discriminative training of radial basis function networks
  Publication · J. Ferreira, Artur; Figueiredo, Mario
  We propose a new training algorithm for radial basis function networks (RBFN), which incorporates both generative (mixture-based) and discriminative (logistic) criteria. Our algorithm combines steps of the classical expectation-maximization algorithm for mixtures of Gaussians with a logistic regression step that updates (in a discriminative way) the output weights. We also describe an incremental version of the algorithm, which is robust with respect to initial conditions. Comparison of our approach with existing training algorithms, on both synthetic and real binary classification problems, shows that it achieves better performance.
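  A much-simplified sketch of the hybrid idea, assuming scikit-learn: the RBF layer is fitted generatively (EM for a Gaussian mixture) and the output weights discriminatively (logistic regression on the mixture responsibilities). The paper interleaves these updates and includes an incremental variant; here the two stages are simply run in sequence.

```python
# Simplified sketch of hybrid generative/discriminative RBFN training.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def fit_hybrid_rbfn(X, y, n_centers=5):
    gmm = GaussianMixture(n_components=n_centers, covariance_type="diag",
                          random_state=0).fit(X)           # generative (EM) step
    Phi = gmm.predict_proba(X)                             # RBF-like activations
    clf = LogisticRegression(max_iter=1000).fit(Phi, y)    # discriminative step
    return gmm, clf

def predict_hybrid_rbfn(gmm, clf, X):
    return clf.predict(gmm.predict_proba(X))

# Toy binary problem (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
gmm, clf = fit_hybrid_rbfn(X, y)
print((predict_hybrid_rbfn(gmm, clf, X) == y).mean())      # training accuracy
```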
- Does nonlinear modeling play a role in plasmid bioprocess monitoring using Fourier transform infrared spectra?
  Publication · B. Lopes, Marta; Calado, Cecília; Figueiredo, Mario; Bioucas-Dias, Jose
  The monitoring of biopharmaceutical products using Fourier transform infrared (FT-IR) spectroscopy relies on calibration techniques involving the acquisition of spectra of bioprocess samples along the process. The most commonly used method for that purpose is partial least squares (PLS) regression, under the assumption that a linear model is valid. Despite being successful in the presence of small nonlinearities, linear methods may fail in the presence of strong nonlinearities. This paper studies the potential usefulness of nonlinear regression methods for predicting, from in situ near-infrared (NIR) and mid-infrared (MIR) spectra acquired in high-throughput mode, biomass and plasmid concentrations in Escherichia coli DH5-α cultures producing the plasmid model pVAX-LacZ. The linear methods PLS and ridge regression (RR) are compared with their kernel (nonlinear) versions, kPLS and kRR, as well as with the (also nonlinear) relevance vector machine (RVM) and Gaussian process regression (GPR). For the systems studied, RR provided better predictive performance than the remaining methods. Moreover, whenever no difference in predictive accuracy between a linear method and its kernelized version was found, the results point to further investigation based on larger data sets. The use of nonlinear methods should, in any case, be weighed against the additional computational cost of tuning their extra parameters, especially when the less computationally demanding linear methods studied here are able to successfully monitor the variables of interest.
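  To make the RR versus kRR comparison concrete, here is a hedged sketch using scikit-learn on synthetic data standing in for the FT-IR spectra (the actual study uses measured NIR/MIR spectra and concentrations; none of that data is reproduced here, and the hyperparameters shown are arbitrary).

```python
# Sketch: linear ridge regression (RR) vs. its kernelized version (kRR)
# on synthetic "spectra"; illustration only, not the paper's data or setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))                            # 80 samples x 200 "wavenumbers"
y = X[:, :10].sum(axis=1) + 0.1 * rng.normal(size=80)     # mostly linear target

for name, model in [("RR", Ridge(alpha=1.0)),
                    ("kRR (RBF kernel)", KernelRidge(alpha=1.0, kernel="rbf", gamma=1e-3))]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {score:.3f}")
```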
- Feature transformation and reduction for text classification
  Publication · J. Ferreira, Artur; Figueiredo, Mario
  Text classification is an important tool for many applications, in supervised, semi-supervised, and unsupervised scenarios. In order to be processed by machine learning methods, a text (document) is usually represented as a bag-of-words (BoW). A BoW is a large vector of features (usually stored as floating point values), which represent the relative frequency of occurrence of a given word/term in each document. Typically, we have a large number of features, many of which may be non-informative for classification tasks; thus, the need for feature transformation, reduction, and selection arises. In this paper, we propose two efficient algorithms for feature transformation and reduction for BoW-like representations. The proposed algorithms rely on a simple statistical analysis of the input pattern, exploiting the BoW and its binary version. The algorithms are evaluated with support vector machine (SVM) and AdaBoost classifiers on standard benchmark datasets. The experimental results show the adequacy of the reduced/transformed binary features for text classification problems, as well as the improvement in test set error rate obtained with the proposed methods.
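  A generic sketch of this kind of pipeline (not the paper's two algorithms): binarize the BoW, reduce features with a simple per-feature statistic (here, the variance of the binary feature), and train a linear SVM on the result. The toy documents and labels are invented for illustration.

```python
# Generic sketch: binary BoW + simple statistical feature reduction + linear SVM.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["the cat sat on the mat", "dogs chase cats around the yard",
        "stocks fell sharply today", "the market rallied on strong earnings"]
labels = [0, 0, 1, 1]                                   # 0 = pets, 1 = finance (toy labels)

X = CountVectorizer(binary=True).fit_transform(docs).toarray()   # binary BoW

p = X.mean(axis=0)                        # fraction of documents containing each term
variance = p * (1 - p)                    # variance of a binary (Bernoulli) feature
keep = np.argsort(variance)[::-1][:10]    # keep the 10 highest-variance terms
X_red = X[:, keep]

clf = LinearSVC().fit(X_red, labels)
print(clf.predict(X_red))                 # sanity check on the training documents
```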
- Boosting of (very) weak learners
  Publication · J. Ferreira, Artur; Figueiredo, Mário
  In this paper we apply boosting to weak (binary) learners. The main idea is to combine the output of several simple learners in order to obtain a better classifier. As weak learners we consider generative classifiers and radial basis function classifiers. Our tests on synthetic data show that the proposed algorithm has good convergence properties. On benchmark data, boosting of these weak learners attains results close to the well-known Real AdaBoost algorithm (with decision trees) and support vector machines, constituting a low-complexity competitive choice.
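  A hedged sketch of the idea, assuming scikit-learn: AdaBoost is run with a simple generative weak learner (Gaussian naive Bayes, standing in for the paper's generative and RBF weak learners) and compared with boosted decision stumps on a synthetic problem.

```python
# Sketch: boosting a simple generative weak learner vs. boosted decision stumps.
# Uses scikit-learn >= 1.2 ('estimator='; older versions use 'base_estimator=').
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "boosted GaussianNB": AdaBoostClassifier(estimator=GaussianNB(),
                                              n_estimators=50, random_state=0),
    "boosted stumps": AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                          n_estimators=50, random_state=0),
}
for name, clf in models.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())   # mean CV accuracy
```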
- On the use of suffix arrays for memory-efficient Lempel-Ziv data compression
  Publication · J. Ferreira, Artur; Oliveira, Arlindo L.; Figueiredo, Mario
  The Lempel-Ziv 77 (LZ77) and LZ-Storer-Szymanski (LZSS) text compression algorithms use a sliding window over the sequence of symbols, with two sub-windows: the dictionary (symbols already encoded) and the look-ahead buffer (LAB) (symbols not yet encoded). Binary search trees and suffix trees (ST) have been used to speed up the search of the LAB over the dictionary, at the expense of high memory usage [1]. A suffix array (SA) is a simpler, more compact data structure which uses (much) less memory [2,3] to hold the same information. The SA for a length-m string is an array of integers (a[1], ..., a[k], ..., a[m]) in which a[k] is the starting position of the k-th suffix of the string in lexicographic order; sub-string searching, as used in LZ77/LZSS, is done by searching the SA.
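  A minimal sketch of the data structure itself, in the notation above: the SA lists suffix start positions in lexicographic order, and a sub-string query becomes a binary search over it (naive construction; Python 3.10+ for the `key` argument of `bisect_left`; not the paper's implementation).

```python
# Minimal suffix array sketch: lexicographic order of suffixes + binary-search lookup.
from bisect import bisect_left

def suffix_array(s):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, pattern):
    # First suffix >= pattern; the pattern occurs iff that suffix starts with it.
    idx = bisect_left(sa, pattern, key=lambda i: s[i:])   # Python 3.10+ 'key'
    return idx < len(sa) and s[sa[idx]:sa[idx] + len(pattern)] == pattern

text = "mississippi"
sa = suffix_array(text)
print(sa)                                 # suffix start positions in lexicographic order
print(contains(text, sa, "ssip"))         # True  ("ssip" occurs at position 5)
print(contains(text, sa, "ssix"))         # False
```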