Using Genetic Algorithms for Feature Selection and Weighting in Character
Recognition System
Nawwaf Kharma, Faten Hussein, Rabab Ward
1 Introduction
Computer-based
pattern recognition is a process that involves several
sub-processes, including pre-processing, feature extraction, classification, and post-processing (Kharma
& Ward, 1999). Pre-processing encompasses all those functions that
prepare an input pattern for effective and efficient extraction of
relevant features. Feature extraction is the measurement of certain
attributes of the target pattern (e.g., the coordinates of the
centre of gravity). Classification utilizes the values of these
attributes to assign a class to the input pattern. In our view, the
selection and weighting of the right set of features is the hardest
part of building a pattern recognition system. The ultimate aim of
our research work is the automation of the process of feature
selection and weighting, within the context of character/symbol
recognition systems. Our chosen method of automation is Genetic
Algorithms (see section 3.1 for justification).
Genetic
Algorithms (GAs) have been used for feature selection and weighting
in many pattern recognition applications (e.g., texture
classification and medical diagnostics). However, their use in
feature selection (let alone weighting) in character recognition
applications has been infrequent. This fact is made clear in section
2. Recently, the authors have demonstrated that GAs can, in
principle, be used to configure the real-valued weights of a
classifier component of a character recognition system in a
near-optimal way (Hussein, Kharma & Ward, 2001). This study
subsumes and further expands upon that effort.
Here, we carry
out two sets of
studies, which in turn produce some unexpected but justified
results. The first set (section 4.1) compares the performance of
GA-based feature selection to GA-based feature weighting, under
various conditions. The second set of studies (section 4.2) evaluates
the performance of the better method (which turns out to be feature
selection) in terms of optimality and time. The penultimate part of
this paper (section 5) summarizes the lessons learnt from this
research effort. The most important conclusions are that (a) feature
set selection (or pruning) prior to classification is essential for
k-nearest neighbour classifiers, in the presence of redundant or
irrelevant features; and (b) that GAs are effective methods for
feature selection: they (almost always) find the optimal feature
subsets and do so within a small fraction of the time required for
an exhaustive search. The question of how well our method will
scale up to high-dimensional feature spaces remains an open
problem. This, as well as other problems appropriate for future
research, is listed in section 6.
The following sections (1.1 and 1.2) provide the technical reader with an
introduction to two directly related areas necessary for the
appreciation of the rest of the paper.
1.1 Instance Based Learning Algorithms
Instance based
learning algorithms are a class of supervised machine learning
algorithms. These algorithms do not construct abstract concepts, but
rather base their classification of new instances on their
similarity to specific training instances (Aha, 1992). Old training
instances are stored in memory, and classification is postponed
until new instances are received by the classifier. When a new
instance is received, older instances similar in some respects to it
are retrieved from memory and used to classify the new instance.
Instance based learning algorithms have the advantages of being able
to (a) learn complex target concepts (e.g. functions); and (b)
estimate target concepts distinctly for each new instance. In
addition, their training is very fast and simple: it only requires
storing all the training instances in memory. In contrast, the cost
of classifying new instances can be high because every new instance
is compared to every training instance. Hence, efficient indexing of
training instances is important. Another disadvantage of these
learning algorithms is that their classification accuracy degrades
significantly in the presence of noise (in training instances).
One well-known and widely used instance based algorithm is the
k-nearest neighbour algorithm (Dasarathy, 1991). The function it
uses for measuring similarity between two instances is based on the
Euclidean distance:
D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (Eq. 1)

where D is the distance, x and y are two instances, x_i and y_i are
the i-th attributes of instances x and y, and n is the total number
of features. To compensate for the difference in units between
features, normalization should be performed. This often scales all
features to a range between 0 and 1, inclusive.
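As a concrete illustration, here is a minimal Python sketch of Eq. 1 together with the min-max normalization step just described; the function names and the use of NumPy are our own choices, not part of any particular kNN implementation.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (X - lo) / span

def euclidean_distance(x, y):
    """Eq. 1: D(x, y) = sqrt(sum over i of (x_i - y_i)^2)."""
    return np.sqrt(np.sum((x - y) ** 2))
```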
1.2 Feature Selection and Feature Weighting
One major drawback of the Euclidean distance function is its
sensitivity to noise and, in particular, to redundant or irrelevant
features. This is
because it treats all features of an instance (relevant or not) as equally
important to its successful classification. A possible remedy is to assign
weights to features. The weights can then be used to reflect the relative
relevance of their respective features to correct classification. Highly
relevant features would be assigned high weights relative to the weights of
redundant or irrelevant features. Taking that into account, the Euclidean
distance measure can be refined:
D(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}    (Eq. 2)

where w_i is the weight of the i-th feature.
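A matching sketch of the weighted distance, again in illustrative Python; note that restricting the weights to {0, 1} reduces Eq. 2 to Eq. 1 computed over the selected features only, which anticipates the discussion of feature selection below.

```python
import numpy as np

def weighted_euclidean_distance(x, y, w):
    """Eq. 2: D(x, y) = sqrt(sum over i of w_i * (x_i - y_i)^2).

    With real-valued w this performs feature weighting; with w
    restricted to {0, 1} it degenerates to plain feature selection.
    """
    return np.sqrt(np.sum(w * (x - y) ** 2))
```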
In feature
weighting, the weights can hold any value in a continuous range of
values (e.g. [0,1]). The purpose of feature weighting is to find a
vector of real-valued weights that would optimize classification
accuracy of some classification or recognition system. Feature
selection is different. Given a set of n features, feature
selection aims to find a subset of m features (where m<n)
that gives the highest classification accuracy. This means that
weights can either equal '0' for 'not selected', or '1' for
'selected'. Though both feature selection and weighting seek to
enhance classification accuracy, only feature selection has the
(real) potential of reducing problem dimensionality (by assigning
'0' weights to features). Feature weighting, in contrast, almost
always assigns irrelevant or redundant features small (but still
non-zero) weights. Completely eliminating highly irrelevant and
redundant features can, in itself, enhance classification accuracy.
Nevertheless, feature selection may be regarded as
a special case of feature weighting. The trick is to find a way to
use feature weighting to both (a) assign weights to relevant
features, in a way that reflects their relative relevance, and (b)
completely eliminate highly irrelevant and redundant features from
the original set of candidate features. We discuss such a method in
section 4.1.1. However, the results obtained have not been
conclusive (either for or against the proposed method).
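One plausible realization of this idea is to zero out all weights that fall below a cut-off, so that weighting and elimination happen in a single step. The sketch below is only illustrative: the threshold value and the exact pruning rule are our assumptions, not the scheme evaluated in section 4.1.1.

```python
import numpy as np

def prune_weights(w, threshold=0.1):
    """Turn a real-valued weight vector into one that also performs
    feature selection: weights below `threshold` become exactly zero,
    eliminating near-irrelevant features outright. The threshold of
    0.1 is an illustrative assumption.
    """
    w = np.asarray(w, dtype=float)
    return np.where(w < threshold, 0.0, w)
```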
2 Literature Review
(Siedlecki & Sklansky, 1988) is probably the first paper to suggest
the use of genetic algorithms for feature selection, and several
other researchers have since used them for this purpose. However,
examples of character recognition applications that employ GAs to
satisfy their feature selection needs are rare in the literature.
This fact becomes particularly pronounced when one looks at
the steadily increasing number of GA-based feature selection (GFS)
and weighting (GFW) applications in pattern classification domains.
Below is a list of these studies, classified according to the type
of input (printed or handwritten). We also include GFS applications
to signature verification due to the similarity of the problem
characteristics.
2.1 Recognition of Printed Characters
(Smith, Fogarty &
Johnson, 1994) applied GFS to printed letter recognition. They used
24 features (16 features, plus 8 redundant features) to describe the
26 letters of the English alphabet. A nearest neighbour classifier
using Euclidean distance was used for classification. To speed up
the GA run, a sub-sampling method is used: 10% of the training data
is randomly sampled at the beginning of each run, and only this
subset is used for GA evaluation. The system reduces the
feature set to 10 features. It does so while maintaining a mean
error rate lower than that generated when using all 24 features.
2.2 Recognition of Handwritten Characters
A handwritten digit recognizer presented by (Kim & Kim, 2000) is
used to assess a GFS algorithm. During the training phase, the
recognizer
performs clustering to obtain a KxP dimension codebook (K
is the number of clusters and P is the number of features)
that represents the centroid of the K clusters. During the
testing phase, a matching process performs a distance calculation
between the centroids and the testing data. The objective is to use
GFS to speed up the matching process as well as to reduce the size
of the codebook. Testing was carried out using two datasets: one
with 74 features and another with 416 features. In the 74-feature
test, experimental results show a trivial decrease in recognition
rate when the number of features is lowered. However, in the
416-feature test, the GA-selected set of features leads to a higher
recognition rate than the original set does. This result emphasizes
the usefulness of GFS in large search spaces.
In addition, Kim and Kim propose a variable weight method to assign weights to
features in the matching process. During GA feature selection, a
weight matrix for the features is built, which represents how often
the feature is selected throughout the GFS. After the GFS is
complete, the matrix is used in the recognition module. Features
having high weights denote more frequently selected features, which
implies that they are more important (i.e. relevant to
classification) than low weighted features. Results using this
variable weight method show a slight improvement in performance over
the un-weighted method. One important observation is that this
method of weighting features is completely different from the one we
use here. Their method depends on counting the frequencies with
which features are selected during GFS, while in our approach the
weights are assigned by the GA itself. A major drawback of the
weight matrix method is that it does not achieve any reduction in
dimensionality: the number of features remains the same. Also, the
enhancement in accuracy is very small: whereas the non-weighted
method has an accuracy of 96.3%, the suggested weighting method
achieves 96.4%.
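For clarity, a short sketch of the frequency-counting idea as we read it from (Kim & Kim, 2000); the data layout and function name below are our assumptions, not the paper's notation.

```python
import numpy as np

def selection_frequency_weights(selected_masks):
    """Weight of a feature = fraction of GFS solutions that selected it.

    `selected_masks` is assumed to be a (solutions x features) 0/1
    array collected over the course of the GFS run; frequently
    selected (i.e. more relevant) features receive higher weights.
    """
    return np.asarray(selected_masks, dtype=float).mean(axis=0)
```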
Furthermore, (Gaborski
& Anderson, 1993) use a GA for feature selection for a hand-written
digit recognition system. They used several variations for
population organization, parent selection and child creation. The
GA proved capable of pruning the feature set from 1500 to 300
features, while maintaining the level of accuracy achieved with the
original set. Moreover, (Shi, Shu & Liu, 1998) suggest
a GFS for handwritten Chinese character recognition. They craft a
fitness function that is based on the transformed divergence among
classes, which is derived from Mahalanobis distance. The goal is to
select a subset of m features from the original set of n
features (where m<n) for which the error is minimized.
Starting with 64 features, the algorithm is able to reduce the set
to 26 features, with a lower error rate than the original feature set.
Finally,
(Moser & Murty, 2000) investigate the use of Distributed Genetic
Algorithms in very large-scale feature selection (where the number
of features is larger than 500). Starting with 768 initial features,
a 1-nearest-neighbour classifier is used to successfully recognize
handwritten digits using 30 SUN workstations. The fitness function
used is a polynomial punishment function, which utilizes both
classification accuracy and the number of selected features. The
punishment factor is used to guide the search towards regions of
lower complexity. The experiments are aimed at demonstrating the
scalability of GFS to very large domains. The researchers were able
to reduce the number of features by approximately 50% while having
classification accuracies comparable to those of the full feature
set.
2.3 Signature Verification
(Fung, Liu & Lau, 1996) use a GA to
reduce the number of features required to achieve a minimum acceptable
hit-rate in a signature verification system. The goal is to search for a
minimal number of features that does not degrade the classifier's
performance beyond a certain limit. They use the same fitness function
as proposed by (Siedlecki & Sklansky, 1989), which is based on a penalty
formula and the number of features selected. The penalty function apportions
punishment values to feature sets that produce error rates greater
than a pre-defined threshold. Using a 91-feature set to describe 320
handwritten signatures from 32 different persons, the system was able to
achieve an accuracy of 93% with only 7 selected features, as opposed to
an 88.4% accuracy using the whole 91-feature set.
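Both (Moser & Murty, 2000) and (Fung, Liu & Lau, 1996) steer the GA with fitness functions that punish undesirable feature sets. The generic sketch below is our own illustration of the idea: the linear penalty and the coefficient alpha are assumptions, not the polynomial or threshold-based formulas of those papers.

```python
def penalized_fitness(accuracy, n_selected, n_total, alpha=0.05):
    """Generic penalty-style fitness: reward classification accuracy,
    punish large feature subsets. The linear form and alpha = 0.05
    are illustrative; the cited papers use a polynomial punishment
    function and a threshold-based penalty formula, respectively.
    """
    return accuracy - alpha * (n_selected / n_total)
```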
Several points
are clear from the above review: (a) Genetic Algorithms are effective
tools for reducing the feature-dimensionality of character (or
signature) recognition problems, while maintaining a high level of
accuracy (of recognition); (b) GFS has only been used off-line due
to the time it takes to run a GA in the wrapper approach; and
finally, (c) there is no published work that focuses on the more
complicated problem of automatic feature weighting using GA for
character recognition applications.
3 Approach and Platform
3.1 Conceptual Approach
Several feature weighting methods for instance based learning
algorithms exist. For example, the weighting can be global, meaning
that there is a single weight set for the classification task, or it
can be local, in which weights vary over local regions of the
instance space (Howe & Claire, 1997). Moreover, the weights can have
continuous real values or binary values. In addition, the method for
assigning weights can be guided by the classifier performance (the
wrapper approach) or not (the filter approach). For an
extensive review, (Wettschereck, Aha & Mohri, 1997) provide a
five-dimensional framework that categorizes different feature
weighting methods. Also, (Dash & Liu, 1997) categorize different
feature selection methods according to the search technique used and
the evaluation function.
We are obliged to use a wrapper or embedded approach, despite their
relative computational inefficiency, because the classifier is
needed to determine the relevance of features. In any case, a classifier
is necessary, and the k nearest neighbour (kNN) classifier is our
first choice because of its excellent asymptotic accuracy,
simplicity, speed of training, and its wide use (by researchers in
the area). GAs are appropriate for feature selection and weighting.
GAs are suitable for large-scale, non-linear problems that involve
systems which are vaguely defined. Further, character recognition
systems (a) often use a large number of features, (b) exhibit a high
degree of inter-feature dependency, and (c) are hard, if not
impossible, to define analytically. Recent empirical evidence shows
that GAs are more effective than sequential forward and backward
floating search techniques in small and medium scale problems
(Kudo & Sklansky, 2000).
We built a
pattern recognition experimental bench with the following modules:
(a) an evaluation module; (b) a classifier module; (c) a feature
extraction and weighting module (FEW); and (d) a GA optimization
module. The evaluation module is essentially the fitness function.
The classifier module implements the kNN classifier. The FEW module
applies certain functions that measure a select set of features, and
can assign weights to the selected features, before presenting them
to the classification module. The FEW module is configured by the
last (and most important) module: the GA optimizer (or simply GA).
The purpose of the GA is to find a set of features and associated
weights that will optimize the (overall) performance of the pattern
recognition system, as measured by a given fitness function.
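The wrapper evaluation at the heart of this design can be sketched as follows: the GA hands the FEW module a candidate weight vector, and the fitness is simply the accuracy of the kNN classifier under the correspondingly weighted distance. This minimal Python sketch (with k = 1, and assuming pre-normalized data) uses our own function and variable names, not the actual interfaces of our experimental bench.

```python
import numpy as np

def weighted_1nn_accuracy(train_X, train_y, test_X, test_y, w):
    """Fitness of a candidate weight vector w: classify every test
    instance with a 1-NN rule under the weighted Euclidean distance
    of Eq. 2, and return the fraction classified correctly.
    """
    correct = 0
    for x, label in zip(test_X, test_y):
        # Weighted squared distances from x to every training instance.
        d = np.sum(w * (train_X - x) ** 2, axis=1)
        if train_y[np.argmin(d)] == label:
            correct += 1
    return correct / len(test_y)
```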
3.2 Experimental Platform
The
experimental work described in the following sections is implemented
using publicly accessible character databases and GAs. This makes it
possible for researchers to replicate our work and attempt to
improve upon our results. The experiments were run on a Pentium III
(500MHz) PC with 128MB of RAM and running Windows 98.
The Machine
Learning Data Repository at the University of California at Irvine
(Murphy & Aha, 1994) is our source of databases. We used three
different handwritten digits databases:
· "Optical Recognition of Handwritten Digits" database (or DB1),
which consists of 64 features.
· "Pen-Based Recognition of Handwritten Digits" (or DB2), which
consists of 16 features.
· "Multiple Features Database" (or DB3), which is divided into six
feature sets. Of those, we have only used the last two feature
sets, which contain 47 and 6 features, respectively.
The Genetic
Algorithm used for optimization is the Simple Genetic Algorithm (or
SGA) described by Goldberg (Goldberg, 1989). The actual software
implementation used comes from the "GAlib" GA library provided by
the Massachusetts Institute of Technology (Matthew, 1999). In this
SGA we used non-overlapping populations, roulette-wheel selection
with a degree of elitism, as well as two-point crossover. Unless
stated otherwise below, the GA parameters are as follows: the
crossover probability Pc was 0.9. As for mutation, we used two styles:
flip mutation for GFS, and Gaussian mutation
for GFW. Gaussian
mutation uses a bell-curve around the mutated value to determine the
random new value. Under this bell-shaped area, values that are
closer to the current value are more likely to be selected than
values that are farther away. The mutation probability Pm was 0.02,
the number of generations Ng was 50, and the population size Pop was also 50. The
fitness function was the classification accuracy of our own
1-nearest-neighbour (1-NN) classifier.
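To make the configuration concrete, the following plain-Python sketch mirrors the SGA set-up described above (non-overlapping population, roulette-wheel selection with one elite, two-point crossover at Pc = 0.9, flip or Gaussian mutation at Pm = 0.02, Pop = Ng = 50). Our actual experiments used the C++ GAlib implementation; everything below, including the Gaussian sigma, is an illustrative stand-in.

```python
import random

POP, NGEN, PC, PM = 50, 50, 0.9, 0.02  # parameters from the text

def roulette_select(pop, fits):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its fitness."""
    r, acc = random.uniform(0, sum(fits)), 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def two_point_crossover(a, b):
    """Swap the segment between two random cut points."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def flip_mutate(bits):
    """GFS-style flip mutation on a binary chromosome."""
    return [1 - g if random.random() < PM else g for g in bits]

def gaussian_mutate(w, sigma=0.1):
    """GFW-style Gaussian mutation: perturb a weight with a bell curve
    centred on its current value, clipped to [0, 1]. The sigma value
    is an illustrative assumption."""
    return [min(1.0, max(0.0, g + random.gauss(0.0, sigma)))
            if random.random() < PM else g for g in w]

def run_sga(fitness, n_features, mutate=flip_mutate):
    """Non-overlapping SGA with a degree of elitism (the best
    individual is copied unchanged into the next generation)."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(POP)]
    for _ in range(NGEN):
        fits = [fitness(ind) for ind in pop]
        nxt = [pop[fits.index(max(fits))][:]]  # elitism
        while len(nxt) < POP:
            p1, p2 = roulette_select(pop, fits), roulette_select(pop, fits)
            if random.random() < PC:
                p1, p2 = two_point_crossover(p1, p2)
            nxt += [mutate(p1), mutate(p2)]
        pop = nxt[:POP]
    return max(pop, key=fitness)
```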