Using Genetic Algorithms for Feature Selection and Weighting in Character
Recognition System
Nawwaf Kharma, Faten Hussein, Rabab Ward
1 Introduction
Computer-based
pattern recognition is a process that involves several
sub-processes, including pre-processing, feature extraction, classification, and post-processing (Kharma
& Ward, 1999). Pre-processing encompasses all those functions that
prepare an input pattern for effective and efficient extraction of
relevant features. Feature extraction is the measurement of certain
attributes of the target pattern (e.g., the coordinates of the
centre of gravity). Classification utilizes the values of these
attributes to assign a class to the input pattern. In our view, the
selection and weighting of the right set of features is the hardest
part of building a pattern recognition system. The ultimate aim of
our research work is the automation of the process of feature
selection and weighting, within the context of character/symbol
recognition systems. Our chosen method of automation is Genetic
Algorithms (see section 3.1 for justification).
Genetic
Algorithms (GAs) have been used for feature selection and weighting
in many pattern recognition applications (e.g., texture
classification and medical diagnostics). However, their use in
feature selection (let alone weighting) in character recognition
applications has been infrequent. This fact is made clear in section
2. Recently, the authors have demonstrated that GAs can, in
principle, be used to configure the real-valued weights of a
classifier component of a character recognition system in a
near-optimal way (Hussein, Kharma & Ward, 2001). This study
subsumes and further expands upon that effort.
Here, we carry
out two sets of
studies, which in turn produce some unexpected but justified
results. The first set (section 4.1) compares the performance of
GA-based feature selection to GA-based feature weighting, under
various conditions. The second set of studies (section 4.2) evaluates
the performance of the better method (which turns out to be feature
selection) in terms of optimality and time. The penultimate part of
this paper (section 5) summarizes the lessons learnt from this
research effort. The most important conclusions are that (a) feature
set selection (or pruning) prior to classification is essential for
k-nearest neighbour classifiers, in the presence of redundant or
irrelevant features; and (b) that GAs are effective methods for
feature selection: they (almost always) find the optimal feature
subsets and do so within a small fraction of the time required for
an exhaustive search. The question of how well our method will
scale up to high-dimensional feature spaces remains an open
problem. This, as well as other problems appropriate for future
research, is listed in section 6.
The following sections (1.1 and 1.2) provide the technical reader with an
introduction to two directly related areas necessary for the
appreciation of the rest of the paper.
1.1 Instance Based Learning Algorithms
Instance based
learning algorithms are a class of supervised machine learning
algorithms. These algorithms do not construct abstract concepts, but
rather base their classification of new instances on their
similarity to specific training instances (Aha, 1992). Old training
instances are stored in memory, and classification is postponed
until new instances are received by the classifier. When a new
instance is received, older instances similar in some respects to it
are retrieved from memory and used to classify the new instance.
Instance based learning algorithms have the advantages of being able
to (a) learn complex target concepts (e.g. functions); and (b)
estimate target concepts distinctly for each new instance. In
addition, their training is very fast and simple: it only requires
storing all the training instances in memory. In contrast, the cost
of classifying new instances can be high because every new instance
is compared to every training instance. Hence, efficient indexing of
training instances is important. Another disadvantage of these
learning algorithms is that their classification accuracy degrades
significantly in the presence of noise (in training instances).
One well-known and widely used instance based algorithm is the
k-nearest neighbour algorithm (Dasarathy, 1991). The function it
uses for measuring similarity between two instances is based on the
Euclidean distance:
D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (Eq. 1)

where D is the distance, x and y are two instances, x_i and y_i are
the i-th attributes of instances x and y, and n is the total number
of features. To compensate for the difference in units between
features, normalization should be performed. This often scales all
features to a range between 0 and 1, inclusive.
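As a concrete illustration, here is a minimal Python sketch of Eq. 1 together with the min-max normalization step just described; the function names and the use of NumPy are our own choices, not part of any particular kNN implementation.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (X - lo) / span

def euclidean_distance(x, y):
    """Eq. 1: D(x, y) = sqrt(sum over i of (x_i - y_i)^2)."""
    return np.sqrt(np.sum((x - y) ** 2))
```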
1.2 Feature Selection and Feature Weighting
One major drawback of the Euclidean distance function is its
sensitivity to noise and, in particular, to redundant or irrelevant
features. This is
because it treats all features of an instance (relevant or not) as equally
important to its successful classification. A possible remedy is to assign
weights to features. The weights can then be used to reflect the relative
relevance of their respective features to correct classification. Highly
relevant features would be assigned high weights relative to the weights of
redundant or irrelevant features. Taking that into account, the Euclidean
distance measure can be refined:
D(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}    (Eq. 2)

where w_i is the weight of the i-th feature.
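A matching sketch of the weighted distance, again in illustrative Python; note that restricting the weights to {0, 1} reduces Eq. 2 to Eq. 1 computed over the selected features only, which anticipates the discussion of feature selection below.

```python
import numpy as np

def weighted_euclidean_distance(x, y, w):
    """Eq. 2: D(x, y) = sqrt(sum over i of w_i * (x_i - y_i)^2).

    With real-valued w this performs feature weighting; with w
    restricted to {0, 1} it degenerates to plain feature selection.
    """
    return np.sqrt(np.sum(w * (x - y) ** 2))
```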
In feature
weighting, the weights can hold any value in a continuous range of
values (e.g. [0,1]). The purpose of feature weighting is to find a
vector of real-valued weights that would optimize classification
accuracy of some classification or recognition system. Feature
selection is different. Given a set of n features, feature
selection aims to find a subset of m features (where m<n)
that gives the highest classification accuracy. This means that
weights can either equal '0' for 'not selected', or '1' for
'selected'. Though both feature selection and weighting seek to
enhance classification accuracy, only feature selection has the
(real) potential of reducing problem dimensionality (by assigning
'0' weights to features). Feature weighting, in contrast, almost
always assigns irrelevant or redundant features small (but still
non-zero) weights. Completely eliminating highly irrelevant and
redundant features can, in itself, enhance classification accuracy.
Nevertheless, feature selection may be regarded as
a special case of feature weighting. The trick is to find a way to
use feature weighting to both (a) assign weights to relevant
features, in a way that reflects their relative relevance, and (b)
completely eliminate highly irrelevant and redundant features from
the original set of candidate features. We discuss such a method in
section 4.1.1. However, the results obtained have not been
conclusive (either for or against the proposed method).
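One plausible realization of this idea is to zero out all weights that fall below a cut-off, so that weighting and elimination happen in a single step. The sketch below is only illustrative: the threshold value and the exact pruning rule are our assumptions, not the scheme evaluated in section 4.1.1.

```python
import numpy as np

def prune_weights(w, threshold=0.1):
    """Turn a real-valued weight vector into one that also performs
    feature selection: weights below `threshold` become exactly zero,
    eliminating near-irrelevant features outright. The threshold of
    0.1 is an illustrative assumption.
    """
    w = np.asarray(w, dtype=float)
    return np.where(w < threshold, 0.0, w)
```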
2 Literature Review
(Siedlecki & Sklansky, 1988) is probably the first paper to suggest
the use of genetic algorithms for feature selection, and several
other researchers have since used them for this purpose. However,
examples of character recognition applications that employ GAs to
satisfy their feature selection needs are rare in the literature.
This fact becomes particularly pronounced when one looks at
the steadily increasing number of GA-based feature selection (GFS)
and weighting (GFW) applications in pattern classification domains.
Below is a list of these studies, classified according to the type
of input (printed or handwritten). We also include GFS applications
to signature verification due to the similarity of the problem
characteristics.
2.1 Recognition of Printed Characters
(Smith, Fogarty &
Johnson, 1994) applied GFS to printed letter recognition. They used
24 features (16 features, plus 8 redundant features) to describe the
26 letters of the English alphabet. A nearest neighbour classifier
using Euclidean distance was used for classification. To speed up
the GA run, a sub-sampling method is used: 10% of the training data
is randomly sampled at the beginning of each run, and only this
subset is used for GA evaluation. The system reduces the
feature set to 10 features. It does so while maintaining a mean
error rate lower than that generated when using all 24 features.
2.2 Recognition of Handwritten Characters
A handwritten digit recognizer presented by (Kim & Kim, 2000) is
used to assess a GFS algorithm. During the training phase, the
recognizer
performs clustering to obtain a KxP dimension codebook (K
is the number of clusters and P is the number of features)
that represents the centroid of the K clusters. During the
testing phase, a matching process performs a distance calculation
between the centroids and the testing data. The objective is to use
GFS to speed up the matching process as well as to reduce the size
of the codebook. Testing was carried out using two datasets: one
with 74 features and another with 416 features. In the 74-feature
test, experimental results show a trivial decrease in recognition
rate when the number of features is lowered. However, in the
416-feature test, the GA-selected set of features leads to a higher
recognition rate than the original set does. This result emphasizes
the usefulness of GFS in large search spaces.
In addition, Kim and Kim propose a variable weight method to assign weights to
features in the matching process. During GA feature selection, a
weight matrix for the features is built, which represents how often
the feature is selected throughout the GFS. After the GFS is
complete, the matrix is used in the recognition module. Features
having high weights denote more frequently selected features, which
implies that they are more important (i.e. relevant to
classification) than low weighted features. Results using this
variable weight method show a slight improvement in performance over
the un-weighted method. One important observation is that this
method of weighting features is completely different from the one we
use here. Their method depends on counting the frequencies with
which features are selected during GFS, while in our approach the
weights are assigned by the GA itself. A major drawback of the
weight matrix method is that it does not achieve any reduction in
dimensionality: the number of features remains the same. Also, the
enhancement in accuracy is very small: whereas the non-weighted
method has an accuracy of 96.3%, the suggested weighting method
achieves 96.4%.
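For clarity, a short sketch of the frequency-counting idea as we read it from (Kim & Kim, 2000); the data layout and function name below are our assumptions, not the paper's notation.

```python
import numpy as np

def selection_frequency_weights(selected_masks):
    """Weight of a feature = fraction of GFS solutions that selected it.

    `selected_masks` is assumed to be a (solutions x features) 0/1
    array collected over the course of the GFS run; frequently
    selected (i.e. more relevant) features receive higher weights.
    """
    return np.asarray(selected_masks, dtype=float).mean(axis=0)
```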
Furthermore, (Gaborski
& Anderson, 1993) use a GA for feature selection for a hand-written
digit recognition system. They used several variations for
population organization, parent selection and child creation. The
GA proved capable of pruning the feature set from 1500 to 300
features, while maintaining the level of accuracy achieved with the
original set. Moreover, (Shi, Shu & Liu, 1998) suggest
a GFS for handwritten Chinese character recognition. They craft a
fitness function that is based on the transformed divergence among
classes, which is derived from Mahalanobis distance. The goal is to
select a subset of m features from the original set of n
features (where m<n) for which the error is minimized.
Starting with 64 features, the algorithm is able to reduce the set
to 26 features, with a lower error rate than the original feature set.
Finally,
(Moser & Murty, 2000) investigate the use of Distributed Genetic
Algorithms in very large-scale feature selection (where the number
of features is larger than 500). Starting with 768 initial features,
a 1-nearest-neighbour classifier is used to successfully recognize
handwritten digits using 30 SUN workstations. The fitness function
used is a polynomial punishment function, which utilizes both
classification accuracy and the number of selected features. The
punishment factor is used to guide the search towards regions of
lower complexity. The experiments are aimed at demonstrating the
scalability of GFS to very large domains. The researchers were able
to reduce the number of features by approximately 50% while having
classification accuracies comparable to those of the full feature
set.
2.3 Signature Verification
(Fung, Liu & Lau, 1996) use a GA to
reduce the number of features required to achieve a minimum acceptable
hit-rate in a signature verification system. The goal is to search for a
minimal number of features that does not degrade the classifier's
performance beyond a certain limit. They use the same fitness function
as proposed by (Siedlecki & Sklansky, 1989), which is based on a penalty
formula and the number of features selected. The penalty function apportions
punishment values to feature sets that produce error rates greater
than a pre-defined threshold. Using a 91-feature set to describe 320
handwritten signatures from 32 different persons, the system was able to
achieve an accuracy of 93% with only 7 selected features, as opposed to
an 88.4% accuracy using the whole 91-feature set.
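Both (Moser & Murty, 2000) and (Fung, Liu & Lau, 1996) steer the GA with fitness functions that punish undesirable feature sets. The generic sketch below is our own illustration of the idea: the linear penalty and the coefficient alpha are assumptions, not the polynomial or threshold-based formulas of those papers.

```python
def penalized_fitness(accuracy, n_selected, n_total, alpha=0.05):
    """Generic penalty-style fitness: reward classification accuracy,
    punish large feature subsets. The linear form and alpha = 0.05
    are illustrative; the cited papers use a polynomial punishment
    function and a threshold-based penalty formula, respectively.
    """
    return accuracy - alpha * (n_selected / n_total)
```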
Several points
are clear from the above review: (a) Genetic Algorithms are effective
tools for reducing the feature-dimensionality of character (or
signature) recognition problems, while maintaining a high level of
accuracy (of recognition); (b) GFS has only been used off-line due
to the time it takes to run a GA in the wrapper approach; and
finally, (c) there is no published work that focuses on the more
complicated problem of automatic feature weighting using GA for
character recognition applications.
3 Approach and Platform
3.1 Conceptual Approach
Several feature weighting methods for instance based learning
algorithms exist. For example, the weighting can be global, meaning
that there is a single weight set for the classification task, or it
can be local, in which weights vary over local regions of the
instance space (Howe & Claire, 1997). Moreover, the weights can have
continuous real values or binary values. In addition, the method for
assigning weights can be guided by the classifier performance (the
wrapper approach) or not (the filter approach). For an
extensive review, (Wettschereck, Aha & Mohri, 1997) provide a
five-dimensional framework that categorizes different feature
weighting methods. Also, (Dash & Liu, 1997) categorize different
feature selection methods according to the search technique used and
the evaluation function.
We are obliged to use a wrapper or embedded approach, despite their
relative computational inefficiency, because the classifier is
needed to determine the relevance of features. In any case, a classifier
is necessary, and the k nearest neighbour (kNN) classifier is our
first choice because of its excellent asymptotic accuracy,
simplicity, speed of training, and its wide use (by researchers in
the area). GAs are appropriate for feature selection and weighting.
GAs are suitable for large-scale, non-linear problems that involve
systems which are vaguely defined. Further, character recognition
systems (a) often use a large number of features, (b) exhibit a high
degree of inter-feature dependency, and (c) are hard, if not
impossible, to define analytically. Recent empirical evidence shows
that GAs are more effective than sequential forward and backward
floating search techniques in small and medium scale problems
(Kudo & Sklansky, 2000).
We built a
pattern recognition experimental bench with the following modules:
(a) an evaluation module; (b) a classifier module; (c) a feature
extraction and weighting module (FEW); and (d) a GA optimization
module. The evaluation module is essentially the fitness function.
The classifier module implements the kNN classifier. The FEW module
applies certain functions that measure a select set of features, and
can assign weights to the selected features, before presenting them
to the classification module. The FEW module is configured by the
last (and most important) module: the GA optimizer (or simply GA).
The purpose of the GA is to find a set of features and associated
weights that will optimize the (overall) performance of the pattern
recognition system, as measured by a given fitness function.
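The wrapper evaluation at the heart of this design can be sketched as follows: the GA hands the FEW module a candidate weight vector, and the fitness is simply the accuracy of the kNN classifier under the correspondingly weighted distance. This minimal Python sketch (with k = 1, and assuming pre-normalized data) uses our own function and variable names, not the actual interfaces of our experimental bench.

```python
import numpy as np

def weighted_1nn_accuracy(train_X, train_y, test_X, test_y, w):
    """Fitness of a candidate weight vector w: classify every test
    instance with a 1-NN rule under the weighted Euclidean distance
    of Eq. 2, and return the fraction classified correctly.
    """
    correct = 0
    for x, label in zip(test_X, test_y):
        # Weighted squared distances from x to every training instance.
        d = np.sum(w * (train_X - x) ** 2, axis=1)
        if train_y[np.argmin(d)] == label:
            correct += 1
    return correct / len(test_y)
```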
3.2 Experimental Platform
The
experimental work described in the following sections is implemented
using publicly accessible character databases and GAs. This makes it
possible for researchers to replicate our work and attempt to
improve upon our results. The experiments were run on a Pentium III
(500MHz) PC with 128MB of RAM and running Windows 98.
The Machine
Learning Data Repository at the University of California at Irvine
(Murphy & Aha, 1994) is our source of databases. We used three
different handwritten digits databases:
· "Optical Recognition of Handwritten Digits" database (or DB1),
which consists of 64 features.
· "Pen-Based Recognition of Handwritten Digits" (or DB2), which
consists of 16 features.
· "Multiple Features Database" (or DB3), which is divided into six
feature sets. Of those, we have only used the last two feature
sets, which contain 47 and 6 features, respectively.
The Genetic
Algorithm used for optimization is the Simple Genetic Algorithm (or
SGA) described by Goldberg (Goldberg, 1989). The actual software
implementation used comes from the "GAlib" GA library provided by
the Massachusetts Institute of Technology (Matthew, 1999). In this
SGA we used non-overlapping populations, roulette-wheel selection
with a degree of elitism, as well as two-point crossover. Unless
stated otherwise below, the GA parameters are as follows: the
crossover probability Pc was 0.9. As for mutation, we used two styles:
flip mutation for GFS, and Gaussian mutation
for GFW. Gaussian
mutation uses a bell-curve around the mutated value to determine the
random new value. Under this bell-shaped area, values that are
closer to the current value are more likely to be selected than
values that are farther away. The mutation probability Pm was 0.02,
the number of generations Ng was 50, and the population size Pop was also 50. The
fitness function was the classification accuracy of our own
1-nearest-neighbour (1-NN) classifier.
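To make the configuration concrete, the following plain-Python sketch mirrors the SGA set-up described above (non-overlapping population, roulette-wheel selection with one elite, two-point crossover at Pc = 0.9, flip or Gaussian mutation at Pm = 0.02, Pop = Ng = 50). Our actual experiments used the C++ GAlib implementation; everything below, including the Gaussian sigma, is an illustrative stand-in.

```python
import random

POP, NGEN, PC, PM = 50, 50, 0.9, 0.02  # parameters from the text

def roulette_select(pop, fits):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its fitness."""
    r, acc = random.uniform(0, sum(fits)), 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def two_point_crossover(a, b):
    """Swap the segment between two random cut points."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def flip_mutate(bits):
    """GFS-style flip mutation on a binary chromosome."""
    return [1 - g if random.random() < PM else g for g in bits]

def gaussian_mutate(w, sigma=0.1):
    """GFW-style Gaussian mutation: perturb a weight with a bell curve
    centred on its current value, clipped to [0, 1]. The sigma value
    is an illustrative assumption."""
    return [min(1.0, max(0.0, g + random.gauss(0.0, sigma)))
            if random.random() < PM else g for g in w]

def run_sga(fitness, n_features, mutate=flip_mutate):
    """Non-overlapping SGA with a degree of elitism (the best
    individual is copied unchanged into the next generation)."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(POP)]
    for _ in range(NGEN):
        fits = [fitness(ind) for ind in pop]
        nxt = [pop[fits.index(max(fits))][:]]  # elitism
        while len(nxt) < POP:
            p1, p2 = roulette_select(pop, fits), roulette_select(pop, fits)
            if random.random() < PC:
                p1, p2 = two_point_crossover(p1, p2)
            nxt += [mutate(p1), mutate(p2)]
        pop = nxt[:POP]
    return max(pop, key=fitness)
```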