Using Genetic Algorithms for Feature Selection and Weighting in Character Recognition System (cont'd)

 

4 Results and Analysis

 

This section has two parts: (a) a comparative empirical study of GA-based feature selection and GA-based feature weighting algorithms, within the context of character recognition systems (section 4.1); and (b) an empirical evaluation of the effectiveness of GA-based feature selection for off-line optimization of character recognition systems (section 4.2).

 

4.1 Comparative Study Results

 

Following are four empirical studies that compare the performance of GA-based feature selection (GFS) with that of GA-based feature weighting (GFW), with respect to (a) the number of eliminated features and (b) classification accuracy. In all the experiments, 1-NN stands for a 1-nearest neighbour classifier (i.e. no GA-based optimization), FS stands for GFS, and XFW stands for GFW with X weight levels.

 

4.1.1 The effect of varying the number of values that weights can take on the number of selected features

 

It is known that feature selection, generally, reduces the cost of classification by decreasing the number of features used (Dash & Liu, 1997). GFW methods should, theoretically, have the same potential (because GFS is a special case of GFW). However, it has been argued that in reality, GFS eliminates many more features than GFW. How many more, though, is unknown. Since the essential difference between GFS and GFW is the number of values that weights can take, we decided to study the relationship between the number of weight values and the number of eliminated features. We also tested a method for increasing the ability of GFW to eliminate features.

            The database used in this experiment is DB1, which contains a relatively large number of features (64). The error estimation method is leave-one-out cross-validation, applied to the training data itself; the number of training samples is 200. The resultant weights are then assessed on a validation set of 200 samples that is not used during training. The results of the experiments are presented in tables 1 and 2; the following description of the columns applies to both tables. The first column gives the method of feature selection or weighting: (a) 1-NN, meaning a 1-NN classifier is applied directly to the full set of features, with no prior selection or weighting; (b) FS, meaning feature selection is applied before any classification is carried out; or (c) FW (with different weighting schemes), meaning feature weighting is applied prior to 1-NN classification. The second column, the increment value, shows the difference between any two successive weight levels. The third column gives the total number of levels a weight can take: a weight takes one of a discrete set of values in the range [min, max], and the difference between any two consecutive values is the increment value. The fourth column is simply the inverse of the number of levels, termed the 'probability of zero'; the reason for this terminology will be made clear presently. The fifth and sixth columns show classification accuracies for the training and testing phases, respectively. The final column gives the number of eliminated features, which we call the number of zero features; it is computed by subtracting the number of selected features from the total number of features (64). The values shown in columns 5-7 are averages over five identical runs.
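To make the quantities in these columns concrete, the sketch below shows one way they could be computed: weights are restricted to multiples of the increment value over [min, max], a candidate weight vector is scored by leave-one-out 1-NN accuracy, and 'zero features' are simply those whose weight is 0. The weighted Euclidean distance and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def weight_levels(increment, w_min=0.0, w_max=10.0):
    """Discrete values a weight may take; e.g. increment 0.5 over [0, 10]
    gives the 20+1 levels of the first FW row in Table 1."""
    return np.arange(w_min, w_max + 1e-9, increment)

def loo_1nn_accuracy(X, y, weights):
    """Leave-one-out accuracy of a 1-NN classifier under a weighted
    Euclidean distance (an assumed, commonly used metric)."""
    X = np.asarray(X, dtype=float)
    correct = 0
    for i in range(len(X)):
        diff = (X - X[i]) * weights            # scale each feature by its weight
        dist = np.sqrt((diff ** 2).sum(axis=1))
        dist[i] = np.inf                       # exclude the held-out sample itself
        correct += int(y[np.argmin(dist)] == y[i])
    return correct / len(X)

def count_zero_features(weights):
    """'Zero features' are those whose weight is exactly 0, i.e. eliminated."""
    return int((np.asarray(weights) == 0).sum())
```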

 

Method of Selection/Weighting | Increment Value | Number of Levels | Probability of Zero | Accuracy of Training (%) | Accuracy of Testing (%) | Number of Zero Features
----------------------------- | --------------- | ---------------- | ------------------- | ------------------------ | ----------------------- | -----------------------
1-NN | -                 | -      | -      | 97.5 | 84.5 | -
FS   | 1                 | 2      | 1/2    | 99.4 | 84.8 | 27
FW   | 0.5 (1/2)         | 20+1   | 1/21   | 99   | 84   | 3
FW   | 0.25 (1/4)        | 40+1   | 1/41   | 98.9 | 83.8 | 1
FW   | 0.125 (1/8)       | 80+1   | 1/81   | 99   | 83.5 | 0
FW   | 0.0625 (1/16)     | 160+1  | 1/161  | 98.9 | 83.9 | 0
FW   | 0.03125 (1/32)    | 320+1  | 1/321  | 98.6 | 83.2 | 0
FW   | 0.015625 (1/64)   | 640+1  | 1/641  | 99   | 83.2 | 0
FW   | 0.0078125 (1/128) | 1280+1 | 1/1281 | 98.9 | 83.7 | 0
FW   | 0.0039 (1/256)    | 2560+1 | 1/2561 | 98.2 | 83   | 0

Table 1: Accuracy of Recognition and Number of Zero Features for Various Selection and Weighting Schemes

 

 

Method of Selection/Weighting | Increment Value | Number of Levels | Probability of Zero | Accuracy of Training (%) | Accuracy of Testing (%) | Number of Zero Features
----------------------------- | --------------- | ---------------- | ------------------- | ------------------------ | ----------------------- | -----------------------
FS | 1 | 2 | 1/2 (0.5) | 99.4 | 84.8 | 27
FW: three discrete values: 0, 0.5, and 1 | 0.5 | 3 | 1/3 (0.33) | 97.8 | 83.3 | 18
FW: values in [0, 10], with weights < 1 forced to 0 | 0.125 | 81 | 8/81 (0.098) | 98.1 | 84.2 | 7
FW: values in [0, 10] | 1 | 11 | 1/11 (0.09) | 98.5 | 82.8 | 5
FW: six discrete values: 0, 0.2, 0.4, 0.6, 0.8, and 1 | 0.2 | 6 | 1/6 (0.166) | 98.3 | 83.3 | 8
FW: six discrete values: 0, 0.2, 0.4, 0.6, 0.8, and 1, with weights < 0.8 forced to 0 | 0.2 | 6 | 4/6 (0.66) | 96.6 | 81.6 | 38
FW: six discrete values: 0, 1, 2, 3, 4, and 5 | 1 | 6 | 1/6 | 97.9 | 83.4 | 9
FW: six discrete values: 0, 1, 2, 3, 4, and 5, with weights < 4 forced to 0 | 1 | 6 | 4/6 (0.66) | 97.4 | 81.2 | 34

Table 2: Accuracy of Recognition and Number of Zero Features for Various Selection and Weighting Schemes, Some with Low Weights Forced to Zero

 

The following observations can be made, based on the results in table 1.

 

  • Although feature selection succeeded in eliminating roughly 42% of the original set of features (27 out of 64), classification accuracy did not suffer as a result. On the contrary, training accuracy increased to 99.4% from the 97.5% realized by the 1-NN alone. Also, testing accuracy increased to 84.8% from the 84.5% value achieved by the 1-NN classifier (alone).
  • FS far outperforms FW in terms of the number of zero features. Training accuracies achieved by both FS and FW are better than those achieved by the 1-NN classifier alone. On the testing set, the accuracies achieved by FW range from slightly worse to clearly worse than those of the 1-NN classifier, whereas the accuracies achieved by FS are slightly better than those of the 1-NN classifier alone (and hence better than those of FW as well).

 

FS does not eliminate features at the expense of classification accuracy. This is because the fitness function used is dependent on classification accuracy only. There is no selective pressure to find weight sets that are (e.g.) smaller.
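A minimal sketch of this point, reusing the illustrative loo_1nn_accuracy helper from the earlier sketch: the fitness of a candidate chromosome is just the resulting classification accuracy, so eliminating features is a side effect rather than an explicit objective.

```python
def fitness(mask, X, y):
    """GA fitness of a feature-selection chromosome.
    mask: binary vector, 1 = feature kept, 0 = feature eliminated."""
    return loo_1nn_accuracy(X, y, weights=mask)   # accuracy only; no size penalty

# If selective pressure toward smaller subsets were desired (it is not used
# here), a penalised variant could subtract a term such as
#   alpha * mask.sum() / len(mask)
# from the accuracy.
```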

 

The following observations can be made, based on the results in table 2.

 

  • The number of zero features is greater when the weights are restricted to a small finite set of values than when they are drawn from a densely sampled (effectively continuous) range of real numbers.
  • Increasing the number of levels beyond a certain threshold (81) reduces the number of zero features to nil.
  • When we use FW with the discrete values (0, 1, 2, 3, 4, 5) and force all weights less than four to zero, FW actually outperforms FS in the number of zero features. It is worth noting that the probability of zero for this FW configuration is 0.66, compared to 0.5 for FS.
  • Regardless of the method of selection or weighting, the number of zero features appears to be influenced only by the number of levels that weights can take. For example, using six discrete values in two different configurations, (0, 0.2, 0.4, 0.6, 0.8, 1) and (0, 1, 2, 3, 4, 5), produces almost identical numbers of zero features: 8 and 9, respectively.

 

All the points above suggest that, generally, the greater the total number of weight levels, the less likely it will be that any of the features will have zero weights. Whether this apparent relationship is strictly proportional or not is investigated further below.

Using data from tables 1 and 2, the relationship between the number of zero features and the number of levels has been plotted; it is depicted in Fig. 1a.

 


Figure 1a: Relationship between Number of Weight Levels and Actual Number of Zero Features

 

Figure 1a shows the empirical relationship between the number of weight levels and the actual number of zero (or eliminated) features. The relationship is close to inversely proportional. This is credible evidence that (a) the number of eliminated features is mainly a function of the number of weight levels; and hence (b) that the main reason behind the superiority of feature selection over feature weighting (in eliminating features) is the smaller number of weight levels FS uses.


Figure 1b: The Number of Eliminated (or Zero) Features as a Function of the Probability of Zero

 

It further appears from Figure 1b that the relationship between the 'probability' of zero and the actual number of zero features is roughly linear, though not strictly proportional. If it were strictly proportional, then the 'probability' of zero would likely represent a real probability, hence the term. As things stand, the empirical results support the claim that the underlying distribution of weight values is (roughly speaking) uniform.
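As a quick sanity check of this interpretation (an illustrative calculation, not a result from the paper): if each of the 64 weights were drawn uniformly from L levels, the expected number of zero-weight features would be 64/L, which tracks the values reported in Table 1 reasonably well.

```python
# Expected number of zero features under a uniform choice among L levels.
d = 64                      # total number of features in DB1
for levels, reported in [(2, 27), (21, 3), (41, 1), (81, 0)]:
    expected = d / levels   # d * (probability of zero)
    print(f"L = {levels:3d}: expected ~ {expected:4.1f}, reported in Table 1: {reported}")
```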

In conclusion, feature selection is clearly superior to feature weighting in terms of feature reduction. The main reason for this superiority appears to be the smaller number of weight levels that feature selection uses (2), compared to feature weighting (potentially infinite). However, feature weighting can be made as effective as feature selection at eliminating features by, for example, forcing all weights below a given threshold to zero.
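A sketch of that thresholding idea, under assumed names: weights below a cut-off are forced to zero after decoding, which raises the effective probability of zero (e.g. the cut-off of 4 on the 0..5 scale in Table 2 gives 4/6 ≈ 0.66 and eliminated 34 features).

```python
import numpy as np

def force_small_weights_to_zero(weights, threshold):
    """Force every weight below `threshold` to zero, so the corresponding
    features are eliminated outright (illustrative helper)."""
    w = np.asarray(weights, dtype=float).copy()
    w[w < threshold] = 0.0
    return w
```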

 

4.1.2 Studying the performance of both FS and FW in the presence of irrelevant features

 

For most classification problems, the relevant features are not known in advance. Therefore, many more features than necessary may be added to the initial set of candidate features, and many of these can turn out to be either irrelevant or redundant. Kohavi & John (1996) define two types of relevant (and hence irrelevant) features: features are either strongly relevant or weakly relevant, otherwise they are irrelevant. A strongly relevant feature is one that cannot be removed without degrading the prediction accuracy of the classifier (in every case). A weakly relevant feature is one that sometimes enhances accuracy, while an irrelevant feature is neither strongly nor weakly relevant. Irrelevant features lower classification accuracy while increasing the dimensionality of the problem, and redundant features have the same drawbacks of accuracy reduction and dimensionality growth. As a result, removing these features by either feature selection or feature weighting is required. Wettschereck et al. (1997) claim that domains that contain either (a) subsets of equally relevant (i.e. redundant) features or (b) any number of [strongly] irrelevant features are best suited to feature selection. We investigate this claim by comparing the performance of GFS to that of GFW in the presence of irrelevant and (later) redundant features. It is worth noting that Wilson and Martinez (1996) compare the performance of genetic feature weighting (GFW) with the non-weighted 1-NN for domains with irrelevant and redundant features. They use a GA to find the best possible set of weights, i.e. the one that gives the highest possible classification rate, and they show that, in the presence of irrelevant and redundant features, GFW gives significantly better results than the non-weighted algorithm. However, they do not compare GFW to GFS for classification tasks with irrelevant or redundant features. This gives us a good opportunity to see how well feature selection tolerates irrelevant and redundant features, as opposed to feature weighting.

            In this experiment we use the dataset within DB3 that contains 6 features. We gradually add irrelevant features, which are assigned uniformly distributed random values, and observe the classification accuracy of GA-based feature selection, GA-based feature weighting, and a 1-NN classifier unaided by any kind of FS or FW. During GA evaluation, we use the holdout method of error estimation: the samples are split into a training set, a testing set, and a validation set. The training samples are used to build the nearest-neighbour classifier, while the testing samples are used during GA optimization. After GA optimization finishes, a separate validation set is used to assess the resulting weights. The number of training samples is 1000, the number of testing samples is 500, and the number of validation samples is 500. To avoid any bias due to the random partitioning of the holdout set, random sub-sampling is performed: the process is repeated 5 times, and the accuracy results reported are averages over the 5 runs.
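The sketch below illustrates this protocol under stated assumptions: irrelevant features are appended as uniformly distributed random noise, the data is split 1000/500/500 into training, testing (used by the GA) and validation sets, and the whole split is repeated five times with accuracies averaged. evaluate_candidate is a hypothetical stand-in for the GA-optimised 1-NN evaluation.

```python
import numpy as np

def add_irrelevant_features(X, n_irrelevant, rng):
    """Append n_irrelevant columns of uniformly distributed random values."""
    X = np.asarray(X, dtype=float)
    noise = rng.uniform(X.min(), X.max(), size=(X.shape[0], n_irrelevant))
    return np.hstack([X, noise])

def random_subsampling(X, y, evaluate_candidate, n_runs=5,
                       sizes=(1000, 500, 500), seed=0):
    """Average accuracy over repeated random train/test/validation splits."""
    rng = np.random.default_rng(seed)
    n_train, n_test, n_val = sizes
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    accuracies = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        tr = idx[:n_train]
        te = idx[n_train:n_train + n_test]
        va = idx[n_train + n_test:n_train + n_test + n_val]
        # evaluate_candidate builds the 1-NN on the training split, runs the GA
        # against the testing split, and reports accuracy on the validation split.
        accuracies.append(evaluate_candidate(X[tr], y[tr], X[te], y[te], X[va], y[va]))
    return float(np.mean(accuracies))
```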

The results of the experiments are shown in figures 2 and 3 below. Figure 2 shows classification accuracy (for the various selection and weighting schemes) as a function of the number of irrelevant features included in the initial feature set. Figure 3 shows the number of eliminated features as a function of the number of irrelevant features. In the figures, 3FW stands for feature weighting with 3 discrete equidistant weight levels (0, 0.5, 1), 5FW stands for FW with 5 discrete equidistant weight levels, and 33FW stands for FW with 33 discrete equidistant weight levels.

 

 

 

Figure 2: Classification Accuracy as a Function of the Number of Irrelevant Features

 

Figure 3: Number of Eliminated Features as a Function of the Number of Irrelevant Features

 

 

From figures 2 and 3 we can conclude the following:

 

  • As the number of irrelevant features increases, the classification accuracy of the 1-NN classifier degrades rapidly, while the accuracies attained by FS and FW degrade slowly. Nearest-neighbour algorithms therefore need feature selection/weighting in order to eliminate or de-emphasize irrelevant features, and hence improve accuracy.
  • As the number of irrelevant features increases, FS outperforms every feature weighting configuration (3FW, 5FW and 33FW), with respect to both classification accuracy and elimination of features. However, when the number of irrelevant features is only 4, 3FW returns slightly better accuracy than FS; this changes as soon as the number of irrelevant features grows.
  • When the number of irrelevant features is 34, the difference in accuracy between FS and the best-performing FW configuration (3FW) is considerable, at 6%.
  • As the number of weight levels increases, the classification accuracy of FW in the presence of irrelevant features decreases.
  • As the number of weight levels increases, the number of eliminated features decreases.
  • When the number of irrelevant features reaches 54, the classification accuracy of FS drops to a value comparable to that of the FW configurations. However, FS continues to eliminate more features than any FW configuration.
  • The gap in the number of features eliminated by FS, compared to FW and 1-NN, widens as the number of irrelevant features increases.

 

Why does feature weighting perform worse than feature selection in the presence of irrelevant features? Because it is hard to explore a space of real-valued weight vectors in R^d when d is large, where R is the set of values a weight can take and d is the number of features: with L discrete weight levels, this space contains L^d candidate weight vectors. In the case of feature selection, by contrast, the search space is only 2^d in size. For example, with 10 irrelevant features, the search space for FS is 2^10 (= 1024), compared to 11^10 (= 25,937,424,601) for FW using 11 weight levels. Moreover, a single increase in the number of weight levels, from 2 to 3, increases the size of the search space from 2^10 (= 1024) to 3^10 (= 59,049). As witnessed earlier, as the number of weight levels increases, the number of eliminated features decreases. Therefore, FS will always eliminate more features than FW, given the same computational resources. In addition, in applications with large sets of features (which implies a high degree of irrelevancy/redundancy among features), feature selection has been shown to give better results than IB4, an on-line feature weighting algorithm (Aha & Bankert, 1994).
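Computing the search-space sizes quoted above directly (a trivial but clarifying check): with d features and L weight levels, the GA must search L^d candidate weight vectors.

```python
d = 10                       # e.g. the 10 irrelevant features in the example
for levels in (2, 3, 11):    # FS, FW with 3 levels, FW with 11 levels
    print(f"{levels:2d} levels over {d} features -> {levels ** d:,} candidates")
# 2 levels  -> 1,024
# 3 levels  -> 59,049
# 11 levels -> 25,937,424,601
```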

We conclude that in the presence of irrelevant features, feature selection and not feature weighting is the technique most suited to feature reduction. Furthermore, it is necessary to use some method of feature selection before a 1-NN or similar nearest-neighbour classifier is applied, because the performance of such classifiers degrades rapidly in the presence of irrelevant features. Since it has been shown that GA-based feature selection is effective in eliminating irrelevant features (here and in the literature), it seems sensible to try a GA (at least) as a method of feature selection before classification is carried out using nearest-neighbour classifiers.

 
