Using Genetic Algorithms for Feature Selection and Weighting in Character Recognition System (cont'd)

 

4.1.3 Performance of Both FS and FW in the Presence of Redundant Features

 

In this experiment, we used the dataset with 6 features in DB3. We randomly selected one feature from the dataset and appended copies of it several times over. We then observed the classification accuracy of GA-based FS, GA-based FW, and the unaided 1-NN classifier. The error estimation method is the same as that used in section 4.1.2 (above). The results are shown in figures 4 and 5. Figure 4 shows the empirical relationship between the number of redundant features present in the initial feature set and the classification accuracy of the 1-NN classifier, acting alone and with the help of GA-based feature selection/weighting. Figure 5 presents the relationship between the number of redundant features and the number of features eliminated by FS and FW. A 1-NN on its own does not eliminate any features, of course.
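For concreteness, a redundant-feature set of this kind can be built by duplicating one column of the data matrix. The following sketch (our own NumPy illustration of the setup just described, not the authors' code) shows one way to do it:

```python
import numpy as np

def add_redundant_copies(X, n_copies, seed=None):
    """Return X with n_copies duplicates of one randomly chosen column appended.

    Illustrative helper only: mirrors the experimental setup of duplicating
    a randomly selected feature to create redundant features.
    """
    rng = np.random.default_rng(seed)
    col = rng.integers(X.shape[1])               # pick one feature at random
    copies = np.repeat(X[:, [col]], n_copies, axis=1)
    return np.hstack([X, copies])                # original d + n_copies features
```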

 

Figure 4: Classification Accuracy as a Function of the Number of Redundant Features

 

Figure 5: Number of Eliminated Features as a Function of the Number of Redundant Features

 

From figures 4 and 5, the following observations can be drawn:

 

  • Contrary to the case of irrelevant features, the classification accuracy of the 1-NN classifier degrades only gradually as a function of the number of redundant features.

  • In experiments with 4, 14, and 24 redundant features, FS classification accuracies are almost identical to those achieved by 3FW. For the same numbers of redundant features, the other feature weighting configurations returned classification accuracies that are only slightly worse.

  • FS outperforms all FW configurations in experiments where the number of redundant features is 34 or greater.

  • As the number of weight levels increases, the classification accuracy of the feature weighting configurations slightly decreases.

  • Although the difference in classification accuracy between FS and FW is not very large, the difference in the number of eliminated features (shown in figure 5) is significant.

  • The difference in performance between FS and FW widens as the number of redundant features climbs. When the number of redundant features reaches 54, FS succeeds in eliminating 34 redundant features, 11 more than the best-performing FW configuration (3FW).

 

Why does the performance of the 1-NN classifier not degrade quickly as the number of redundant features increases? The reason is that repeatedly adding copies of an existing feature to a feature set (to act as redundant features) is like giving that feature an added weight equal to the number of copies. For example, if a certain feature is added 20 times, this has the same effect as using the feature once but multiplying it by a weight of 20. Completely irrelevant features, on the other hand, add randomness and noise to the feature set, causing higher classification error rates. Though the presence of redundant features in the training data might not significantly decrease the accuracy of the classification algorithm, it certainly worsens the problem of dimensionality. Moreover, giving a certain feature a higher weight simply because it was repeated amounts to an arbitrary procedure (Wilson & Martinez, 1996).
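This equivalence is easy to verify numerically for the squared Euclidean distance used by 1-NN: appending k extra copies of a feature changes distances exactly as if that feature's squared difference were weighted by k + 1. A small sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(5), rng.random(5)    # two 5-feature samples
k = 20                                 # extra copies of feature 0

# Squared distance with feature 0 duplicated k more times
x_dup = np.concatenate([x, np.full(k, x[0])])
y_dup = np.concatenate([y, np.full(k, y[0])])
d_dup = np.sum((x_dup - y_dup) ** 2)

# Same value from weighting feature 0's squared difference by k + 1
w = np.ones(5)
w[0] = k + 1
d_wgt = np.sum(w * (x - y) ** 2)

assert np.isclose(d_dup, d_wgt)        # duplication == weighting
```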

In conclusion, the classification accuracy of a 1-NN classifier does not suffer greatly as a result of having a significant number of redundant features in the (original) feature set. However, because of the other problems associated with redundant features, namely increased dimensionality (and hence computational cost) and arbitrariness of procedure, as well as slightly worse classification accuracies, it is recommended that a GA-based feature selection method be used to eliminate as many of the redundant features as possible. The methods best suited to this task are clearly FS and 3FW; which one to choose depends on the need for real-valued weights vs. computational efficiency.

 

4.1.4 Performance of Both FS and FW for Real-World and Regular Databases

 

(Wettschereck et al., 1997) state that feature weighting (GA-based or not) is more suitable than feature selection for domains in which features vary widely in relevance. On the other hand, (Kohavi, Langley & Yun, 1997) show that, for many real-world datasets, increasing the number of weights above two rarely reduces classification error: using only two weights (which is equivalent to feature selection) gives better results than enlarging the set of weights. It is worth noting that the results of (Kohavi et al., 1997) are for small datasets (fewer than 300 training samples) and use a best-first search algorithm. In contrast, we compare feature selection to feature weighting on real-world and regular databases with 1000+ training samples, using GAs as the search method.

Presented here is a study of the classification accuracies achieved by the 1-NN classifier and the various GA-based feature selection (FS) and weighting (FW) configurations. The focus of this study is on assessing the true real-world performance of such configurations. This requires (a) that real-world databases (containing human-generated samples) are used, and (b) that any results achieved using training sets are checked against separate results obtained with new (unseen) validation sample sets. The resulting accuracies for the training and validation sample sets are displayed in figures 6 and 7. The database used for figures 6 and 7 was the 6-feature dataset in DB3; for figures 8 and 9, we used DB2. The error estimation method is leave-one-out cross-validation, applied to 1000 training samples. The best weights are then tested on a separate set of 500 validation samples. To avoid bias, we randomly selected different training/validation sets for each experimental run. Each experiment is repeated 5 times, and the results reported are averages over those runs. The symbols on the X-axes indicate the following (a sketch of the fitness evaluation underlying these configurations appears after the list):

 

  • 1-NN: 1-nearest neighbour classifier (no GA).
  • FS: FS using 2 weights (0, 1).
  • FW3: FW using 3 weights (0, 0.5, 1).
  • FW5: FW using 5 weights (0, 0.25, 0.5, 0.75, 1).
  • FW17: FW using 17 weights in [0, 1], with an increment of 0.0625 (1/16).
  • FW33: FW using 33 weights in [0, 1], with an increment of 0.03125 (1/32).
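For reference, here is a minimal sketch of the wrapper fitness underlying all of these configurations: leave-one-out 1-NN accuracy under a candidate weight vector, with FS as the special case of 0/1 weights. The function name and NumPy implementation are our own illustration, not the authors' code:

```python
import numpy as np

def loo_1nn_accuracy(X, y, weights):
    """Leave-one-out accuracy of a 1-NN classifier under feature weights."""
    Xw = X * weights                                   # apply candidate weights
    # All pairwise squared Euclidean distances between weighted samples
    d = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)                        # exclude self-matches
    return float(np.mean(y[d.argmin(axis=1)] == y))    # nearest-neighbour vote

# Example: fitness of an FW5 chromosome decoded to one weight per feature
# acc = loo_1nn_accuracy(X_train, y_train, np.array([0.25, 1.0, 0.0, 0.75, 0.5, 1.0]))
```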

                       

Figure 6: Training Classification Accuracies for the Various Feature Selection and Weighting Methods (using DB3)

 

Results show that classification accuracy is worst for the (unaided) 1-NN but better for FS and FW. The trend line highlights the fact that the greater the number of weight levels, the higher the training accuracy achieved. Finally, it appears that classification accuracy cannot be improved much by increasing the number of weights beyond 17; the difference in classification accuracy between FW17 and FW33 is 0.02, which is insignificant.

Figure 7: Validation Classification Accuracies for the Various Feature Selection and Weighting Methods (using DB3)

 

Results show that classification accuracy is best for the (unaided) 1-NN and FS, but worse for FW. The trend line highlights the fact that the greater the number of weight levels, the lower the validation accuracy achieved. It also appears that classification accuracy is not degraded further by increasing the number of weights beyond 17; there is no difference in classification accuracy between FW17 and FW33. These results are the opposite of those found in figure 6: FW returns good accuracy on training data but poor accuracy on validation data. And it is the performance of a classifier on validation data that truly reflects its (real-world) predictive capacity.

To verify these interesting results, we ran the same set of experiments on another database (DB2), to ensure that the above results are independent of the particular sample database used. The results, shown in figures 8 and 9, indeed confirm those of figures 6 and 7.

 

Figure 8: Training Classification Accuracies for the Various Feature Selection and Weighting Methods (using DB2)

 

Figure 9: Validation Classification Accuracies for the Various Feature Selection and Weighting Methods (using DB2)

 

Based on the results of figures 6-9, we make the following observations:

 

  • For the validation sets, the classification accuracy of FS was slightly better than that of all the FW settings for both databases. However, it was the same as 1-NN for DB3 and slightly worse for DB2.
  • For the training sets, all FW settings have better classification accuracy than FS.
  • Increasing the number of weight levels led to a slight decrease in the classification accuracy on the validation sets.
  • Increasing the number of weight levels led to a slight increase in the classification accuracy on the training sets.

 

We believe this interesting disparity between training and validation results is due to over-fitting. FW allows a finer-grained representation of the search space, but at the expense of an increased classification error rate. In other words, what we are faced with here is a typical bias vs. variance tradeoff (Geman, Bienenstock & Doursat, 1992). Classification error can be decomposed into two components: bias and variance. Bias measures how much the error estimate deviates from the true value, whereas variance measures how much variability the classification shows across different training samples. For example, assume that we have an infinite number of classifiers built from different training sets of the same size. The bias of a learning algorithm is the expected error of the combined classifiers on new data; the variance is the expected error due to the particular training set used. Using a small number of weights therefore increases bias, owing to the coarser representation of the space, but also reduces variance, which in turn reduces the possibility of over-fitting the data (Kohavi et al., 1997). In contrast, allowing many weights reduces bias but increases variance, and with it the probability of over-fitting.
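For squared-error loss, the decomposition has a familiar closed form (stated here as the standard textbook identity, with D a random training set, \hat{f}_D the predictor trained on it, and f the true target; the 0/1-loss version relevant to classification is analogous but more involved):

```latex
\mathbb{E}_{D}\!\left[\big(\hat{f}_D(x) - f(x)\big)^{2}\right]
  \;=\; \underbrace{\big(\mathbb{E}_{D}[\hat{f}_D(x)] - f(x)\big)^{2}}_{\text{bias}^{2}}
  \;+\; \underbrace{\mathbb{E}_{D}\!\left[\big(\hat{f}_D(x) - \mathbb{E}_{D}[\hat{f}_D(x)]\big)^{2}\right]}_{\text{variance}}
```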

(Kohavi et al., 1997) also illustrate that increasing the number of weights above two hardly ever decreases classification error. In their study they show that, with 10 weight levels, FW fails to outperform FS on 11 real-world databases. They conclude that, "On many natural data sets, restricting the set of weights to only two alternatives - which is equivalent to feature subset selection - gives the best results". Though we used different search methods and larger datasets than Kohavi et al., we obtain similar results and conclusions.

In addition, we should take into account that the databases we use to report our results are handwritten character recognition databases. This means there are many variants of letter shape, size and, generally, style; different writers have different writing styles, and for the letters of an alphabet there is a nearly unlimited number of variations. Thus, feature weighting over-fits the training data by finding weights that suit it, but the weights obtained do not generally represent the underlying distribution, owing to the large variance among the training samples.

On the other hand, other papers have shown that feature weighting using GAs yields slightly better classification accuracy than feature selection for real-world databases. For example, (Punch et al., 1993) perform tests using binary and real-valued weights (with weights in [0, 10]) on a database of soil-sample images. Their results show that the error rates obtained using real-valued weights are better than those obtained with binary weights. Likewise, (Komosinski & Krawiec, 2000) provide further evidence that feature weighting is somewhat better than selection when applied to a brain-tumor diagnosis system: feature weighting (using weights in [0, 9]) leads to somewhat better classification accuracy rates than feature selection alone.

However, several points must be raised here. First, the results reported in (Komosinski & Krawiec, 2000) are those of the GA evaluation; no validation tests were run to test the resulting weights on separate data sets. Indeed, (Kohavi & Sommerfield, 1995) noticed this and state that separate holdout sets, never used during feature selection/weighting, should be used in the final test of performance. Second, the results reported by (Punch et al., 1993) were obtained on validation samples that were randomly drawn from the training samples, which biases the results in the same way. Third, the improvement in classification accuracy reported in both papers for FW over FS is so small that it does not, in itself, provide conclusive evidence that FW gives better results than FS on real datasets: Komosinski & Krawiec report classification accuracies of 83.43±1.93 and 77.83±2.03 (for two datasets) for FW, vs. 80±1.22 and 75.7±1.64 for FS, while Punch et al. report classification error rates for FW of 0.83% (training) and 2% (testing), vs. 1.66% and 3.2% for FS.

In conclusion, despite the fact that feature weighting achieves the best training classification accuracies, feature selection generalizes better and is hence more suited to real-world applications (in which most data is new). This is because FW over-fits the training data, losing generalizability in the process. It is therefore advisable to use at most 2 (FS) or 3 (FW3) weight levels, as (Kohavi et al., 1997) recommend.

 

4.2 Feature Selection Evaluation Results

 

The following are two empirical studies examining the effectiveness of GA-based feature selection (GFS) with respect to (a) finding a provably optimal set of features, and (b) doing so within an acceptable time frame for off-line applications (e.g., pre-release optimization of character recognition software).

 

4.2.1 Convergence of FS to an Optimal or Near-Optimal Set of Features

 

In the context of feature selection, several comparative studies involving Genetic Algorithms and other search and optimization algorithms are available. (Vafaie & De Jong, 1993) show that GAs outperform sequential backward selection in robustness, demonstrating that genetic algorithms improve the robustness of feature selection without a major increase in computational complexity. (Handels et al., 1999) compare a number of feature selection methods, namely Genetic Algorithms, other heuristic searches, and 'greedy' methods (forward and backward searches), in the domain of skin-tumor diagnosis; the best classification accuracy is obtained by the GA. (Estevez & Fernandez, 1999) compare Genetic Algorithms, statistical methods, and the leave-one-feature-out approach in the context of classifying wood-board defects; the GA gives the best performance of the three techniques, but at the expense of high computational complexity. (Kudo & Sklansky, 2000) provide a recent comparison of different feature selection algorithms (including GAs) for large feature spaces. They conclude that Genetic Algorithms are better suited than other feature selection algorithms to problems with more than 50 features; for problems of small (0-19 features) and medium (20-49 features) cardinality, however, GAs, though useful, are too slow to justify their use.

It is clear from the literature that several studies comparing GAs with other feature selection search algorithms already exist, so it is not our intention to repeat such work. However, there is a need to determine whether or not GFS does indeed return optimal or near-optimal feature sub-sets. This necessitates an exhaustive search of the feature space in order to find one or more bona fide optimal feature sub-sets, which can then be used to assess (with certainty) the optimality of the best GFS-generated feature sub-sets.

For the experiments in sections 4.2.1 and 4.2.2, we used a single training/testing set of size 1000 and 500, respectively. The numbers of features used in the searches are 8, 10, 12, 14, 16, and 18; we drew features from more than one dataset in DB3 for this purpose. There is no need for a separate validation set, since our objective here is to evaluate optimal performance, not to test generalizability.
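A sketch of the exhaustive baseline follows, under the assumption (consistent with the wrapper setup above) that each subset is scored by 1-NN classification accuracy; the function names are our own, not the authors' code:

```python
from itertools import product
import numpy as np

def exhaustive_best_subset(X, y, score):
    """Enumerate all 2^d - 1 non-empty feature subsets; return the best one.

    `score(X_subset, y)` is any accuracy estimator, e.g. the leave-one-out
    1-NN sketch given earlier with all-ones weights. Feasible only for
    small d: d = 18 already means 262,143 subsets.
    """
    d = X.shape[1]
    best_mask, best_acc = None, -1.0
    for bits in product((0, 1), repeat=d):
        if not any(bits):
            continue                          # skip the empty subset
        mask = np.array(bits, dtype=bool)
        acc = score(X[:, mask], y)
        if acc > best_acc:
            best_mask, best_acc = mask, acc
    return best_mask, best_acc
```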

 

Number of Features    Best Exhaustive (classification rate)    Best GA (classification rate)    Average GA (for 5 runs)
8                     74                                       74                               74
10                    75.2                                     75.2                             75.2
12                    77.2                                     77.2                             77.04
14                    79                                       79                               78.56
16                    79.2                                     79                               78.28
18                    79.4                                     79.4                             78.92

 

Table 3: Best Classification Accuracy Rates Achieved by GA and Exhaustive Search

 

The results presented in table 3 compare the best and average accuracy rates achieved via GA-based feature selection (FS) with the optimal accuracy rate found via an exhaustive search of the entire feature space.

Comparing the numbers in the best-GA column with those in the best-exhaustive column demonstrates the success of the GA as a feature set optimizer. In every case but one (16 features), the best accuracy achieved by the GA was identical to that found by an exhaustive search of the feature space. Furthermore, the fact that the worst average GA value lies less than 0.8 of a percentage point from the optimal value means that the GA consistently returned optimal or near-optimal accuracies. Hence, GA-based feature selection consistently converges to an optimal or near-optimal set of features.

 

4.2.2 Convergence of FS to an Optimal or Near-Optimal Set of Features within an Acceptable Number of Generations

 

In general, the run time of the GA is proportional to the number of features, the number of generations, and the size of the population (Kudo & Sklansky, 2000). Moreover, the time complexity of the nearest neighbour classifier is proportional to (t × d), where t is the number of training instances and d is the number of features used. A GA applied in a wrapper configuration around a nearest neighbour classifier therefore spends most of its time running the nearest neighbour classifier (Brill, Brown & Martin, 1992). Below we list the factors that affect the time complexity of our GFS or GFW; a rough cost bound combining them is sketched after the list:

 

  • Number of generations
  • Population size
  • Number of features
  • Number of training instances
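Combining these factors gives a rough back-of-the-envelope bound on the wrapper's running time (our own sketch, assuming leave-one-out fitness evaluation, no fitness caching, and one 1-NN run per individual):

```latex
T_{\text{GFS}} \;\propto\; G \cdot P \cdot t^{2} \cdot d
```

where G is the number of generations, P the population size, t the number of training instances (t² because each of the t left-out samples is compared with the remaining t − 1), and d the number of features.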

 

In this evaluation, we investigate the relationship between the number of features (in the original set) and the number of generations required to reach an optimal or near-optimal sub-set, while keeping the other two factors (population size and number of training instances) constant.

The results of experimentation are shown in table 4. The second column of table 4 shows the best (i.e., optimal and sub-optimal) classification accuracy rates attained using an exhaustive search, and the third column the duration of that search. The fourth and fifth columns contain the best and average results achieved via a GA-based search, the sixth column the number of generations the GA needed, and the last column the time a single GA run took.

 

Number of Features    Best Exhaustive (optimal and sub-optimal)    Exhaustive Run Time    Best GA    Average GA (for 5 runs)    Number of Generations    GA Run Time (single run)
8                     74, 73.8                                     2 minutes              74         73.68                      5                        2 minutes
10                    75.2, 75                                     13 minutes             75.2       74.96                      5                        3 minutes
12                    77.2, 77                                     47 minutes             77         76.92                      10                       5 minutes
14                    79, 78.8                                     3 hours                79         78.2                       10                       5.5 minutes
16                    79.2, 79                                     6 hours                79.2       78.48                      15                       8 minutes
18                    79.4, 79.2                                   1.5 days               79.4       78.92                      20                       11 minutes

 

Table 4: Number of Generations to Convergence

 

The following observations can be made:

 

  • In every case, GA-based FS found the optimal or sub-optimal values found by exhaustive search, but in much less time. With only 8 features, the run times of the exhaustive and GA searches are identical; for 18 features, however, the exhaustive search took 1.5 days compared to 11 minutes for the GA.
  • As the number of features increases, the number of generations required to find optimal or sub-optimal values increases as well.

 

Figure 10 below displays the relationship between the number of generations needed to reach optimal or sub-optimal values with a GA and the number of features in the original set. The small triangles represent actual data from table 4 (above). There are also two curves in figure 10. The solid line is a best-fit piece-wise linear curve that fits, and extrapolates, the data points; its extrapolated segment (beyond 18 features) represents the most optimistic projection of GA run time. The dotted line is an exponential curve fitted to, and extrapolating, the same data points; its extrapolated segment (again, beyond 18 features) represents the most pessimistic projection of GA run time. Extrapolation is used because exhaustive searches of large feature spaces would take a prohibitively long time on a single-processor computer. In any case, one can safely conclude that the time needed for a GA-based search is bounded below by the (optimistic) best-fit curve and above by the (pessimistic) exponential curve.

 

 

Figure 10: Number of Generations to Convergence as a Function of Number of Features

 

Using exponential extrapolation, the expected number of generations for 60 features is 10585, while linear extrapolation gives only 89; the true value should lie somewhere in between. If we assume that the true value is midway between 89 and 10585 (i.e., 5337), then running the GA will be computationally expensive. In such cases, using parallel GA methods, such as DVeGA (Moser, 1999), becomes necessary. The authors are currently building a 10-processor dedicated Linux-PC cluster, which will allow genuine implementations of parallel GA architectures, including the one just mentioned. This is important, since the relative value of GA-based techniques is most pronounced when the search (e.g., feature) spaces involved are very large.
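The two projections can be reproduced approximately by fitting a line to the (features, generations) pairs of table 4 directly (optimistic) and in log space (pessimistic, i.e. an exponential). A sketch with NumPy follows; note that it will not reproduce the exact figures of 89 and 10585, since the paper's optimistic curve is piece-wise linear rather than a single least-squares line:

```python
import numpy as np

# (number of features, generations to convergence) pairs from table 4
feats = np.array([8, 10, 12, 14, 16, 18], dtype=float)
gens  = np.array([5, 5, 10, 10, 15, 20], dtype=float)

# Optimistic projection: straight line through the data
slope, intercept = np.polyfit(feats, gens, 1)
lin_60 = slope * 60 + intercept

# Pessimistic projection: exponential fit, i.e. a line in log space
b, log_a = np.polyfit(feats, np.log(gens), 1)
exp_60 = np.exp(log_a + b * 60)

print(f"60 features: ~{lin_60:.0f} (linear) to ~{exp_60:.0f} (exponential) generations")
```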

We conclude that GA-based feature selection is a reliable method of locating optimal or near-optimal feature sub-sets, and one that saves considerable time relative to exhaustive search. However, its effective use in large feature spaces depends on specialized feature selection/weighting methods (e.g., DVeGA) and/or the availability of parallel processors.

 
