Testing

Introduction

In software testing, we want to avoid software failure. We want to show that the system we have built can effectively and properly be put into operation, by inspecting its reactions in different situations, given different inputs, and verifying that it reacts properly to them.

There are essentially two ways of ensuring that the number of faults in code is small or — ideally — zero. One is to prove that the code has no faults and the other is to perform tests designed to find all faults. Much effort has been put into proof techniques for programs, but the resulting tools and techniques are not widely used in industry. (The main exception is in the UK, where there are legal requirements for certification for certain kinds of critical application.) Consequently, the main technique for fault detection is testing. Dijkstra has said:

testing can reveal only the presence of faults, never their absence. 

In other words, you can never be certain that your program contains no faults but, with sufficient testing, you can be confident that the number of remaining faults is acceptably small.

Another, in-between, alternative is to rely on formal system specifications to build a model (or theory) of expected system behavior, which can be mathematically proven to be complete, and then to derive test cases that properly test all aspects of the system specifications, thus, in theory, proving the system to be totally tested.

That is a strict application of the scientific method. However, any scientist should be aware that any theory is valid within a certain context of application: a given valid theory can be proven to be correct in a given situation, but can be invalidated if put in a higher-order situation. For example, Newton's theory of gravitation is observed to be perfectly valid in the vast majority of everyday situations, but fails in extreme situations where relativistic effects come into play through extreme speed or gravitational effects, as pointed out by Einstein's theory of relativity. In fact, nothing tells us that Einstein's theory of relativity is valid in all parts of the Universe.

Following this scientific reasoning, formal specifications can only build theories that are applicable within an established context. Basing the testing of an application on such a model therefore only tests the validity of the system with regard to the specification model, and not with regard to the other possible contexts in which this model actually fails.

Variety of System Faults and Failures

There are many different ways in which software may fail, for example:

A requirement was omitted
e.g. the client loses a master file and complains that the program should have maintained backups of important files automatically. This is a serious “failure” from the client’s point of view but we cannot blame the software product for the failure because there was no requirement for automatic backups.
The requirements were unimplementable
e.g. the client complains that the system is so busy with housekeeping that it never gets any work done. Investigation shows that both hardware and software are performing correctly, but the hardware configuration is inadequate for the client’s needs.
The specification was incorrect or incomplete
e.g. a student logs in to the University’s Student Information System and requests her grade for a non-existent course; the system crashes because the specification stated that a course number consisted of four letters and three digits and defined no further validation.
There was an error in the high-level design
e.g. the component for remote access must be linked to the client database in order to validate customers who access the system by telephone. Since the link was omitted in the high-level design, the people working on the detailed design did not notice the need for validation.
There was an error in the detailed design
e.g. the specification required unique keys for a database but the design did not enforce uniqueness. The program was implemented according to the design but does not meet its specification.
There was an error in the code
e.g. the design states that no action is to be performed if a certain pointer is NULL. The programmer omitted the test for NULL and the program crashed when it attempted to dereference the pointer.

Terminology

Failure

In each of the examples above, a fault causes a failure, and a symptom is an observable behavior of the system that enables us to detect a failure. The process of discovering what caused a failure is called fault identification. The process of ensuring that the failure does not happen again is called fault correction or fault removal. Software testing, in practice, is about identifying a possible system failure and designing a test case that demonstrates that this particular failure is not experienced by the software.

More generally, a failure is detected through an observable symptom that indicates a certain lack of quality in the software. From that more general definition, we see that a failure is not necessarily observed through obvious symptoms such as a system crash or a miscalculation; it may appear through more subtle symptoms such as a slight lack of efficiency or an overconsumption of computer resources. Moreover, certain symptoms and failures will occur only in very specific and unexpected situations. This leads to an explosion of possible failures, since we would have to imagine all the expected and unexpected situations in which the system can be placed in reality. Most often, computer systems are tested only in expected situations, and only with regard to their inevitably incomplete specifications. That is often sufficient, but only for systems of low criticality. On top of this, symptoms have to be observable to be detected. A symptom that goes unobserved, for lack of proper tools or techniques to observe it, leads to an apparent absence of the symptom and thus to a failure to detect the failure. Sometimes a combination of failures will also mask the observability of a particular failure, e.g. a huge memory leak might make the program crash, preventing another failure from happening. All these factors explain the extreme complexity and effort required to properly test an application of reasonable size and complexity.

In fact, different failures have different levels of criticality. Some failures have disastrous or otherwise totally unacceptable effects, such as an airplane's control software crashing during take-off, resulting in the death of hundreds of people. Other failures have milder and more acceptable adverse effects, such as the flight statistics report being wrongly generated on the same airplane. Failures of higher criticality should evidently be the subject of greater avoidance efforts through system and code robustness, as well as greater testing effort to make sure that the failure cannot happen in any expected or unexpected situation.

Fault

Bug is a popular term for fault. There is a story to the effect that the first hardware failure was caused by a bug that was trapped between relay contacts in an early computer. This story is almost certainly mythical, because engineers were using the word "bug" before any computers had been built. But the usage is important: the word “bug” suggests that faults wander into the code from some external source and that they are therefore random occurrences beyond the programmer’s control. This is not possible for software: if your code has faults, it is because you put them there.

As presented in the introduction above, note that faults are not all located in the program code. For example, a missing requirement is not a coding fault but a requirements fault. In fact, faults can be present in any artifact being built as part of the software development process. That is why it is of great importance to verify the quality of each and every artifact built throughout the entire process.

Note also that hardware faults are sometimes an important and overlooked aspect of computer systems quality assurance and testing. An airplane crashing because of a failure of the disk drive of its computer system is a possibility to consider. Solutions to such possibilities often come through hardware duplication. Many critical systems are in fact totally duplicated, software and hardware included. A primary system is in control, while the backup system remains operational and able to take control at any time. Such systems are meant to be highly fault tolerant.

Validation and Verification

Validation
Testing activities whose goal is to assess the conformance of the final product to the user’s needs. It answers the question: ”Are we building the right product?” Validation can be applied to any artifact being built as part of the development process. From this more general perspective, validation is about verifying that a given artifact is consistent with its expected goals or premises, as defined by the artifacts that are used as input to the activity that is creating this artifact.
Verification 
Testing activities whose goal is to ensure that the software correctly implements specific functions. It answers the question: ”Are we building the product right?” Verification is most often about verifying that a valid program also meets intrinsic qualities that were not necessarily stated as part of the requirements, e.g. any valid functionality of an application is unacceptable if it often makes the program crash, takes too much time to execute, or unduly uses computer resources. Verification can also be applied to any artifact being built as part of the development process. In this case, intrinsic qualities of the artifact are inspected and verified, e.g. verifying that a design is effectively implementable is something that is not stated as a requirement, but is a necessary quality of a design artifact.
Static [verification | validation]
Testing activities that are undertaken without running the system. Includes documents and code reviews, compiling, and cross-reference checking.
Dynamic [verification | validation]
Testing activities that are undertaken by experimenting with the system at run time, to assess if it behaves and performs as expected.

Coding Errors

As seen in the previous section, coding errors are only one of the many types of faults that can be found during validation and verification. Yet, there are many kinds of faults that can exist in code; some of them are listed below:

Syntax Errors
Most syntax errors are detected by the compiler, but a few are not. (Some people would argue that all syntax errors are caught by the compiler by definition, and that a compiled program therefore cannot contain syntax errors.) However, a declaration such as char* p, q; is best classified as a syntax error (if q is meant to be a pointer) but the compiler does not report it.
Algorithmic Errors
A method does not execute according to its specification because there is an error in the coding logic. Some common causes are: incorrect initialization; “off by one” error (example: starting at 1 rather than 0, stopping at N rather than N − 1); incorrect condition (example: forgetting de Morgan’s laws); misspelling a variable name; type errors that remain undetected in the presence of implicit conversion.
Precision Errors
These arise e.g. when a programmer uses an int where a long was needed, or a float where a double was needed. Precision errors can also occur as the result of implicit conversions. A common example is expecting 2/3 to evaluate to 0.666667 rather than 0 (note that the operands might be variables rather than integer literals). Both this and the char* declaration pitfall mentioned above are illustrated in the sketch that follows this list.
Documentation Errors
The comments and accompanying code are inconsistent with one another. If the code is correct, there is not actually a fault. However, there is a strong likelihood of a fault being introduced later by a programmer who believes the documentation rather than the code.
Stress Errors
These are faults that occur when the system is stretched beyond its designed limits. For example, a simple operating system that had an array of size N for active users might fail when user N +1 tries to log in. Two kinds of failure are possible: if the new user is refused but the system continues to run, it is a soft failure; if the system crashes, it is a hard failure.
Capacity Errors
A word processor is required to handle documents up to 2 Gb in size but its performance is unacceptably slow for documents over 1 Mb.
Timing Errors
A system with multiple processes or threads fails occasionally because two events occur in the wrong order or simultaneously. Errors of this kind are very hard to diagnose.
Performance Errors
The requirements state that the system must respond in less than 1 second but in fact the response to some kinds of query is more than 10 seconds.
Recovery Errors
Some systems need to be “fault tolerant”. That is, they are supposed to detect certain kinds of failure and recover gracefully. A failure to recover from an anticipated failure is a “recovery error”.
Hardware Errors
Hardware errors are usually, but not necessarily, beyond the programmer’s control. However, safety-critical systems may anticipate hardware failures and, when such a failure occurs, handle it by switching to a stand-by system, for example.
Error of Omission
Something which should be present is not. For example, a variable has not been initialized.
Error of Commission
Something that should not be present is. For example, an exception handler for a low-level failure is included when handling should have been delegated to a higher level component of the system.
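
As a small illustration of the syntax and precision categories above, the following C++ fragment compiles cleanly yet contains both the char* declaration pitfall and the integer-division pitfall (this is a sketch; the variable names are invented for the example):

    #include <iostream>

    int main() {
        // Syntax pitfall: only p is a pointer; q is a plain char,
        // even though the declaration "looks like" two pointers.
        char* p, q;
        p = nullptr;        // fine
        // q = nullptr;     // would not compile: q is a char, not a char*
        (void)q;            // silence the unused-variable warning

        // Precision pitfall: both operands are ints, so the division is
        // integer division and yields 0, which is then converted to 0.0.
        int two = 2, three = 3;
        double ratio = two / three;
        std::cout << ratio << "\n";         // prints 0, not 0.666667

        // One possible fix: force a floating-point division.
        double fixedRatio = static_cast<double>(two) / three;
        std::cout << fixedRatio << "\n";    // prints 0.666667
        return 0;
    }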

Testing Teams

When you test your own code, or when you review a document, it is tempting to be ”friendly” to it. After all, you don’t want your nice, new program to crash and, partly subconsciously, you will avoid giving it test data that might make it crash. This is the wrong attitude. A tester or document verifier should be as nasty as possible. The goal of verification and validation is to make the software fail and to find errors in the verified documents.

It follows that the authors of the tested artifacts are not the best people to test their own artifacts. It is better to assemble a separate testing team of people who have no interest in the documents and code and who get rewarded for smashing them.

For example, programmers don’t like people who smash their code. It is important to develop psychological habits that avoid ”blaming” developers for errors in their code. The viewpoint that everyone should adopt is: ”we all want the system to work as well as possible and we will all do what we can to remove errors from it”. This attitude is sometimes called (following Gerald Weinberg) egoless programming.

Varieties of Testing

There are various ways to detect faults: reviews are important but, at some stage, it is usually necessary to resort to testing. A simple definition of testing: execute the system (or one of its components) and compare the results with anticipated results. The ”anticipated results” are the results predicted by the requirements and specifications if we are testing the whole system. It may be harder to pin down the anticipated results when we are testing components or individual units but, in general, they should be predicted by the design document. There are various kinds of testing; not all of the kinds listed below would be needed in small or medium scale projects. Testing activities can be classified either by their goals or their phase:

Goal-Driven Testing

Requirements-driven testing
Develop a test-case matrix (requirements vs. tests) to ensure that each requirement undergoes at least one test.
Structure-driven testing
Construct tests to cover as much of the logical structure of the program as possible.
Statistics-driven testing
These tests are run to convince the client that the software is working in most of the operational situations. Results are heavily dependent on the validity of the statistical input.
Risk-driven testing
These tests check the ”worst case scenarios” and boundary conditions on critical components or features of the system.

Phase-Driven Testing

Unit Testing
(Also called module testing or component testing.) We test a component of the system in isolation. The component will expect inputs from other system components and will generate outputs that are sent to other components. For unit testing, we must construct an environment for the component that provides suitable inputs and records the generated outputs. This kind of testing is done by the developers as they produce new modules.
Integration Testing
We test all of the components of the system together to ensure that the system as a whole “works” (that is, compiles and doesn’t crash too often). Components are integrated in a specific order and, ideally, tests are applied as each component is integrated into the whole system. This kind of testing is done by the developers, as modules are developed and once their unit testing is properly done.
Function Testing
Assuming that the system ”works”, we test its functions in detail, using the requirements and/or conceptual design as a basis for test design. Function testing is carried out by the testing team.
Performance Testing
We measure the system’s speed, memory and disk usage, etc., to confirm that it meets performance requirements. Performance testing is carried out by the testing team.
Acceptance Testing
The previous tests are usually performed by the suppliers. When the suppliers believe the software is ready for delivery, they invite the clients in to confirm that they accept the software. This kind of testing is done at the developer’s site, where a team of testers from the client’s side is invited to test the system in a controlled environment. This is often called alpha testing.
Installation Testing
When the clients have accepted the software, it is installed at their site (or sites) and tested again. The setup of the company’s computers is extremely likely to be different from the setup of the machine on which you developed the software. For example, the compiler, the processor, the network connection setup, etc. might be different. Installation testing is there to make sure that the software still behaves in the appropriate manner on the client’s site, in real-life operation mode.

Designing Test Cases

Designing appropriate test cases is one of the most challenging aspects of software development. However, software engineers often treat testing as an afterthought, developing test cases that may ”feel right”, but that have very little assurance of being complete. Recalling the objectives of testing, we must design test cases that have the highest likelihood of finding the most errors within a minimum amount of time and effort. A test has two parts:

  • A procedure for executing the test. This may include instructions for getting the system into a particular state, input data, etc.
  • An expected result, or permitted range of results.

The ideal situation would be that we have one test for each requirement in the SRD. It is very unlikely that this ideal will be met totally. The goal, however, is to ”cover” the SRD as completely as possible. Suppose that the SRD contains the following extremely simplistic requirement:

When the user enters X and Y, the program displays X + Y. 

Suppose X and Y are 32-bit numbers. This requirement calls for 2^64 tests. Moreover, if one wants to test values outside of the valid range, there is potentially an infinite number of test cases to be applied. In fact, virtually any function can be the subject of an infinite number of test cases if it is to be tested completely. In practice, of course, we assume some form of continuity. If the program adds a few numbers correctly, and it works at the boundaries, we can reasonably assume that it adds all numbers correctly.

The main question is: how do we design a minimal set of test cases that will have the highest likelihood of finding errors? Or, more specifically, on what criteria do we base the nature of the data given as input to the system for each test case? The answer is either:

  • from our knowledge of how the system and its components are supposed to react, or
  • from our knowledge of the algorithms and code implemented for all the functions of the system.

These respectively correspond to black box (or behavioral) testing and white box testing.

Black Box Testing

Definition: Knowing the specified functions that each component of the system was designed to perform, tests are conducted that demonstrate that each function is fully operational while at the same time searching for errors in each function.

Black box tests are conducted knowing only the interface of a function (or the member functions of a class, in the case of object-oriented programming). Each aspect of the interface, i.e. each parameter of each function, is tested within, at the borders of, and outside the limits of applicability of the function. We also know what the expected behavior of the function is according to its input. Formally speaking, we know that a function F is a mapping:

F : I1 × I2 × ... × In → O 

where I1...In are the input sets, or ranges of the parameters of the function, and O is the image of the function. Having such a definition permits us to map any valid or invalid input to the expected result. Note that critical systems require the development of test cases on formal grounds, which means that a formal definition of every critical function in the system must be provided and test cases derived according to this formal definition. Of course, we might not have such a formal definition for each function in the system, but we should certainly have at least a very good idea of the expected behavior of each function. That should be sufficient for reasonable testing.

Example: Suppose that the specifications for a certain data processing product state that five types of commission and seven types of discount must be incorporated. Testing every possible combination of just commission and discount requires 35 test cases. It is no use saying that commission and discount are computed in two entirely separate modules and hence may be tested independently: in black-box testing, the product is treated as a black box, and its internal structure is therefore completely irrelevant.
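
As a sketch of what exhaustively testing that example in a black-box fashion might look like, the loop below exercises all 35 combinations. The names computePrice and expectedPrice are hypothetical: the first stands for the product under test, the second for an oracle derived from the specification (both bodies here are placeholders so the sketch runs):

    #include <cassert>

    // Hypothetical product under test (black box): we know only its interface.
    // The body is a placeholder standing in for the real implementation.
    double computePrice(double base, int commissionType, int discountType) {
        return base * (1.0 + 0.01 * commissionType) * (1.0 - 0.01 * discountType);
    }

    // Hypothetical oracle derived from the specification (placeholder as well).
    double expectedPrice(double base, int commissionType, int discountType) {
        return base * (1.0 + 0.01 * commissionType) * (1.0 - 0.01 * discountType);
    }

    int main() {
        const double base = 100.0;            // arbitrary representative input
        for (int c = 1; c <= 5; ++c) {        // 5 types of commission
            for (int d = 1; d <= 7; ++d) {    // 7 types of discount
                // 5 x 7 = 35 black-box test cases, one per combination.
                // (A real test would compare within a small tolerance.)
                assert(computePrice(base, c, d) == expectedPrice(base, c, d));
            }
        }
        return 0;
    }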

Because black-box testing purposely disregards control structure, attention is focused on the information domain. Tests are designed to answer questions like:

  • How is functional validity tested?
  • How is system behavior and performance tested?
  • What classes of inputs will make good test cases?
  • Is the system particularly sensitive to certain input values?
  • How are the boundaries of data classes isolated?
  • What data rates and data volume can the system tolerate?
  • What effect will specific combinations of data have on the system operation?

Equivalence Partitioning

To limit the number of test cases required to test a function using the black-box approach, equivalence partitioning of the input space can be used. Equivalence partitioning is about partitioning the domain of inputs into disjoint sets such that inputs in the same set exhibit similar, equivalent or identical properties with respect to the test being performed. There are normally at least two equivalence classes:

Valid equivalence class: Class of input that satisfies the input conditions

Invalid equivalence class: Class of input that violates the input conditions

Of course, there might be several valid or invalid equivalence classes.

Example 1: Function which takes a 5-digit number n as input:

class                   set                       test case data
set 1 (invalid class)   { n < 10000 }             127
set 2 (valid class)     { 10000 ≤ n ≤ 99999 }     18745
set 3 (invalid class)   { n > 99999 }             341236
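
A minimal test sketch for Example 1, assuming a hypothetical predicate isValidFiveDigit as the function under test, with one representative value drawn from each equivalence class:

    #include <cassert>

    // Hypothetical function under test: accepts only 5-digit numbers.
    // The body is a placeholder so that the sketch runs.
    bool isValidFiveDigit(long n) {
        return n >= 10000 && n <= 99999;
    }

    int main() {
        assert(!isValidFiveDigit(127));      // set 1: invalid class, n < 10000
        assert( isValidFiveDigit(18745));    // set 2: valid class
        assert(!isValidFiveDigit(341236));   // set 3: invalid class, n > 99999
        return 0;
    }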

Example 2: Function which takes a pointer p to a linked list as input. The length l of the list ranges from 1 to 100:

class                   set                test case data
set 1 (invalid class)   { empty list }     p = NULL
set 2 (valid class)     { 1 ≤ l ≤ 100 }    l = 15
set 3 (invalid class)   { l > 100 }        l = 125

Example 3: Function which inputs the owners of a car, where a car has one through six owners:

class                   set                      test case data
set 1 (invalid class)   { no owners }            #owners = 0
set 2 (valid class)     { 1 ≤ owners ≤ 6 }       #owners = 3
set 3 (invalid class)   { more than 6 owners }   #owners = 10

Example 4: A program takes as input a dimension statement specifying the name, dimensionality, and index range in each dimension, in the following format:

int arrayname[d1, d2, ..., dn] 

where arrayname is the symbolic name of the array. The symbolic name may contain 1 to 6 characters, where the first character must be a letter. [d1, d2, ..., dn] specifies the size and index range for each dimension of the array, where 1 ≤ n ≤ 6, and each d is of the form [lb..ub], with 0 ≤ lb ≤ 65235, 0 ≤ ub ≤ 65235, and lb ≤ ub. If lb is not specified, then the assumption is that lb = 1:

input condition                valid class(es)                   invalid class(es)
length l of arrayname          1 ≤ l ≤ 6 (1)                     l = 0 (2); l > 6 (3)
characters of arrayname        has letters (4); has digits (5)   has something else (6)
first character is a letter    true (7)                          false (8)
n (number of dimensions)       0 < n < 7 (9)                     n = 0 (10); n > 6 (11)
upper bound range              0 ≤ ub ≤ 65235 (12)               ub < 0 (13); ub > 65235 (14)
lower bound range              0 ≤ lb ≤ 65235 (15)               lb < 0 (16); lb > 65235 (17)
lb ≤ ub                        lb = ub (18); lb < ub (19)        lb > ub (20)
lb is specified                true (21)                         false (22)

Corresponding Test Cases:

# test case covered equivalence class(es)
1 int A[1..6] (1),(4),(7),(9),(12),(15),(18),(21)
2 int A1[1..5] (5)
3 int A1[5..1] (19)
4 int a[] (10)
5 int abcdefg[1..5] (3)
6 int ab%[1..5] (6)
7 int 1ab [1..5] (8)
8 int [1..5] (2)
9 int a[1,2,3,4,5,6,7] (11)
10 int a[70000] (14)
11 int a[-1] (13),(20)
12 int a[-1.. 20000] (16)
13 int a[70000..9000] (17)

Boundary Value Analysis

Because of the particularities of the algorithms implemented in the functions, it is more probable that errors will happen at the boundaries of the input domain of a function than in its ”center”. It is for this reason that boundary value analysis (BVA) has been developed as a testing technique. Boundary value analysis complements equivalence partitioning by aiming at the selection of test cases that exercise the boundary values of the partitions. Here are guidelines for BVA; a short test sketch follows the list:

  • If an input condition specifies a range bounded by values a and b, test cases should be designed with values a and b, and with values just above and just below a and b.
  • If an input condition specifies a number of values in input, test cases should be developed that exercise the minimum and maximum number of inputs. Values just above and just below minimum and maximum are also tested.
  • Apply guidelines 1 and 2 to output conditions. For example, assume that a temperature vs. pressure table is required as output from an engineering analysis program. Test cases should be designed to create an output report that produces the maximum (and minimum) allowable number of table entries.
  • If internal data structures have prescribed boundaries (e.g. an array has a defined limit of 100 entries), be certain to design a test case to exercise the data structure at its boundaries.
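
Here is the promised sketch of the first guideline, reusing the hypothetical isValidFiveDigit predicate from Example 1 above (boundaries 10000 and 99999); each boundary is tested together with the values just below and just above it:

    #include <cassert>

    // Same hypothetical predicate as in the equivalence partitioning sketch.
    bool isValidFiveDigit(long n) {
        return n >= 10000 && n <= 99999;
    }

    int main() {
        // Boundary value analysis on the range [10000, 99999].
        assert(!isValidFiveDigit(9999));     // just below the lower boundary
        assert( isValidFiveDigit(10000));    // lower boundary
        assert( isValidFiveDigit(10001));    // just above the lower boundary
        assert( isValidFiveDigit(99998));    // just below the upper boundary
        assert( isValidFiveDigit(99999));    // upper boundary
        assert(!isValidFiveDigit(100000));   // just above the upper boundary
        return 0;
    }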

White Box Testing

Definition: Knowing the internal structure and workings of a program, tests are conducted to ensure that ”all gears mesh”, i.e. internal operations are performed according to specifications and all internal components have been adequately exercised.

White box testing is predicated on close examination of procedural detail. Logical paths through software are tested by providing test cases that exercise specific sets of conditions and/or loops. As with black-box testing, for large programs it is impossible to proceed with complete testing. Also as with black-box testing, there are various techniques that can be used to identify a number of important logical paths to test. Using white-box testing methods, it is possible to derive test cases that:

  • guarantee that all independent paths within a module have been exercised at least once;
  • exercise all logical decisions on their true and false sides;
  • execute all loops at their boundaries and within their operational bounds;
  • exercise internal data structures to ensure their validity.

Basis Path Testing

Basis path testing is a white-box testing technique first proposed in 1976 by Tom McCabe. The basis path method enables the test case designer to derive a logical complexity measure (the cyclomatic complexity) of a procedural design and use this measure as a guide for defining a basis set of execution paths.

When used in the context of the basis path testing method, the value computed for the cyclomatic complexity defines the number of independent paths in the basis set of a program and provides us with an upper bound for the number of tests that must be conducted to ensure that all statements have been executed at least once. Cyclomatic complexity has a foundation in graph theory and provides us with an extremely useful software metric. It can be computed in one of three ways (a small computational sketch follows the list):

  1. The number of regions bounded by edges and nodes in the graph corresponds to the cyclomatic complexity. Note that the area outside the graph is counted as a region, i.e. the minimal value for cyclomatic complexity is 1.
  2. Cyclomatic complexity, V(G), for a graph G is defined as: V(G) = E − N + 2 where E is the number of edges, and N is the number of nodes in the graph.
  3. Cyclomatic complexity, V(G), for a graph G is also defined as: V(G) = P + 1 where P is the number of predicate nodes in the graph. Predicate nodes correspond to nodes where a decision is taken according to a condition; they are the nodes that have more than one outgoing edge.
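
As a small computational sketch of the second and third formulas, using the counts that appear in the example of the next subsection (17 edges, 13 nodes, 5 predicate nodes):

    #include <iostream>

    // V(G) = E - N + 2, from the numbers of edges and nodes.
    int cyclomaticFromGraph(int edges, int nodes) {
        return edges - nodes + 2;
    }

    // V(G) = P + 1, from the number of predicate nodes.
    int cyclomaticFromPredicates(int predicateNodes) {
        return predicateNodes + 1;
    }

    int main() {
        // Counts taken from the 'average' example below.
        std::cout << cyclomaticFromGraph(17, 13) << "\n";     // prints 6
        std::cout << cyclomaticFromPredicates(5) << "\n";     // prints 6
        return 0;
    }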

The following procedure can be applied to derive the basis set using the cyclomatic complexity:

  1. Using the design or code as a foundation, draw a corresponding flow graph.
  2. Determine the cyclomatic complexity of the resultant flow graph.
  3. Determine the basis set of linearly independent paths.
  4. Prepare test cases that will force the execution of each path in the basis set.

The procedure for deriving the flow graph and even determining a set of basis paths is amenable to automation. To develop a software tool that assists in basis path testing, a data structure called a graph matrix can be used. Beizer (1990) provides a thorough treatment of mathematical algorithms that can be applied to graph matrices. Using these techniques, the analysis required to design test cases can be partially or fully automated.

Example:

The procedure average, originally depicted in PDL in a figure that is not reproduced here, will be used as an example to illustrate each step in the test case design method (a C++ sketch of the procedure is given at the end of this example). Note that this procedure, although a very simple algorithm, contains compound conditions and loops. The following is a description of the procedure used to derive test cases for the procedure average:

1. Using the design or code as a foundation, draw a corresponding flow graph. The corresponding flow graph (not reproduced here) contains 13 nodes, 17 edges, and 5 predicate nodes. Note that compound conditions have been separated into unitary conditions.

2. Determine the cyclomatic complexity of the resultant flow graph. As an example, the cyclomatic complexity is calculated according to the three methods shown above: V(G) = 6 regions = 6; V(G) = 17 edges − 13 nodes + 2 = 6; V(G) = 5 predicate nodes + 1 = 6.

3. Determine the basis set of linearly independent paths. The cyclomatic complexity prescribes at least 6 different paths to be explored. The following set of paths covers all branches of the graph:

path 1 : 1-2-10-11-13 
path 2 : 1-2-10-12-13 
path 3 : 1-2-3-10-11-13 
path 4 : 1-2-3-4-5-8-9-2-... 
path 5 : 1-2-3-4-5-6-8-9-2-... 
path 6 : 1-2-3-4-5-6-7-8-9-2-... 

4. Prepare test cases that will force the execution of each path in the basis set. Test cases that satisfy the basis set just described are:

Path 1 test case: 
   value(k) = valid input, where k < i for 2 ≤ i ≤ 100 
   value(i) = -999 where 2 ≤ i ≤ 100 
   Expected result : Correct average based on k values and proper totals. 
Path 2 test case: 
   value(1) = -999 
   Expected result : Average = -999; other totals at initial values. 
Path 3 test case: 
   Attempt to process 101 or more values, where the first 100 values are valid. 
   Expected result : Same as test case 1.
Path 4 test case: 
   value(i) = valid input where i < 100. value(k) < minimum where k < i. 
   Expected result : Correct average based on k values and proper totals. 
Path 5 test case: 
   value(i) = valid input where i < 100. value(k) > maximum where k ≤ i. 
   Expected result : Correct average based on n values and proper totals. 
Path 6 test case: 
   value(i) = valid input where i < 100. 
   Expected result : Correct average based on n values and proper totals. 

Each test is executed and compared to the expected result. Once all test cases have been completed, the tester can be sure that all program statements have been executed at least once. It is important to note that some independent paths (e.g. path 1 in the above example) cannot be tested in stand-alone fashion. That is, the combinations of data required to traverse the path cannot be achieved in the normal flow of the program. In such cases, these paths are tested as part of another path test.
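
Since the PDL figure for average is not reproduced here, the following C++ sketch gives one plausible rendering of the procedure, consistent with the test cases above. The sentinel value -999 and the bound of 100 values are taken from the example; everything else (names, parameter types) is an assumption made for the sketch:

    #include <cassert>

    // One plausible rendering of 'average' (a sketch, not the original PDL):
    // it reads at most 100 values, stops early at the sentinel -999, and
    // averages only the values lying between minimum and maximum.
    double average(const double value[], double minimum, double maximum,
                   int& totalInput, int& totalValid) {
        int i = 0;
        double sum = 0.0;
        totalInput = 0;
        totalValid = 0;
        // Loop with a compound exit condition (bound checked first for safety).
        while (totalInput < 100 && value[i] != -999) {
            ++totalInput;
            // Compound condition: only in-range values contribute to the average.
            if (value[i] >= minimum && value[i] <= maximum) {
                ++totalValid;
                sum += value[i];
            }
            ++i;
        }
        if (totalValid > 0)
            return sum / totalValid;
        return -999;                          // no valid value was read (path 2)
    }

    int main() {
        // Path 2 test case: the very first value is the sentinel.
        double values[100] = { -999 };
        int totalInput = 0, totalValid = 0;
        assert(average(values, 0.0, 100.0, totalInput, totalValid) == -999);
        assert(totalInput == 0 && totalValid == 0);   // totals at initial values
        return 0;
    }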

Condition Testing

Conditional expressions are used in critical parts of programs: in conditional (if) statements, and as exit conditions for loops. Both of these constructs are among the more prolific statements in terms of fault occurrence, which is why it is normally cost-effective to test them. Condition testing is a test case design method that exercises the logical conditions contained in a program module.

A simple condition is a boolean variable or a relational expression, possibly preceded by the unary NOT operator. A relational expression is of the form E1 <relationalOperator> E2, where E1 and E2 are arithmetic expressions or simple conditions. A compound condition is composed of two or more condition components (simple conditions or relational expressions) separated by boolean operators. Here are guidelines for condition testing:

  • For any condition C, the true and false branches of C need to be executed at least once.
  • For any compound condition C, every component condition in C must be executed at least once, and, ultimately, each permutation of truth values for all component conditions must be exercised.
  • For any relational expression of the form E1 <relationalOperator> E2 that is part of C, three tests are required, making the value of E1 respectively greater than, equal to, and less than the value of E2.

Example:

Consider the three following conditions:

  • C1 : B1 || B2
  • C2 : B1 || (E1 = E2)
  • C3 : (E1 = E2) || (E3 > E4)

All the following cases must be exercised to fully test their correctness (a small driver sketch for C1 follows the tables):

C1 = B1 || B2:

C1   B1   B2
T    T    T
T    T    F
T    F    T
F    F    F

C2 = B1 || (E1 = E2):

C2   B1   E1 rel E2 (value)
T    T    =  (T)
T    T    >  (F)
T    T    <  (F)
T    F    =  (T)
F    F    >  (F)
F    F    <  (F)

C3 = (E1 = E2) || (E3 > E4):

C3   E1 rel E2 (value)   E3 rel E4 (value)
T    =  (T)              =  (F)
T    =  (T)              >  (T)
T    =  (T)              <  (F)
F    >  (F)              =  (F)
T    >  (F)              >  (T)
F    >  (F)              <  (F)
F    <  (F)              =  (F)
T    <  (F)              >  (T)
F    <  (F)              <  (F)
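
As a small driver sketch for the C1 table above, with the compound condition isolated in a helper so that each row can be asserted directly (the relational-expression guideline is illustrated at the end):

    #include <cassert>

    // The compound condition C1 = B1 || B2, isolated for testing.
    bool c1(bool b1, bool b2) {
        return b1 || b2;
    }

    int main() {
        // The four rows of the C1 table: expected value, B1, B2.
        assert(c1(true,  true)  == true);    // T  T  T
        assert(c1(true,  false) == true);    // T  T  F
        assert(c1(false, true)  == true);    // T  F  T
        assert(c1(false, false) == false);   // F  F  F

        // Relational-expression guideline: for E1 = E2, choose operands that
        // make E1 greater than, equal to, and less than E2.
        int e2 = 5;
        assert(!(6 == e2));   // E1 > E2
        assert( (5 == e2));   // E1 = E2
        assert(!(4 == e2));   // E1 < E2
        return 0;
    }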

Loop Testing

Loops are the cornerstone of the vast majority of algorithms implemented in software. And yet, we often pay them little heed while conducting software tests. Loop testing is a white-box testing technique that focuses exclusively on the validity of loop constructs. Four different classes of loops can be defined: simple loops, concatenated loops, nested loops, and unstructured loops.

Simple loops. The following set of tests can be applied to simple loops, where n is the maximum number of iterations through the loop (a small test sketch follows the list).

  1. Skip the loop entirely.
  2. Only one pass through the loop.
  3. Two passes through the loop.
  4. m passes through the loop, where m < n.
  5. n − 1, n, n + 1 passes through the loop.
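
Here is the promised sketch of these tests for a hypothetical function sumFirst that loops over the first k elements of a buffer, with a designed maximum of n = 10 iterations:

    #include <cassert>

    // Hypothetical function under test: sums the first k elements of data,
    // but is designed for at most n = 10 iterations (it clamps at 10).
    int sumFirst(const int* data, int k) {
        const int n = 10;                    // designed maximum number of iterations
        int sum = 0;
        for (int i = 0; i < k && i < n; ++i)
            sum += data[i];
        return sum;
    }

    int main() {
        int data[11] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        assert(sumFirst(data, 0)  == 0);     // 1. skip the loop entirely
        assert(sumFirst(data, 1)  == 1);     // 2. one pass through the loop
        assert(sumFirst(data, 2)  == 2);     // 3. two passes through the loop
        assert(sumFirst(data, 5)  == 5);     // 4. m passes, with m < n
        assert(sumFirst(data, 9)  == 9);     // 5. n - 1 passes
        assert(sumFirst(data, 10) == 10);    //    n passes
        assert(sumFirst(data, 11) == 10);    //    n + 1 requested: clamped at n
        return 0;
    }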

Nested loops. If we were to extend the test approach for simple loops to nested loops, the number of tests would grow exponentially as the level of nesting increases. This would result in an impractical number of tests. Beizer (1990) suggests an approach that limits the number of tests:

  1. Start at the innermost loop. Set all other loops to minimum values.
  2. Conduct simple loops tests for the innermost loop while holding the outer loops at their minimum iteration parameter values. Add other tests for out-of-range or excluded values using BVA and/or equivalence partitioning.
  3. Work outward, conducting tests for the next outer loop, but keeping all other outer loops at minimum values and other nested loops to ”typical” values.
  4. Continue until all nesting levels have been tested.

Concatenated loops. Concatenated loops can be tested using the approach defined for simple loops, if each loop is independent of the other. For any loop which is dependent on the counter of another loop, the approach proposed for nested loops is recommended.

Unstructured loops. Unstructured loops are loops that, for example, use break statements to enable multiple exit points. Whenever possible, this class of loop has to be redesigned so that it can be properly tested, especially when the loops are nested.

Integration Testing

When components have been tested, they can be incorporated into the growing system. Integration testing means testing a group of components together. There are various ways in which integration testing can be carried out:

Bottom-up Integration. First test the components Cn that don’t use any other components; then test the components Cn-1 that use the components Cn but no others; and so on. In bottom-up integration, the components that a component uses have already been tested and can be used in the test on the assumption that they work correctly. However, we will need to build drivers to call the components, because the code that will eventually call them has not yet been tested. Drivers are written as simply as possible but should realistically exercise the components under test.

Top-down Integration. First test the components C0 that are not called by any other components; then test the components C1 that are called by components C0; and so on. In top-down integration, the components that call the component under test can be used on the assumption that they work correctly. However, we will need to build stubs that simulate the components called by the component under test. Stubs are written as simply as possible: they respond to all reasonable requests but return very simple results.
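
To make the notions of driver and stub concrete, here is a minimal sketch for a hypothetical component validateCustomer that normally relies on a customer database component; all names are invented for the example:

    #include <cassert>
    #include <string>

    // Stub: stands in for the real customer database component, which has not
    // been integrated yet. It answers all reasonable requests very simply.
    bool customerExistsStub(const std::string& customerId) {
        return customerId == "C001";          // the only "known" customer
    }

    // Component under test: in the real system it would call the database
    // component; here it is wired to the stub.
    bool validateCustomer(const std::string& customerId) {
        return !customerId.empty() && customerExistsStub(customerId);
    }

    // Driver: stands in for the code that will eventually call the component,
    // exercising it with realistic inputs and checking the results.
    int main() {
        assert( validateCustomer("C001"));    // known customer accepted
        assert(!validateCustomer("C999"));    // unknown customer rejected
        assert(!validateCustomer(""));        // empty id rejected
        return 0;
    }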

Big-Bang Integration. First test the components individually, then link all the components together, click Run, and see what happens (usually the system crashes immediately). For systems of any complexity, big-bang integration is unlikely to be efficient. The problem is that, when failures occur, it is hard to figure out what caused them.

Sandwich Integration. This is a combination of top-down and bottom-up, the objective being to converge on a layer in the “middle” of the system for which higher and lower components have already been tested. The following advantages are claimed for sandwich testing:

  • Integration testing can start early in the testing process, because groups of components can be tested.
  • Top-level components (control) and bottom-level components (utilities) are both tested early.

It is interesting to note that we can perform neither pure bottom-up nor pure top-down integration testing if there are cycles in the dependency graph. If components C1, C2, . . . , Cn form a cycle, we are forced to test all of them simultaneously. This is a very good reason to avoid cycles in the component dependency graph! The key points for integration testing are:

  • Each component should be tested exactly once.
  • A single test should involve as few components as possible, preferably one.
  • A component should never be modified for the purpose of testing.
  • It will usually be necessary to write additional code (stubs, drivers, etc.) specifically for testing. It is a good idea to keep this code so that it can be used for testing improved versions of the system after release.

Test Plan

Testing should always proceed according to a predefined plan rather than in an ad hoc fashion. (Ad hoc testing produces odd hacks.) A test plan consists of a collection of tests and a strategy for applying them.

  • The plan should describe what happens if a test fails. Usually, the failure is noted and tests proceed. However, if the test is a particularly significant one, testing may have to be delayed until the fault is corrected. For example, if the system fails on startup, the testing team have to wait until the problem is fixed.
  • The plan should describe what happens when the faults revealed by tests have been corrected. A common strategy is called regression testing. Assume that the tests are numbered 1, 2, 3, . . . and that test n fails. After the fault has been corrected, regression testing requires that tests 1, 2, . . . , n must all be repeated. Regression testing takes a long time. If possible it should be automated (see below). Otherwise, a careful analysis should be performed to see if all of the previous tests must be repeated or whether some of them are, in some sense, independent of test n.
  • Test results must be carefully recorded.

The following are required to establish a workable test plan:

  • A testing process. A complete definition of how to proceed with testing must be documented and followed carefully. This includes techniques for reviews, test case generation and application, how to generate stubs and drivers, criteria for test completeness, and a set of templates for generating documentation for test plans and test results.
  • Traceability. Most test cases will be designed to test specific requirements, design, and implementation units. Having traceability between all of these will ease the generation of a minimal set of test cases.
  • Tested items. A list of all the requirements, features, specifications, design elements, and code units that are being tested.
  • Test input & expected results. For each tested item, we need to design a set of test cases that will properly verify that the item meets all the required quality and functionality standards, as represented by the results expected for each test case’s input data.
  • Test recording procedures. As it is likely that thousands of test cases will need to be executed, a test automation environment (see below) might have to be implemented. This includes all required stubs and drivers, and various scripts to run and evaluate the results of all tests in the right order.
  • Testing schedule. For each testing phase (all kinds of reviews, unit testing, integration testing, acceptance testing, etc.), a schedule has to be provided that describes in what order all the tests have to be performed, also providing time for the setup of the testing environment (e.g. creation of drivers and stubs for unit and integration testing, setup of a suitable working environment for acceptance testing, etc.).

Test Automation

If possible, testing should be automated (a minimal harness sketch follows the list below). This requires a program that can:

  • invoke the component under test with suitable test data;
  • observe the effect of the test;
  • decide whether to stop or continue, depending on the results for the test.
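
Here is the promised minimal sketch of such a harness, for a single hypothetical function add; a real harness would read its test data from files and log the results, but the overall structure is the same:

    #include <iostream>
    #include <vector>

    // Hypothetical component under test.
    int add(int x, int y) { return x + y; }

    struct TestCase { int x, y, expected; };

    int main() {
        // Suitable test data, each with its anticipated result.
        std::vector<TestCase> cases = {
            {0, 0, 0}, {1, 2, 3}, {-1, 1, 0}, {100, 200, 300}
        };
        int failures = 0;
        for (const TestCase& tc : cases) {
            int actual = add(tc.x, tc.y);     // invoke the component under test
            if (actual != tc.expected) {      // observe the effect of the test
                ++failures;                   // decide: record the failure and continue
                std::cout << "FAIL: add(" << tc.x << ", " << tc.y << ") = "
                          << actual << ", expected " << tc.expected << "\n";
            }
        }
        std::cout << failures << " failure(s)\n";
        return failures == 0 ? 0 : 1;
    }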

Automatic testing is fairly straightforward if we are dealing with functions that accept parameters and return results. It is more complicated if we are testing, for example, a window-based application. However, there are testing tools that will record a sequence of events (keystrokes, mouse movements, etc.) and “play them back” during testing. Although it is not strictly “testing”, software tools can be used to check code both statically and dynamically.

  • Static analyzers check that variables are initialized before they are used, that all statements can be reached, that components are linked in appropriate ways, etc. Compilers perform some static analysis, of course, but they tend to be better at within-component analysis than at between-component analysis.
  • Test case generators analyze the source code and generate tests based on it. These tools can ensure, for example, that there are tests for each branch of every conditional statement. Of course, if the code contains errors, a test case generator may generate inappropriate tests.
  • Dynamic analyzers make measurements while the code is running. They can collect statistics about the frequency of execution of various blocks and look for potential synchronization problems in concurrent processes.
  • Capture and replay tools record a sequence of user actions and can play them back later. User actions include keystrokes, mouse movements, and mouse clicks. These tools avoid the time-consuming and error-prone task of repeating a complicated series of actions whenever a test has to be run.

When do you stop testing?

The testing strategy must include a criterion for terminating testing. Ideally, we would terminate testing when there are no faults left. In practice, this would mean testing forever for most systems. We can make the following observations:

  • Assuming that testing is continuous, the rate of fault detection is likely to fall.
  • The accumulated cost of testing increases with time.
  • The value of the product probably decreases with time (for example, competitors get to the market).

At some time, there will be a break-even point where it is better to release the product with residual faults than it is to continue testing. Of course, it is not always easy to decide exactly when the break-even point has been reached!

System Testing

Until now, we have discussed program testing. Once the code is working, we can test the system as a whole. In some cases, the code forms the major part of the system and, when program testing is complete, there is not much more to do. In the majority of cases, however, the code component of a software engineering project is only one part of the total system. System testing has much in common with program testing and there is no need to dwell on the similarities. For example: faults in the system lead to failures; the failures must be identified and the corresponding faults corrected; after correcting the faults, it is advisable to run regression tests to ensure that new faults have not been introduced. Suppose that the product is flight-control software for a new kind of aircraft. The system is the aircraft as a whole. Here is a simplified version of a possible test plan:

Function Testing. The software is thoroughly tested in an environment that simulates the aircraft. That is, the software is provided with inputs close to those that it would receive in actual flight, and it must respond with appropriate control signals. When the function test results have been accepted, we have a functioning system.

Performance Testing. The so-called “non-functional requirements” of the software are thoroughly tested. Does it respond within acceptable time limits? Are there any circumstances in which available memory might be exhausted? Does it degrade gracefully when things go wrong? If there are duplicated subsystems, is control transferred between them when expected? Performance tests are listed in the textbook and will not be described in detail here. They include: stress, volume, configuration, compatibility, regression, security, timing, environment, quality, recovery, maintenance, documentation, and human factors (or usability) tests. After performance testing, we have a verified system.

Acceptance Testing. The tests are run for the clients, still on simulated equipment. Experienced pilots will “fly” the simulated aircraft and will practice unusual as well as standard manoeuvres. After acceptance testing, the clients agree that this is the system that they want. The system is said to be validated.

Installation Testing. The software is installed on an actual aircraft and tested again. These tests are conducted slowly and carefully because of the consequences of an error. Several days may be spent during which the plane travels at less than 100 km/h on the runway. Then there will be short “flights” during which the plane gets a few metres off the ground. Only when everyone is fully confident that flight can be controlled will the test pilot perform a true take-off and fly a short distance. If installation testing goes well, the system is put into use. The maintenance phase begins.

Unique Aspects of System Testing

There are some aspects of system testing that are significantly different from program testing. These are discussed briefly here.

Configuration. The software may have to run in many different configurations. In the worst case, each platform may be unique. Stu Feldman says that, in telecom software, it is not unusual to have several thousand makefiles for a system. All of these must be tested — on a common platform, if possible, but probably on the platform for which they are intended.

Versions. The word “version” is ambiguous because its informal meaning is that of later versions replacing earlier versions. But it is also used in the sense of “version for platform A”, “version for platform B”, etc. Here, we will assume that each “version” is intended for a particular platform and application.

Releases. The word “release” is the formal term that corresponds to the informal use of “version”. The software is issued in a series of numbered releases, each of which is intended to be an improvement on its predecessors. There is usually a distinction between major releases which incorporate significant enhancements and minor releases which correct minor problems. Releases are numbered so that “Release 4.3” would be the third revision of the fourth major release.

Production System. After the system has been released to clients, the developers must keep an identical copy so that they can check clients’ complaints. This copy is called the “production system”.

Development System. In addition to the production system, the developers should maintain a separate system into which they can incorporate corrections and improvements. At well-defined points, the “development system” is delivered to the clients, as a new release, and becomes the “production system”. At that time, a new development system is started.

Deltas. Software systems can become very large; maintaining complete copies of all of the files of many different versions can require enormous amounts of storage and — more importantly, since storage is cheap — involve significant copying time. A solution is to maintain one complete set of files for the entire system and, for each version, a set of files that define the differences particular to the version. The difference files are called “deltas”. Use of deltas must be supported by an effective tool to manage them, otherwise disaster ensues. There should be a simple and reliable way of combining the basic system files with a set of delta files to obtain a release. A problem with deltas is that, if the base system is lost, all releases are lost as well. The base system must therefore be regarded as a document of great value. The unix utility SCCS (Source Code Control System) uses deltas to maintain multiple versions of files.