(Last modified Tue Jun 03 15:15 2008)
What is testing?
Testing is:
- Exercising a system or component
- On some predetermined input data,
in a predetermined context (environment, history, etc.)
- Capturing behavior and output data
- Checking its validity with a test oracle (human, script, computational)
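The four steps above can be sketched in a few lines of Python. This is only an illustration: the system under test (word_count), the helper run_test, and the oracle builder expected_value_oracle are all hypothetical names invented for this example.

```python
def word_count(text):
    """Hypothetical system under test: count whitespace-separated words."""
    return len(text.split())


def run_test(system, input_data, oracle):
    """Exercise the system on predetermined input, capture the output,
    and ask the oracle whether that output is acceptable."""
    output = system(input_data)        # exercise the system, capture output
    return oracle(input_data, output)  # check validity with the oracle


def expected_value_oracle(expected):
    """Build an oracle that accepts exactly one known good output."""
    return lambda _input, output: output == expected


verdict = run_test(word_count, "to be or not to be", expected_value_oracle(6))
print("pass" if verdict else "fail")  # prints "pass"
```

The context here is trivial (a pure function has no environment or history); for a real system, setting up the context is usually the larger part of the work.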
The goals of testing are, optimistically:
- To show that a system does what it should (positive testing)
- To show that a system doesn't do what it shouldn't (negative testing)
- To increase confidence in a system
- To reduce the risks associated with using a system
- To end up with a better system
The results of testing are, one hopes:
- Identifying inconsistencies between the system or component
and its specification, so that they may be remedied
(verification)
- Providing appropriate confidence that the system or component
meets its specification
(verification)
- Demonstrating that the system or component
meets its performance requirements,
functions under appropriate loads,
handles expected stresses
(specific cases of verification)
- Demonstrating that it satisfies the needs of its stakeholders
(validation)
The fundamental problems with testing are ...
- In general, there is no finite set of tests that is sufficient
(no matter how you define "sufficient").
- "Program testing can be used to show the presence of bugs,
but never to show their absence!" (Edsger Dijkstra)
Rather than always saying "system or component",
from here on I will just say "system"
but mean that it may be either an entire system or a component of a system
that is being discussed.
Similarly,
instead of "context or environment"
I will just say "context".
The basics
- Test case
-
A test case is a context and input data over a period of time
that will cause a system or component
to produce a result that can be compared against expectations.
- Test set or test suite
-
A test set or suite is a group of test cases.
- Test script
-
A sequence of actions that puts a system through a test case.
The actions may be done manually or automatically.
The script may be:
- a list of steps for a person to take
- a shell script that directly follows the actions a person would take
in running the tests manually from a command prompt
- an input to an interpreter such as perl or Ruby
that sends inputs to the system and checks its results
- a program written in an "ordinary" programming language such as Java
- a script from/for a tool
that records and replays keystrokes and mouse clicks,
for testing systems through a GUI.
Load testing is something of a special case;
for this,
automated support is needed to provide
a suitable collection of virtual users
and virtual events for the system to deal with.
- Oracle
-
A person or program that classifies the result of a
test case as either acceptable or unacceptable.
Usually we think of an oracle as being a program,
but often a person is the oracle for a test suite —
people are smarter than programs.
Oracles that are programs are preferable,
especially if the tests will need to be run many times.
A human oracle gets bored and careless,
and often tries to avoid re-running the tests.
An automated oracle may compare a result against
known good results (this is efficient but inflexible),
or may calculate whether a result is acceptable
(this is often inefficient but is flexible).
How does a system differ from an (automated) oracle used to test it?
- An oracle doesn't have to be efficient, whereas
a system almost always does.
An oracle only runs during testing,
whereas the system runs every time it's used,
so efficiency is much more important for the system.
- An oracle doesn't have to produce the right result;
it just has to identify it.
Often, it is easier to identify a correct result
than to produce it in the first place
(for example, in NP-complete problems).
- It's desirable that a system be written in such a way
that we can easily convince ourselves it works correctly.
It is essential that an oracle be written in that way,
as otherwise we can only convince ourselves by testing it thoroughly
(using an oracle ...),
and testing is not a good way to convince oneself a program is
absolutely correct.
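As a small illustration of an oracle that identifies a correct result without producing it, here is a sketch of an oracle for a sorting routine. It is an assumption-laden example, not a prescribed design: my_sort stands in for whatever hand-written sort is under test, and sort_oracle is a hypothetical name.

```python
from collections import Counter


def sort_oracle(input_list, output_list):
    """Accept the output iff it is in nondecreasing order and contains
    exactly the same items as the input. The oracle never sorts anything."""
    in_order = all(a <= b for a, b in zip(output_list, output_list[1:]))
    same_items = Counter(input_list) == Counter(output_list)
    return in_order and same_items


def my_sort(items):
    """Hypothetical system under test (stands in for a hand-written sort)."""
    return sorted(items)


data = [3, 1, 2, 1]
print(sort_oracle(data, my_sort(data)))
```

The oracle is short enough to be checked by inspection, which is the property asked for in the last point above.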
- Fault-error-failure
-
- Fault
-
The thing that's wrong with the code (or other development artifact)
is a fault.
A fault is there whether the program is running or not.
Also called a defect.
- Error
-
An error is an undesired program state.
The program state is embodied in
the values of variables or attributes,
the set of objects that have been constructed,
the contents of scratch files or other data repositories,
etc.
- Failure
-
A failure is
a behavior or output of a system
that is incorrect.
So in order to observe a failure:
- (Reachability) the location(s) containing the fault
must be reachable,
- (Infection) the program must enter an incorrect state
after executing the location,
and
- (Propagation) the program must propagate the "infected" state
to a location that produces an incorrect output as a result.
Ordinarily,
only failures are visible,
and testing is geared towards identifying failures.
Once you've found a failure,
you then have to work backwards:
usually identifying the error(s) that resulted in the failure,
and then the fault(s) that caused the error in the first place.
It is clear that
- not all faults result in errors,
and
- not all errors result in failures.
To pick an extreme case,
faults in unreachable code never cause errors.
There is a wide range of opinions on how frequently
faults lead to errors and errors lead to failures;
one estimate is that on average it is roughly 10%:
only about 10% of faults ever cause errors,
and only about 10% of errors ever cause failures.
(Other researchers' estimates are much higher,
as much as 90% or more.)
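A tiny example may help separate the three terms. The function below is hypothetical and written for this handout; its fault is marked in a comment.

```python
def count_zeros(values):
    """Intended to count the zeros in values.
    FAULT: the loop starts at index 1, so index 0 is never examined."""
    count = 0
    for i in range(1, len(values)):  # fault: should be range(len(values))
        if values[i] == 0:
            count += 1
    return count


# The faulty line is reached, but the skipped element is not a zero,
# so the incorrect state never shows up in the output: no failure.
print(count_zeros([2, 7, 0]))  # prints 1, which is correct

# Here the skipped element is a zero, the count is wrong from then on,
# and the wrong count is returned: a failure is observed.
print(count_zeros([0, 7, 2]))  # prints 0, but 1 was expected
```

A test suite containing only the first input would pass even though the fault is present and reached on every run.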
- Black-box testing
-
Testing against a specification,
without knowledge of how the system is implemented.
Also called specification-based testing.
Black-box testing
may allow greater test efficiency
(because it can direct testing effort
to the requirements stakeholders care most about),
and facilitates test reuse
since black-box tests don't depend on how the system is implemented
and can be reused unchanged if the implementation changes.
- White-box or glass-box testing
-
Testing based on partial or full knowledge of how
the system is implemented.
A common type is code-based testing,
in which test cases are selected
to cover the code in various ways.
White-box or glass-box testing
may allow greater test effectiveness
(because it can direct testing effort
based on the implementation,
and after all the implementation is where the faults lie).
- Effectiveness
-
A test case or suite is effective
to the extent that it identifies faults and gives confidence.
- Efficiency
-
A test case or suite is efficient to the extent that
it takes little time, money, or other resources.
- Exhaustive testing
-
Exhaustive testing
is the testing of every possible combination of contexts and inputs over time
that a system can have.
Except for the simplest systems,
this is impossible to do
because there are an infinite number of possible inputs and contexts
and we only have a finite time for testing.
Even for simple systems with finite input domains and contexts
(such as Dijkstra's example of multiplying two integers),
it is rarely practical to test exhaustively because
it takes so long.
Unfortunately,
exhaustive testing is the only kind that is guaranteed to show
that a system works as it should,
and to uncover all the system's faults.
Consequently, we know that
testing cannot show that a system works as it should
(except for the very simplest systems for which exhaustive testing is possible),
and
testing cannot guarantee that a system has no more faults.
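To get a feel for "it takes so long", here is the arithmetic for the integer multiplication example. The operand width (32 bits) and the test rate (one billion tests per second) are assumptions chosen for illustration, not figures from any source.

```python
BITS = 32                   # assumed operand width
TESTS_PER_SECOND = 10**9    # assumed (optimistic) test rate
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

total_cases = (2**BITS) ** 2  # every pair of operands
years = total_cases / (TESTS_PER_SECOND * SECONDS_PER_YEAR)
print(f"{total_cases} cases, about {years:.0f} years")
```

Under these assumptions there are 2^64 input pairs, and running them all takes roughly 585 years, even though the input domain is finite and the system is about as simple as systems get.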
- Selection criterion
-
Since we can't test everything,
we need a selection criterion
to help us choose the cases we will test.
A criterion C is:
- reliable if, for a particular system under test,
the test sets chosen by it
either all succeed (no failures found)
or all fail (one or more failures found);
that is,
it doesn't matter which test set meeting C you choose.
- valid if it could choose an (infinite) test set
that uncovers all possible failures.
It has been shown that there can be no algorithm to find
a reliable, valid test set for a system
(which is too bad, because that's exactly the kind of test set we want).
In practice, there are two main ways to select test cases:
- By recording real use of the system.
For an existing system, these are easy to collect
(one simply records what users do)
but many test cases are needed since
users tend to do the same things over and over
so that the test cases are highly repetitive.
Also,
real use tends not to cover erroneous contexts,
since they don't help users do anything useful.
- By creating synthetic test cases.
These are more difficult to produce,
since they must be manually created by a tester
(or in some cases generated by an algorithm).
However,
far fewer of them are needed
since the testers can avoid repetition,
and they can include all sorts of erroneous contexts.
The selection can be guided by
software tools that assess the degree to which
the cases meet the selection criteria
(especially for coverage criteria),
but in general
software tools can't select good test cases.
Creation of synthetic test cases produces cases
that are much more effective
and more efficient,
and is much more common in practice.
In practice,
no one method of selecting test cases has proved to be best.
Each method has its strengths and weaknesses.
Even random selection of test cases
has been shown to be competitively effective.
The best results are obtained
by selecting test cases using
more than one method.
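Random selection is easy to sketch when an automated oracle is available. The example below is a minimal illustration under stated assumptions: buggy_abs is a hypothetical system with a deliberately planted fault, abs_oracle and random_test are names invented here, and the input range and seed are arbitrary.

```python
import random


def buggy_abs(x):
    """Hypothetical system under test; FAULT: returns -1 for x == -1."""
    return x if x >= -1 else -x


def abs_oracle(x, result):
    """Accept the result iff it is nonnegative and equals x or -x."""
    return result >= 0 and result in (x, -x)


def random_test(system, oracle, trials, seed=0):
    """Run randomly selected inputs and return those the oracle rejects."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    failures = []
    for _ in range(trials):
        x = rng.randint(-10, 10)
        if not oracle(x, system(x)):
            failures.append(x)
    return failures


failing_inputs = random_test(buggy_abs, abs_oracle, trials=1000)
print(sorted(set(failing_inputs)))
```

Whether a random run hits the one failing input depends on the input range and the number of trials, which is one reason to combine random selection with other methods.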
- Test requirement
-
A test requirement
specifies a particular element of a system artifact
that must be satisfied or covered by some test case.
- Coverage criterion
-
A coverage criterion
is a selection criterion based on coverage
of the code, the design, the specification, or other artifact of the system.
A coverage criterion
imposes a set of test requirements.
A test set satisfies a coverage criterion
iff each of its test requirements
is satisfied by some test case in the set.
- Subsumption (of one criterion by another)
-
Criterion C subsumes
criterion c
iff every test set that satisfies C also satisfies c.
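The definitions of satisfaction and subsumption can be made concrete with sets. The sketch below uses made-up coverage data for a two-line program (an if statement followed by a return); the requirement labels s1..s3, b_true, b_false and the test case names are hypothetical. Note that subsumes_here only checks the test sets available in this one example, so it illustrates the definition rather than proving subsumption in general.

```python
from itertools import combinations

# Hypothetical coverage data for:  if x > 0: y = 1  ;  return y
COVERED_BY = {
    "x=5":  {"s1", "s2", "s3", "b_true"},
    "x=-5": {"s1", "s3", "b_false"},
}
STATEMENT = {"s1", "s2", "s3"}   # test requirements of statement coverage
BRANCH = {"b_true", "b_false"}   # test requirements of branch coverage


def satisfies(test_set, requirements):
    """True iff every requirement is covered by some test case in the set."""
    covered = set()
    for test_case in test_set:
        covered |= COVERED_BY[test_case]
    return requirements <= covered


def subsumes_here(strong, weak):
    """True iff every test set drawn from COVERED_BY that satisfies
    strong also satisfies weak."""
    cases = list(COVERED_BY)
    for size in range(len(cases) + 1):
        for test_set in combinations(cases, size):
            if satisfies(test_set, strong) and not satisfies(test_set, weak):
                return False
    return True


print(satisfies({"x=5"}, STATEMENT))     # True
print(satisfies({"x=5"}, BRANCH))        # False
print(subsumes_here(BRANCH, STATEMENT))  # True
print(subsumes_here(STATEMENT, BRANCH))  # False
```

The single test case x=5 covers every statement but only one branch, which is why statement coverage does not subsume branch coverage here.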
Levels of testing and the "V" diagram
- Acceptance testing
-
Testing of the entire system against the
stakeholder's requirements.
Compare system testing, which tests
that the parts of the system interact as specified by the design.
- Alpha testing
-
Some ordinary users try out the system at the
development site.
- Beta testing
-
Some ordinary users try out the system at their
own site(s).
- Integration testing
-
Testing of two or more parts of the system,
testing that their interactions are consistent with the system design;
assumes that the parts have already passed their individual tests.
Integration testing may be done at as many levels as is convenient;
lower level integration testing is sometimes called
component testing,
and the highest level of integration testing,
integrating the entire system,
is sometimes called system testing.
- Regression testing
-
Testing a new system with the tests
developed for an older version, to show that the new system has
the properties of the older one (behavior or reliability).
Regression test selection is its own specialized and important area.
Since it is often the case that
the new system mostly behaves like the previous version did,
it is highly advantageous to reuse as many test cases as possible,
which means one must identify which ones can be reused.
It also may be desirable,
if time is short,
to identify many of the unchanged test cases as skippable
in order to concentrate on the new behavior.
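One simple way to picture regression test selection is to record which parts of the system each test case exercises and rerun only the cases that touch something that changed. The sketch below assumes such a record exists and is complete; the function name, the test names, and the unit names are all hypothetical, and real selection techniques are considerably more careful than this.

```python
def select_regression_tests(touches, changed_units):
    """Split test cases into those to rerun (they exercise a changed unit)
    and those that may be skipped (they exercise only unchanged units)."""
    rerun, skippable = [], []
    for test, units in touches.items():
        if units & changed_units:
            rerun.append(test)
        else:
            skippable.append(test)
    return rerun, skippable


# Hypothetical record of which units each test case exercises.
touches = {
    "t_login": {"auth"},
    "t_report": {"report", "db"},
    "t_export": {"export"},
}
rerun, skippable = select_regression_tests(touches, changed_units={"db"})
print(rerun)      # ['t_report']
print(skippable)  # ['t_login', 't_export']
```

If the record of exercised units is incomplete, a skipped test case might have revealed a failure, so this kind of skipping trades some effectiveness for efficiency.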
- Unit testing
-
Individual testing of
each of the system's smallest units,
often by the developer who created it.
Stopping criteria
Regardless of how the test cases are selected,
one may choose individual test cases one after another
and continue testing until some criterion is met.
The criterion may be:
- A code-based coverage criterion.
We first discussed code-based criteria
in terms of selecting a test set.
However, the same criteria may be used as stopping criteria.
If one is used as a stopping criterion,
then a coverage monitoring tool is often used
to help direct the choices
and decide when enough coverage has been achieved.
Such a tool might show
(for example)
which statements had already been covered,
so that a tester can try to select a case that hits
one or more uncovered statements.
And the tool could show how many and what percent had been covered,
so testing could stop at (say) 95% coverage.
- Based on the rate of failure detection.
By keeping track of how many failures each new test case uncovers,
one might decide to stop when the rate of detection drops below
some threshold.
The threshold can be arbitrary (1 failure for 10 test cases)
or can be determined by a statistical model
(we will not discuss these).
- If the system was seeded with faults,
then based on how many seeded faults remain.
This requires that the system be seeded ahead of time with
some number of known faults.
It assumes that test cases will uncover both known faults and unknown faults,
and that the proportion of seeded faults remaining
gives an idea of how many unknown faults remain.
It requires that the seeding be done "realistically",
so that test cases discover unknown faults and seeded faults
in a roughly constant ratio.
Realistic seeding is difficult and time-consuming,
and automated approaches have not been entirely satisfactory.
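The reasoning behind the seeding criterion can be written as a small calculation. If seeded and unseeded faults are found in a constant ratio, then the fraction of seeded faults found so far estimates the fraction of all real faults found so far. The function below is a sketch of that proportional estimate under exactly this assumption; the function name and the numbers in the example are made up.

```python
def estimate_remaining_faults(seeded_total, seeded_found, unseeded_found):
    """Estimate how many real (unseeded) faults remain, assuming seeded and
    real faults are discovered in the same proportion."""
    if seeded_found == 0:
        raise ValueError("no seeded faults found yet; the ratio is undefined")
    estimated_total = unseeded_found * seeded_total / seeded_found
    return estimated_total - unseeded_found


# Made-up numbers: 20 faults seeded, 15 of them found, 30 real faults found.
print(estimate_remaining_faults(20, 15, 30))  # prints 10.0
```

In this example three quarters of the seeded faults have been found, so the 30 real faults found are taken to be three quarters of an estimated 40, leaving about 10. The estimate is only as good as the realism of the seeding.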
Testing concurrent systems
The number of possible states of a group of concurrent systems
is very high,
due to the large number of possible interleavings
of the actions of each system.
In addition,
it is usually difficult to set up a test case
that will cause a specific interleaving.
It is not uncommon for designers of concurrent systems
to depend more heavily on analysis using model checking
and other techniques, rather than depending mainly on testing.
Such issues are much more difficult
than those that ordinarily arise for single-process systems.
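The growth in interleavings is easy to quantify in a simplified model. Assume each system performs a fixed sequence of atomic actions and that any interleaving preserving each system's own order is possible; the number of interleavings is then a multinomial coefficient. The function name and the step counts below are chosen only for illustration.

```python
from math import comb


def interleavings(step_counts):
    """Number of ways to interleave sequences with the given lengths
    while preserving the internal order of each sequence."""
    total, placed = 1, 0
    for steps in step_counts:
        placed += steps
        total *= comb(placed, steps)  # choose slots for this sequence
    return total


print(interleavings([2, 2]))     # 6
print(interleavings([10, 10]))   # 184756
print(interleavings([3, 3, 3]))  # 1680
```

Two systems of only ten actions each already admit 184756 interleavings, and a test case normally exercises just one of them.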
In conclusion, some challenges
- Putting the system into the necessary initial state
in order to run a test case.
- Duplicating the necessary interleaving of concurrency
for multi-threaded systems.
- Deciding how many test cases are enough.
- Choosing the best test cases.
- Distinguishing correct results from failures.
Sources
- Adrion+Branstad+Cherniavsky1982-vvtc
- W. Richards Adrion, Martha A. Branstad, and John C. Cherniavsky. Validation, Verification, and Testing of Computer Software.
ACM Comput. Surv., 14(2):159-192, 1982.
http://dx.doi.org/10.1145/356876.356879
- Amman+Offutt2008-ist
- Paul Ammann and Jeff Offutt.
Introduction to Software Testing.
Cambridge University Press, 2008.
- Goodenough+Gerhart1975-tttd-tse
- John B. Goodenough and Susan L. Gerhart. Toward a Theory of Test Data Selection.
IEEE Transactions on Software Engineering, 1(2):156-173, June, 1975.
- Muccini 2002 slides
-
This handout began from
Dr. Henry Muccini's slides for ICS122, 2002 (used with permission).