Abstract
In this paper we consider computational aspects of decision-theoretic planning
modeled by Markov decision processes (MDPs). Commonly used algorithms, such as
value iteration (VI) [Bellman, 1957] and several versions of modified policy
iteration (MPI) [Puterman, 1994] (a modification of Howard's original policy
iteration (PI) [Howard, 1960]), are compared on a class of problems from the motion
planning domain. Policy iteration and its modifications are usually recommended
as algorithms demonstrating a better performance than value iteration
[Russell & Norvig, 1995], [Puterman, 1994]. However, our results show that their
performance is not always superior: it depends on the parameters of the problem
and on the parameters of the algorithms, such as the number of iterations in the value
determination procedure in MPI. Moreover, policy iteration applied to non-discounted
models without special restrictions might not even converge to an optimal policy, as
in the case of the policy iteration algorithm introduced in [Russell & Norvig, 1995]. We
also introduce a new stopping criterion for value iteration based on policy changes.
The combined value-policy iteration (CVPI) algorithm proposed in the paper implements
this criterion and generates an optimal policy faster than both the policy iteration and
value iteration algorithms.
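
The following is a minimal sketch, not the paper's CVPI algorithm itself, of the core idea behind a policy-change stopping criterion for value iteration: terminate when the greedy policy extracted from the value function stops changing between sweeps, instead of waiting for value updates to fall below a numerical threshold. The MDP representation (`P[s][a]` as a list of `(prob, next_state, reward)` triples) and all parameter names are illustrative assumptions.

```python
# Sketch: value iteration that stops once the greedy policy has stabilized.
# Assumes every state has the same number of actions; P, gamma, stable_sweeps
# are hypothetical names chosen for this example.
import numpy as np

def value_iteration_policy_stop(P, gamma=0.95, stable_sweeps=2, max_iter=10_000):
    n_states = len(P)
    n_actions = len(P[0])
    V = np.zeros(n_states)
    prev_policy = None
    unchanged = 0

    for _ in range(max_iter):
        # One Bellman backup: Q[s, a] = sum_{s'} p (r + gamma * V[s'])
        Q = np.array([
            [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
            for s in range(n_states)
        ])
        V = Q.max(axis=1)            # updated value function
        policy = Q.argmax(axis=1)    # greedy policy induced by the new values

        # Stop when the greedy policy is unchanged for several sweeps,
        # rather than when the value updates become numerically small.
        if prev_policy is not None and np.array_equal(policy, prev_policy):
            unchanged += 1
            if unchanged >= stable_sweeps:
                break
        else:
            unchanged = 0
        prev_policy = policy

    return V, policy
```

In practice this criterion can trigger well before a value-convergence test, since the greedy policy often settles while the value estimates are still changing.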