Value Iteration and Policy Iteration Algorithms for Markov Decision Problems
Elena Pashenkova, Irina Rish (email@example.com) & Rina Dechter (firstname.lastname@example.org)
In this paper we consider computational aspects of decision-theoretic planning modeled by Markov decision processes (MDPs). Commonly used algorithms, such as value iteration (VI) [Bellman, 1957] and several versions of modified policy iteration (MPI) [Puterman, 1994] (a modification of Howard's original policy iteration (PI) [Howard, 1960]), are compared on a class of problems from the motion planning domain. Policy iteration and its modifications are usually recommended as algorithms that outperform value iteration [Russell & Norvig, 1995], [Puterman, 1994]. However, our results show that their performance is not always superior, and that it depends both on the parameters of the problem and on the parameters of the algorithms, such as the number of iterations in the value determination procedure of MPI. Moreover, policy iteration applied to non-discounted models without special restrictions might not even converge to an optimal policy, as is the case for the policy iteration algorithm presented in [Russell & Norvig, 1995]. We also introduce a new stopping criterion for value iteration based on policy changes. The combined value-policy iteration (CVPI) algorithm proposed in this paper implements this criterion and generates an optimal policy faster than both the policy iteration and value iteration algorithms.
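The policy-change stopping criterion mentioned above can be illustrated with a small sketch: run value-iteration backups, but stop when the greedy policy extracted from the current value function stops changing rather than when the values themselves converge. This is a simplified illustration under our own assumptions (MDP encoding, function and variable names, and a single-iteration stability check are ours, not the paper's notation).

```python
import numpy as np

def combined_value_policy_iteration(P, R, gamma=0.95, max_iters=1000):
    """Value iteration with a policy-change stopping criterion (sketch).

    P: transition tensor, P[a, s, t] = Pr(next state t | state s, action a)
    R: reward matrix, R[s, a]
    Returns the final value function and greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    policy = None
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        V = Q.max(axis=1)
        # Stop as soon as the greedy policy is unchanged, instead of
        # waiting for the value function to converge numerically.
        if policy is not None and np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return V, policy

# A toy 2-state, 2-action MDP: action 0 stays put, action 1 moves to
# state 1; being in state 1 yields reward 1 (an illustrative example).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
V, pi = combined_value_policy_iteration(P, R)
```

On this toy problem the loop terminates after a few backups, once the greedy policy (move to state 1, then stay) stabilizes, whereas a pure value-based test would keep iterating until the values change by less than a tolerance.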