# Q-learning and Pontryagin's Minimum Principle

Prashant Mehta and Sean Meyn

Q-learning is a technique used to compute an optimal policy for a controlled Markov chain based on observations of the system controlled using a non-optimal policy. It has proven to be effective for models with finite state and action space. This paper establishes connections between Q-learning and nonlinear control of continuous-time models with general state space and general action space. The main contributions are summarized as follows:

- The starting point is the observation that the "Q-function" appearing in Q-learning algorithms is an extension of the Hamiltonian that appears in the Minimum Principle. Based on this observation we introduce the *steepest descent Q-learning* algorithm to obtain the optimal approximation of the Hamiltonian within a prescribed function class.
- A transformation of the optimality equations is performed based on the adjoint of a resolvent operator. This is used to construct a consistent algorithm based on stochastic approximation that requires only causal filtering of observations.
- Several examples are presented to illustrate the application of these techniques, including application to distributed control of multi-agent systems.
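As background for these contributions, the classical Q-learning setting referenced above can be illustrated with a minimal sketch: off-policy Q-learning for a finite controlled Markov chain, where the system is driven by a non-optimal (uniformly randomized) policy. The chain, costs, and discount factor below are hypothetical, chosen only for illustration; this is not the paper's SDQ(g) algorithm, which treats continuous-time models and function approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action controlled Markov chain (illustrative only):
# P[a, s, s'] is the transition probability under action a; c[s, a] is the cost.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
c = np.array([[1.0, 3.0],
              [0.0, 0.5]])
gamma = 0.5  # discount factor

Q = np.zeros((2, 2))
visits = np.zeros((2, 2))
s = 0
for _ in range(50000):
    a = int(rng.integers(2))               # non-optimal exploration policy (uniform)
    s_next = int(rng.choice(2, p=P[a, s])) # observe the controlled transition
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]             # diminishing step size per state-action pair
    # Stochastic-approximation step toward the Bellman fixed-point equation
    #   Q(s, a) = c(s, a) + gamma * min_a' Q(s', a')
    Q[s, a] += alpha * (c[s, a] + gamma * Q[s_next].min() - Q[s, a])
    s = s_next

policy = Q.argmin(axis=1)                  # greedy (cost-minimizing) policy
print(policy)
```

Note that the update uses only observed transitions under the exploration policy, yet converges to the optimal Q-function; this off-policy property is what the paper extends to continuous-time models via the Hamiltonian.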

Figure: Comparison of the optimal policy and the policy obtained from the SDQ(g) algorithm for a scalar example. The approximation is more accurate in the left-hand plot because the amplitude of the input is larger. The purple regions indicate state-input pairs with higher density under the given input signal.

Figure: Sample paths of estimates of the optimal parameters using the SDQ(g) algorithm for a multi-agent model introduced in Huang et al. (2007). The dashed lines show the asymptotically optimal values obtained in that prior work.

## References

@inproceedings{mehmey09a,
  author = {Mehta, P. G. and Meyn, S. P.},
  booktitle = {Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 2009 28th Chinese Control Conference (CDC/CCC 2009)},
  month = dec,
  pages = {3598--3605},
  title = {Q-learning and {Pontryagin's} Minimum Principle},
  year = {2009}
}

See also Chapter 11 of CTCN (*Control Techniques for Complex Networks*), and

@inproceedings{melmeyrib08,
  author = {Melo, F. S. and Meyn, S. and Ribeiro, M. Isabel},
  booktitle = {Proceedings of {ICML}},
  pages = {664--671},
  title = {An analysis of reinforcement learning with function approximation},
  url = {http://doi.acm.org/10.1145/1390156.1390240},
  year = {2008}
}

@inproceedings{shimey11,
  author = {Shirodkar, D. and Meyn, S.},
  booktitle = {Proceedings of the American Control Conference (ACC '11)},
  month = jun,
  title = {Quasi Stochastic Approximation},
  year = {2011}
}