TD-Learning with Exploration

Sean P. Meyn and Amit Surana

Invited lecture at the CDC 2011 session on Approximate/Adaptive Dynamic Programming and Its Applications to Feedback Control
Organizers: Zhong-Ping Jiang and Frank Lewis

Abstract: We introduce exploration in the TD-learning algorithm to approximate the value function for a given policy. In this way we can modify the norm used for approximation, "zooming in" to a region of interest in the state space. We also provide extensions to SARSA to eliminate the need for numerical integration in policy improvement. Construction of the algorithm and its analysis build on recent general results concerning the spectral theory of Markov chains and positive operators.

Acknowledgement: Financial support from UTRC, AFOSR grant FA9550-09-1-0190, ITMANET DARPA RK 2006-07284, and NSF grant CPS 0931416 is gratefully acknowledged.


Bellman Error w/ and w/o exploration

Bellman Error using LSTD, with and without exploration, in the dynamic speed scaling model. The column shown at right shows the results of approximate policy iteration. With or without exploration, in this example we see rapid convergence, in the sense that the Bellman error in the discounted-cost optimality equation becomes small after just four or five iterations. The column on the left shows the pair of histographs for the state, with and without exploration. Exploration led to a distribution with broader support. We see that exploration reduces the fixed-policy Bellman error significantly for larger values of x.



Title = {{TD}-Learning with Exploration},
Author = {A. Surana and S. Meyn},
Note = {Submitted to IEEE Conference on Decision and Control},
Year = {2011}}

Author = {Chen, Wei and Huang, Dayu and Kulkarni, Ankur A. and Unnikrishnan, Jayakrishnan and Zhu, Quanyan and Mehta, Prashant and Meyn, Sean and Wierman, Adam},
Booktitle = {Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 2009 28th Chinese Control Conference},
Month = {Dec.},
Pages = {3575-3580},
Title = {Approximate dynamic programming using fluid and diffusion approximations with applications to power management},
Year = {2009}}

See also Chapter 11 of CTCN, and

Title = {Q-learning and {Pontryagin's} Minimum Principle},
Author = {Mehta, P. G. and Meyn, S. P.},
Booktitle = CDC2009,
Pages = {3598-3605},
Year = {2009}}

Title = {Optimal Cross-layer Wireless Control Policies using {TD}-Learning},
Author = {Meyn, Sean and Chen, Wei and O'Neill, Daniel},
Booktitle = CDC2010,
Pages = {1951 -1956},
Year = {2010}}

Title = {Quasi Stochastic Approximation},
Author = {D. Shirodkar and S. Meyn},
Booktitle = {American Control Conference, 2011. ACC '11.},
Month = {June},
Year = {2011}}