
Q-learning and policy iteration algorithms for stochastic shortest path problems

Authors

  • Huizhen Yu
  • Dimitri Bertsekas

Abstract

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with convergence as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound. Copyright Springer Science+Business Media, LLC 2013
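The policy evaluation step described in the abstract lends itself to a compact illustration. Below is a minimal synchronous lookup table sketch in Python (illustrative only, not the authors' implementation; the function name, variable names, and tabular problem encoding are assumptions). Each outer sweep performs a policy improvement, recording the greedy policy mu and the stopping values V(j) = min over u of Q(j,u); the evaluation phase then runs a fixed number m_eval of value iterations on the optimal stopping problem whose continuation control follows mu, rather than solving a linear system for the exact cost of mu.

    import numpy as np

    def q_learning_policy_iteration(P, g, term, n_sweeps=50, m_eval=5):
        """Minimal synchronous sketch of a Q-learning / policy iteration
        hybrid in the spirit of the abstract (hypothetical encoding).

        P[u][i, j] : transition probability from state i to state j under
                     control u; the termination state `term` is absorbing
                     and cost-free
        g[u][i]    : expected one-stage cost of applying control u in state i
        """
        n_controls, n_states = len(P), P[0].shape[0]
        Q = np.zeros((n_states, n_controls))
        for _ in range(n_sweeps):
            # Policy improvement: greedy policy and the "stopping" values
            # that define the associated optimal stopping problem.
            mu = Q.argmin(axis=1)
            V = Q.min(axis=1)
            # Inexact policy evaluation: a finite number of value iterations
            # on the stopping problem, instead of solving the linear system
            # for the exact cost of the policy mu.
            for _ in range(m_eval):
                cont = np.minimum(V, Q[np.arange(n_states), mu])
                cont[term] = 0.0  # no cost accrues after termination
                for u in range(n_controls):
                    Q[:, u] = g[u] + P[u] @ cont
        return Q, Q.argmin(axis=1)

Note that the minimization over all controls occurs only once per outer sweep, when mu and V are formed; each of the m_eval inner iterations needs only the two-way minimum defining the stopping problem, which is the source of the overhead advantage over standard Q-learning mentioned in the abstract. With m_eval = 1 the update reduces to a synchronous Q-learning/value iteration sweep, since the greedy mu makes the continuation values coincide with V.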

Suggested Citation

  • Huizhen Yu & Dimitri Bertsekas, 2013. "Q-learning and policy iteration algorithms for stochastic shortest path problems," Annals of Operations Research, Springer, vol. 208(1), pages 95-132, September.
  • Handle: RePEc:spr:annopr:v:208:y:2013:i:1:p:95-132:10.1007/s10479-012-1128-z
    DOI: 10.1007/s10479-012-1128-z

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1007/s10479-012-1128-z
    Download Restriction: Access to full text is restricted to subscribers.

    File URL: https://libkey.io/10.1007/s10479-012-1128-z?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item.

    As access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    1. Dimitri P. Bertsekas & John N. Tsitsiklis, 1991. "An Analysis of Stochastic Shortest Path Problems," Mathematics of Operations Research, INFORMS, vol. 16(3), pages 580-595, August.
    2. Dimitri P. Bertsekas & Huizhen Yu, 2012. "Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming," Mathematics of Operations Research, INFORMS, vol. 37(1), pages 66-94, February.
    3. Eugene A. Feinberg, 1992. "On Stationary Strategies in Borel Dynamic Programming," Mathematics of Operations Research, INFORMS, vol. 17(2), pages 392-397, May.

    Citations

    Citations are extracted by the CitEc Project; subscribe to its RSS feed for this item.


    Cited by:

    1. Dimitri P. Bertsekas, 2018. "Proximal algorithms and temporal difference methods for solving fixed point problems," Computational Optimization and Applications, Springer, vol. 70(3), pages 709-736, July.
    2. Jorge Visca & Javier Baliosian, 2022. "rl4dtn: Q-Learning for Opportunistic Networks," Future Internet, MDPI, vol. 14(12), pages 1-17, November.
    3. Dimitri P. Bertsekas, 2019. "Robust shortest path planning and semicontractive dynamic programming," Naval Research Logistics (NRL), John Wiley & Sons, vol. 66(1), pages 15-37, February.
    4. Huizhen Yu & Dimitri P. Bertsekas, 2015. "A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies," Mathematics of Operations Research, INFORMS, vol. 40(4), pages 926-968, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dimitri P. Bertsekas, 2019. "Robust shortest path planning and semicontractive dynamic programming," Naval Research Logistics (NRL), John Wiley & Sons, vol. 66(1), pages 15-37, February.
    2. Raymond K. Cheung & B. Muralidharan, 2000. "Dynamic Routing for Priority Shipments in LTL Service Networks," Transportation Science, INFORMS, vol. 34(1), pages 86-98, February.
    3. E. Nikolova & N. E. Stier-Moses, 2014. "A Mean-Risk Model for the Traffic Assignment Problem with Stochastic Travel Times," Operations Research, INFORMS, vol. 62(2), pages 366-382, April.
    4. Eric A. Hansen, 2017. "Error bounds for stochastic shortest path problems," Mathematical Methods of Operations Research, Springer; Gesellschaft für Operations Research (GOR); Nederlands Genootschap voor Besliskunde (NGB), vol. 86(1), pages 1-27, August.
    5. Fernando Ordóñez & Nicolás E. Stier-Moses, 2010. "Wardrop Equilibria with Risk-Averse Users," Transportation Science, INFORMS, vol. 44(1), pages 63-86, February.
    6. Matthew H. Henry & Yacov Y. Haimes, 2009. "A Comprehensive Network Security Risk Model for Process Control Networks," Risk Analysis, John Wiley & Sons, vol. 29(2), pages 223-248, February.
    7. Carey E. Priebe & Donniell E. Fishkind & Lowell Abrams & Christine D. Piatko, 2005. "Random disambiguation paths for traversing a mapped hazard field," Naval Research Logistics (NRL), John Wiley & Sons, vol. 52(3), pages 285-292, April.
    8. A. Y. Golubin, 2003. "A Note on the Convergence of Policy Iteration in Markov Decision Processes with Compact Action Spaces," Mathematics of Operations Research, INFORMS, vol. 28(1), pages 194-200, February.
    9. Pretolani, Daniele, 2000. "A directed hypergraph model for random time dependent shortest paths," European Journal of Operational Research, Elsevier, vol. 123(2), pages 315-324, June.
    10. Azadian, Farshid & Murat, Alper E. & Chinnam, Ratna Babu, 2012. "Dynamic routing of time-sensitive air cargo using real-time information," Transportation Research Part E: Logistics and Transportation Review, Elsevier, vol. 48(1), pages 355-372.
    11. Emin Karagözoglu & Cagri Saglam & Agah R. Turan, 2020. "Tullock Brings Perseverance and Suspense to Tug-of-War," CESifo Working Paper Series 8103, CESifo.
    12. Dolinskaya, Irina & Shi, Zhenyu (Edwin) & Smilowitz, Karen, 2018. "Adaptive orienteering problem with stochastic travel times," Transportation Research Part E: Logistics and Transportation Review, Elsevier, vol. 109(C), pages 1-19.
    13. Arthur Flajolet & Sébastien Blandin & Patrick Jaillet, 2018. "Robust Adaptive Routing Under Uncertainty," Operations Research, INFORMS, vol. 66(1), pages 210-229, January.
    14. Benkert, Jean-Michel & Letina, Igor & Nöldeke, Georg, 2018. "Optimal search from multiple distributions with infinite horizon," Economics Letters, Elsevier, vol. 164(C), pages 15-18.
    15. B. Curtis Eaves & Arthur F. Veinott, 2014. "Maximum-Stopping-Value Policies in Finite Markov Population Decision Chains," Mathematics of Operations Research, INFORMS, vol. 39(3), pages 597-606, August.
    16. Daniel Lücking & Wolfgang Stadje, 2013. "The stochastic shortest-path problem for Markov chains with infinite state space with applications to nearest-neighbor lattice chains," Mathematical Methods of Operations Research, Springer; Gesellschaft für Operations Research (GOR); Nederlands Genootschap voor Besliskunde (NGB), vol. 77(2), pages 239-264, April.
    17. Blai Bonet, 2007. "On the Speed of Convergence of Value Iteration on Stochastic Shortest-Path Problems," Mathematics of Operations Research, INFORMS, vol. 32(2), pages 365-373, May.
    18. Cervellera, Cristiano & Caviglione, Luca, 2009. "Optimization of a peer-to-peer system for efficient content replication," European Journal of Operational Research, Elsevier, vol. 196(2), pages 423-433, July.
    19. Chris P. Lee & Glenn M. Chertow & Stefanos A. Zenios, 2008. "Optimal Initiation and Management of Dialysis Therapy," Operations Research, INFORMS, vol. 56(6), pages 1428-1449, December.
    20. Dimitri P. Bertsekas, 2018. "Proximal algorithms and temporal difference methods for solving fixed point problems," Computational Optimization and Applications, Springer, vol. 70(3), pages 709-736, July.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:spr:annopr:v:208:y:2013:i:1:p:95-132:10.1007/s10479-012-1128-z. See general information about how to correct material in RePEc.

If you have authored this item and are not yet registered with RePEc, we encourage you to register here. This allows us to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing. General contact details of provider: http://www.springer.com.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.