IDEAS home Printed from
   My bibliography  Save this article

Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning


  • Tapas K. Das

    (Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, Florida 33620)

  • Abhijit Gosavi

    (Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, Florida 33620)

  • Sridhar Mahadevan

    (Department of Computer Science, Michigan State University, East Lansing, Michigan 48824)

  • Nicholas Marchalleck

    (Cybear, Inc., 2709 Rocky Pointe Drive, Tampa, Florida 33607)


A large class of problems of sequential decision making under uncertainty, of which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL), for computing near optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling, and dynamic channel allocation of cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks that are referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique). We present a detailed study of this algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.

Suggested Citation

  • Tapas K. Das & Abhijit Gosavi & Sridhar Mahadevan & Nicholas Marchalleck, 1999. "Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning," Management Science, INFORMS, vol. 45(4), pages 560-574, April.
  • Handle: RePEc:inm:ormnsc:v:45:y:1999:i:4:p:560-574

    Download full text from publisher

    File URL:
    Download Restriction: no


    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.

    Cited by:

    1. Li, Xueping & Wang, Jiao & Sawhney, Rapinder, 2012. "Reinforcement learning for joint pricing, lead-time and scheduling decisions in make-to-order systems," European Journal of Operational Research, Elsevier, vol. 221(1), pages 99-109.
    2. Giannoccaro, Ilaria & Pontrandolfo, Pierpaolo, 2002. "Inventory management in supply chains: a reinforcement learning approach," International Journal of Production Economics, Elsevier, vol. 78(2), pages 153-161, July.
    3. Ohno, Katsuhisa, 2011. "The optimal control of just-in-time-based production and distribution systems and performance comparisons with optimized pull systems," European Journal of Operational Research, Elsevier, vol. 213(1), pages 124-133, August.
    4. van Wezel, M.C. & van Eck, N.J.P., 2005. "Reinforcement learning and its application to Othello," Econometric Institute Research Papers EI 2005-47, Erasmus University Rotterdam, Erasmus School of Economics (ESE), Econometric Institute.
    5. Gosavi, Abhijit, 2004. "Reinforcement learning for long-run average cost," European Journal of Operational Research, Elsevier, vol. 155(3), pages 654-674, June.
    6. Schütz, Hans-Jörg & Kolisch, Rainer, 2012. "Approximate dynamic programming for capacity allocation in the service industry," European Journal of Operational Research, Elsevier, vol. 218(1), pages 239-250.
    7. Prasenjit Mondal, 2016. "On undiscounted semi-Markov decision processes with absorbing states," Mathematical Methods of Operations Research, Springer;Gesellschaft für Operations Research (GOR);Nederlands Genootschap voor Besliskunde (NGB), vol. 83(2), pages 161-177, April.
    8. Xiao Wang & Hongwei Wang & Chao Qi, 2016. "Multi-agent reinforcement learning based maintenance policy for a resource constrained flow line system," Journal of Intelligent Manufacturing, Springer, vol. 27(2), pages 325-333, April.
    9. Ohno, Katsuhisa & Boh, Toshitaka & Nakade, Koichi & Tamura, Takayoshi, 2016. "New approximate dynamic programming algorithms for large-scale undiscounted Markov decision processes and their application to optimize a production and distribution system," European Journal of Operational Research, Elsevier, vol. 249(1), pages 22-31.


    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:inm:ormnsc:v:45:y:1999:i:4:p:560-574. See general information about how to correct material in RePEc.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: (Mirko Janc). General contact details of provider: .

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service hosted by the Research Division of the Federal Reserve Bank of St. Louis . RePEc uses bibliographic data supplied by the respective publishers.