
Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning


Author Info

  • Tapas K. Das

    (Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, Florida 33620)

  • Abhijit Gosavi

    (Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, Florida 33620)

  • Sridhar Mahadevan

    (Department of Computer Science, Michigan State University, East Lansing, Michigan 48824)

  • Nicholas Marchalleck

    (Cybear, Inc., 2709 Rocky Pointe Drive, Tampa, Florida 33607)


    A large class of problems of sequential decision making under uncertainty, of which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL), for computing near optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling, and dynamic channel allocation of cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks that are referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique). We present a detailed study of this algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.
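The abstract names SMART but does not state its update rule, so the following is only a minimal sketch of a SMART-style learner under stated assumptions. It uses the form in which average-reward Q-learning for SMDPs is commonly summarized: the target is the immediate reward, minus the estimated reward rate times the sojourn time, plus the best value in the next state, with the reward rate estimated from greedy steps only. The two-state machine simulator, the step-size schedule, and the exploration rate are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def simulate(state, action, rng):
    """Hypothetical 2-state machine (0 = good, 1 = worn).

    Actions: 0 = produce, 1 = maintain.
    Returns (next_state, reward, sojourn_time).
    """
    if action == 1:                       # maintenance costs money, resets the machine
        return 0, -2.0, rng.uniform(0.5, 1.5)
    if state == 0:                        # producing on a good machine may wear it
        nxt = 1 if rng.random() < 0.3 else 0
        return nxt, 5.0, rng.uniform(0.8, 1.2)
    if rng.random() < 0.5:                # producing on a worn machine may fail
        return 0, -20.0, rng.uniform(3.0, 5.0)
    return 1, 4.0, rng.uniform(0.8, 1.2)


def smart(n_steps=50000, seed=0, n_states=2, n_actions=2, epsilon=0.1):
    """SMART-style average-reward Q-learning on the toy SMDP above (a sketch)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    total_reward, total_time, rho = 0.0, 0.0, 0.0
    state = 0
    for k in range(1, n_steps + 1):
        alpha = 0.1 / (1.0 + k / 5000.0)  # assumed decaying step size
        greedy = max(range(n_actions), key=lambda a: q[state][a])
        explore = rng.random() < epsilon
        action = rng.randrange(n_actions) if explore else greedy
        nxt, reward, tau = simulate(state, action, rng)
        # Reward is charged rho per unit of time spent in the transition.
        target = reward - rho * tau + max(q[nxt])
        q[state][action] += alpha * (target - q[state][action])
        if not explore:                   # update the rate from greedy steps only
            total_reward += reward
            total_time += tau
            rho = total_reward / total_time
        state = nxt
    policy = [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return q, rho, policy


if __name__ == "__main__":
    q, rho, policy = smart()
    print("estimated reward rate:", rho)
    print("greedy policy (0 = produce, 1 = maintain):", policy)
```

In this toy model the policy "produce when good, maintain when worn" has an analytic reward rate of 4.4/1.3, roughly 3.38 per unit time, which gives a reference point for the learned estimate. The learner only sees sampled transitions, rewards, and sojourn times, which is the model-free property the abstract emphasizes.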

    Bibliographic Info

    Article provided by INFORMS in its journal Management Science.

    Volume (Year): 45 (1999)
    Issue (Month): 4 (April)
    Pages: 560-574

    Handle: RePEc:inm:ormnsc:v:45:y:1999:i:4:p:560-574

    Contact details of provider:
    Postal: 7240 Parkway Drive, Suite 300, Hanover, MD 21076 USA
    Phone: +1-443-757-3500
    Fax: +1-443-757-3515

    Related research

    Keywords: semi-Markov decision processes (SMDP); reinforcement learning; average reward; preventive maintenance



    Cited by:
    1. Ohno, Katsuhisa, 2011. "The optimal control of just-in-time-based production and distribution systems and performance comparisons with optimized pull systems," European Journal of Operational Research, Elsevier, vol. 213(1), pages 124-133, August.
    2. Giannoccaro, Ilaria & Pontrandolfo, Pierpaolo, 2002. "Inventory management in supply chains: a reinforcement learning approach," International Journal of Production Economics, Elsevier, vol. 78(2), pages 153-161, July.
    3. van Wezel, M.C. & van Eck, N.J.P., 2005. "Reinforcement learning and its application to Othello," Econometric Institute Research Papers EI 2005-47, Erasmus University Rotterdam, Erasmus School of Economics (ESE), Econometric Institute.
    4. Schütz, Hans-Jörg & Kolisch, Rainer, 2012. "Approximate dynamic programming for capacity allocation in the service industry," European Journal of Operational Research, Elsevier, vol. 218(1), pages 239-250.
    5. Li, Xueping & Wang, Jiao & Sawhney, Rapinder, 2012. "Reinforcement learning for joint pricing, lead-time and scheduling decisions in make-to-order systems," European Journal of Operational Research, Elsevier, vol. 221(1), pages 99-109.

