Author
Listed:
- Cesar Andrey Perdomo Charry
(Faculty of Engineering, Universidad Distrital Francisco José de Caldas, Bogotá 111611, Colombia)
- Marlon Sneider Mora Cortes
(Faculty of Engineering, Universidad Distrital Francisco José de Caldas, Bogotá 111611, Colombia)
- Oscar J. Perdomo
(Department of Electrical and Electronic Engineering, Universidad Nacional de Colombia, Bogotá 111321, Colombia)
Abstract
The current design of reinforcement learning methods requires extensive computational resources. Algorithms such as Deep Q-Network (DQN) have obtained outstanding results in advancing the field. However, the need to tune thousands of parameters and run millions of training episodes remains a significant challenge. This document proposes a comparative analysis between the Q-Learning algorithm, which laid the foundations for Deep Q-Learning, and our proposed method, termed M-Learning. The comparison is conducted using Markov Decision Processes with delayed rewards as a general test-bench framework. First, this document provides a full description of the main challenges related to implementing Q-Learning, particularly concerning its multiple parameters. Then, the foundations of our proposed heuristic are presented, including its formulation, and the algorithm is described in detail. The methodology used to compare both algorithms involved training them in the Frozen Lake environment. The experimental results, along with an analysis of the best solutions, demonstrate that our proposal requires fewer episodes and exhibits reduced variability in the outcomes. Specifically, M-Learning trains agents 30.7% faster in the deterministic environment and 61.66% faster in the stochastic environment. Additionally, it achieves greater consistency, reducing the standard deviation of scores by 58.37% and 49.75% in the deterministic and stochastic settings, respectively. The code will be made available in a GitHub repository upon this paper’s publication.
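For readers unfamiliar with the baseline being compared against, the sketch below shows tabular Q-Learning on the Frozen Lake environment, the setting described in the abstract. It is a minimal illustration only, not the authors' M-Learning method: the hyperparameters (ALPHA, GAMMA, EPSILON, EPISODES) and the use of the Gymnasium `FrozenLake-v1` environment are assumptions for demonstration, not the paper's exact experimental configuration.

```python
import numpy as np
import gymnasium as gym

# Illustrative hyperparameters; the paper's actual settings are not reproduced here.
ALPHA = 0.1      # learning rate
GAMMA = 0.99     # discount factor
EPSILON = 0.1    # epsilon-greedy exploration rate
EPISODES = 5000  # number of training episodes

# Deterministic variant; set is_slippery=True for the stochastic environment.
env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(EPISODES):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-Learning update: move Q(s, a) toward the bootstrapped target.
        # The reward is delayed: it is non-zero only when the goal is reached.
        td_target = reward + GAMMA * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += ALPHA * (td_target - Q[state, action])
        state = next_state

env.close()
```

Even this small example exposes the tuning burden the abstract refers to: the learning rate, discount factor, exploration rate, and episode budget all interact, which is the motivation for comparing against a heuristic with fewer parameters.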
Suggested Citation
Cesar Andrey Perdomo Charry & Marlon Sneider Mora Cortes & Oscar J. Perdomo, 2025.
"M-Learning: Heuristic Approach for Delayed Rewards in Reinforcement Learning,"
Mathematics, MDPI, vol. 13(13), pages 1-21, June.
Handle:
RePEc:gam:jmathe:v:13:y:2025:i:13:p:2108-:d:1689127
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:13:y:2025:i:13:p:2108-:d:1689127. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do so here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help add them by using this form.
If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .
Please note that corrections may take a couple of weeks to filter through the various RePEc services.