Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

Authors

Yanwei Jia
Xun Yu Zhou

Abstract

We study policy gradient (PG) for reinforcement learning (RL) in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2021) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, in which we learn and update value functions and policies simultaneously and alternately. The first type is based directly on the aforementioned representation, which involves future trajectories and hence is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated using stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples.
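
To make the offline idea in the abstract concrete, the following is a minimal sketch of a time-discretized, score-weighted gradient estimate along one sampled path. It is not the authors' algorithm: the linear-mean Gaussian policy, the toy dynamics and reward in `simulate`, the placeholder critic `value`, the temperature `temp`, and the omission of discounting and terminal payoffs are all assumptions made for illustration.

```python
import math
import random


def gaussian_log_density(a, x, phi, sigma):
    """Log density of the assumed policy N(phi * x, sigma^2) at action a."""
    z = (a - phi * x) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def gaussian_score(a, x, phi, sigma):
    """Derivative of the log density with respect to the policy parameter phi."""
    return (a - phi * x) * x / sigma ** 2


def pg_estimate(traj, value, phi, sigma, dt, temp):
    """Sum of score * (value increment + regularized running reward * dt).

    traj is a list of (t, x, a, r, x_next) tuples from one sampled path.
    value(t, x) is a stand-in for a learned critic (hypothetical here).
    """
    grad = 0.0
    for t, x, a, r, x_next in traj:
        # Entropy-style regularizer: -temp * log pi(a | x), an assumption.
        bonus = -temp * gaussian_log_density(a, x, phi, sigma)
        increment = value(t + dt, x_next) - value(t, x) + (r + bonus) * dt
        grad += gaussian_score(a, x, phi, sigma) * increment
    return grad


def simulate(phi, sigma, dt, n_steps, x0, rng):
    """Euler scheme for the toy dynamics dX = a dt + 0.1 dW (assumed)."""
    traj, x = [], x0
    for k in range(n_steps):
        a = rng.gauss(phi * x, sigma)      # sample an exploratory action
        r = -(x * x + a * a)               # toy quadratic running reward
        noise = rng.gauss(0.0, 1.0)
        x_next = x + a * dt + 0.1 * math.sqrt(dt) * noise
        traj.append((k * dt, x, a, r, x_next))
        x = x_next
    return traj


if __name__ == "__main__":
    rng = random.Random(0)
    path = simulate(phi=-0.5, sigma=0.3, dt=0.01, n_steps=100, x0=1.0, rng=rng)
    zero_critic = lambda t, x: 0.0         # placeholder critic
    print(pg_estimate(path, zero_critic, -0.5, 0.3, 0.01, temp=0.1))
```

The estimate needs the whole sampled path before it can be formed, which is why a scheme of this kind is offline. In an actor-critic loop, `value` would be replaced by a critic that is updated alongside `phi`, and estimates from many paths would be averaged before each policy update.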

Suggested Citation

Yanwei Jia & Xun Yu Zhou, 2021. "Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms," Papers 2111.11232, arXiv.org, revised Jul 2022.

Handle: RePEc:arx:papers:2111.11232

References listed on IDEAS

R. H. Strotz, 1955. "Myopia and Inconsistency in Dynamic Utility Maximization," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 23(3), pages 165-180.
Duan Li & Wan‐Lung Ng, 2000. "Optimal Dynamic Portfolio Selection: Multiperiod Mean‐Variance Formulation," Mathematical Finance, Wiley Blackwell, vol. 10(3), pages 387-406, July.
Min Dai & Hanqing Jin & Steven Kou & Yuhong Xu, 2021. "A Dynamic Mean-Variance Analysis for Log Returns," Management Science, INFORMS, vol. 67(2), pages 1093-1108, February.
David Silver & Julian Schrittwieser & Karen Simonyan & Ioannis Antonoglou & Aja Huang & Arthur Guez & Thomas Hubert & Lucas Baker & Matthew Lai & Adrian Bolton & Yutian Chen & Timothy Lillicrap & Fan , 2017. "Mastering the game of Go without human knowledge," Nature, Nature, vol. 550(7676), pages 354-359, October.
Yanwei Jia & Xun Yu Zhou, 2021. "Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach," Papers 2108.06655, arXiv.org, revised Feb 2022.
Suleyman Basak & Georgy Chabakauri, 2010. "Dynamic Mean-Variance Asset Allocation," The Review of Financial Studies, Society for Financial Studies, vol. 23(8), pages 2970-3016, August.
- Basak, Suleyman & Chabakauri, Georgy, 2009. "Dynamic Mean-Variance Asset Allocation," CEPR Discussion Papers 7256, C.E.P.R. Discussion Papers.

Citations

Cited by:

Zhou Fang, 2023. "Continuous-Time Path-Dependent Exploratory Mean-Variance Portfolio Construction," Papers 2303.02298, arXiv.org.
Wu, Bo & Li, Lingfei, 2024. "Reinforcement learning for continuous-time mean-variance portfolio selection in a regime-switching market," Journal of Economic Dynamics and Control, Elsevier, vol. 158(C).
Yilie Huang & Yanwei Jia & Xun Yu Zhou, 2024. "Mean-Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study," Papers 2412.16175, arXiv.org, revised Aug 2025.
Yilie Huang, 2025. "Continuous-Time Reinforcement Learning for Asset-Liability Management," Papers 2509.23280, arXiv.org.
Yanwei Jia, 2024. "Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty," Papers 2404.12598, arXiv.org.
Jodi Dianetti & Giorgio Ferrari & Renyuan Xu, 2024. "Exploratory Optimal Stopping: A Singular Control Formulation," Papers 2408.09335, arXiv.org, revised Oct 2024.
Chen Ziyi & Gu Jia-wen, 2025. "Exploratory Utility Maximization Problem with Tsallis Entropy," Papers 2502.01269, arXiv.org.
Cardo-Miota, Javier & Khadem, Shafi & Bahloul, Mohamed, 2025. "Deep reinforcement learning based electricity bill minimization strategy for residential prosumer," Mathematics and Computers in Simulation (MATCOM), Elsevier, vol. 238(C), pages 296-305.
Junyan Ye & Hoi Ying Wong & Kyunghyun Park, 2025. "Robust Exploratory Stopping under Ambiguity in Reinforcement Learning," Papers 2510.10260, arXiv.org.
Wanting He & Wenyuan Li & Yunran Wei, 2025. "Periodic evaluation of defined-contribution pension fund: A dynamic risk measure approach," Papers 2508.05241, arXiv.org.
Xiangyu Cui & Xun Li & Yun Shi & Si Zhao, 2023. "Discrete-Time Mean-Variance Strategy Based on Reinforcement Learning," Papers 2312.15385, arXiv.org.
Dianetti, Jodi & Ferrari, Giorgio & Xu, Renyuan, 2025. "Exploratory Optimal Stopping: A Singular Control Formulation," Center for Mathematical Economics Working Papers 740, Center for Mathematical Economics, Bielefeld University.
Zhou Fang & Haiqing Xu, 2023. "Option Market Making via Reinforcement Learning," Papers 2307.01814, arXiv.org, revised Mar 2025.
Huy Chau & Duy Nguyen & Thai Nguyen, 2024. "Continuous-time optimal investment with portfolio constraints: a reinforcement learning approach," Papers 2412.10692, arXiv.org.
Min Dai & Yu Sun & Zuo Quan Xu & Xun Yu Zhou, 2024. "Learning to Optimally Stop Diffusion Processes, with Financial Applications," Papers 2408.09242, arXiv.org, revised Aug 2025.
Zhou Fang & Haiqing Xu, 2023. "Over-the-Counter Market Making via Reinforcement Learning," Papers 2307.01816, arXiv.org.
Yanwei Jia & Xun Yu Zhou, 2022. "q-Learning in Continuous Time," Papers 2207.00713, arXiv.org, revised May 2025.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

De Gennaro Aquino, Luca & Sornette, Didier & Strub, Moris S., 2023. "Portfolio selection with exploration of new investment assets," European Journal of Operational Research, Elsevier, vol. 310(2), pages 773-792.
Xiang Meng, 2019. "Dynamic Mean-Variance Portfolio Optimisation," Papers 1907.03093, arXiv.org.
Xiangyu Cui & Xun Li & Duan Li & Yun Shi, 2014. "Time Consistent Behavior Portfolio Policy for Dynamic Mean-Variance Formulation," Papers 1408.6070, arXiv.org, revised Aug 2015.
Ben Hambly & Renyuan Xu & Huining Yang, 2021. "Recent Advances in Reinforcement Learning in Finance," Papers 2112.04553, arXiv.org, revised Feb 2023.
Li, Yongwu & Li, Zhongfei, 2013. "Optimal time-consistent investment and reinsurance strategies for mean–variance insurers with state dependent risk aversion," Insurance: Mathematics and Economics, Elsevier, vol. 53(1), pages 86-97.
Huy Chau & Duy Nguyen & Thai Nguyen, 2024. "Continuous-time optimal investment with portfolio constraints: a reinforcement learning approach," Papers 2412.10692, arXiv.org.
Zongxia Liang & Sheng Wang & Jianming Xia, 2024. "An Integral Equation in Portfolio Selection with Time-Inconsistent Preferences," Papers 2412.02446, arXiv.org, revised Jan 2025.
Fießinger, Felix & Stadje, Mitja, 2025. "Time-consistent asset allocation for risk measures in a Lévy market," European Journal of Operational Research, Elsevier, vol. 321(2), pages 676-695.
Felix Fießinger & Mitja Stadje, 2023. "Time-Consistent Asset Allocation for Risk Measures in a Lévy Market," Papers 2305.09471, arXiv.org, revised Feb 2026.
Xue Dong He & Xun Yu Zhou, 2021. "Who Are I: Time Inconsistency and Intrapersonal Conflict and Reconciliation," Papers 2105.01829, arXiv.org.
Agostino Capponi & Sveinn Ólafsson & Thaleia Zariphopoulou, 2022. "Personalized Robo-Advising: Enhancing Investment Through Client Interaction," Management Science, INFORMS, vol. 68(4), pages 2485-2512, April.
Ma, Shuai & Ma, Xiaoteng & Xia, Li, 2023. "A unified algorithm framework for mean-variance optimization in discounted Markov decision processes," European Journal of Operational Research, Elsevier, vol. 311(3), pages 1057-1067.
Dong-Mei Zhu & Jia-Wen Gu & Feng-Hui Yu & Tak-Kuen Siu & Wai-Ki Ching, 2021. "Optimal pairs trading with dynamic mean-variance objective," Mathematical Methods of Operations Research, Springer;Gesellschaft für Operations Research (GOR);Nederlands Genootschap voor Besliskunde (NGB), vol. 94(1), pages 145-168, August.
Tomas Björk & Agatha Murgoci & Xun Yu Zhou, 2014. "Mean–Variance Portfolio Optimization With State-Dependent Risk Aversion," Mathematical Finance, Wiley Blackwell, vol. 24(1), pages 1-24, January.
Simone Cerreia-Vioglio & Fulvio Ortu & Francesco Rotondi & Federico Severino, 2024. "On horizon-consistent mean-variance portfolio allocation," Annals of Operations Research, Springer, vol. 336(1), pages 797-828, May.
Keffert, Henk, 2024. "Robo-advising: Optimal investment with mismeasured and unstable risk preferences," European Journal of Operational Research, Elsevier, vol. 315(1), pages 378-392.
Zongxia Liang & Jianming Xia & Fengyi Yuan, 2023. "Dynamic portfolio selection for nonlinear law-dependent preferences," Papers 2311.06745, arXiv.org, revised Nov 2023.
Wei, Jiaqin & Wang, Tianxiao, 2017. "Time-consistent mean–variance asset–liability management with random coefficients," Insurance: Mathematics and Economics, Elsevier, vol. 77(C), pages 84-96.
Yuchen Li & Zongxia Liang & Shunzhi Pang, 2022. "Continuous-Time Monotone Mean-Variance Portfolio Selection in Jump-Diffusion Model," Papers 2211.12168, arXiv.org, revised May 2024.
Chi Kin Lam & Yuhong Xu & Guosheng Yin, 2016. "Dynamic portfolio selection without risk-free assets," Papers 1602.04975, arXiv.org.
