Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness

Author

Haoyu Wei

Abstract

Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. In an auxiliary augmented transition-sampling experiment, we characterize the existence of the efficient influence function and show that non-regularity arises when competing optimal policies have distinct first-order gradients. For the actual i.i.d.-trajectory experiment, we derive the semiparametric efficiency bound and a uniformly weighted estimator that attains it under a unique optimum, while the sequential NSAVE procedure trades efficiency for stability and validity under non-uniqueness. Motivated by this analysis, we propose a novel Nonparametric SequentiAl Value Evaluation (NSAVE) method, which yields martingale-based inference and retains a double-robustness property under policy-aligned nuisance estimation. We further develop a pointwise smoothing-based approximation under explicit first-stage rates, and a post-selection template with uniform coverage whenever its stated joint calibration condition is satisfied. Simulation studies support the theoretical results. An application to the Drink Less micro-randomized trial provides confidence intervals for state-adaptive notification policies and their improvement over the randomized behavior policy.
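
The non-regularity described in the abstract can be illustrated with a toy example that is not taken from the paper: a two-armed problem with no states or transitions, where the "optimal value" is the larger of two mean rewards. The sketch below is not the NSAVE method. The function name `plug_in_max_bias`, the Gaussian unit-variance rewards, and all parameter values are assumptions chosen for illustration. It compares the scaled error of the plug-in estimator (the maximum of the two sample means) when the arms are tied, so the optimal policy is non-unique, against the case where one arm is clearly better.

```python
import numpy as np

def plug_in_max_bias(mu1, mu2, n=200, reps=5000, seed=0):
    """Monte Carlo mean of sqrt(n) * (max of sample means - max of true means).

    Hypothetical toy setup: two arms with unit-variance Gaussian rewards,
    n observations per arm, repeated `reps` times.
    """
    rng = np.random.default_rng(seed)
    # The sample mean of n unit-variance Gaussian rewards is N(mu, 1/n),
    # so it is drawn directly instead of simulating every reward.
    m1 = mu1 + rng.standard_normal(reps) / np.sqrt(n)
    m2 = mu2 + rng.standard_normal(reps) / np.sqrt(n)
    # Scaled estimation error of the plug-in estimate of the optimal value.
    err = np.sqrt(n) * (np.maximum(m1, m2) - max(mu1, mu2))
    return err.mean()

# Tied arms: both policies are optimal, and the plug-in estimator is biased upward.
tied = plug_in_max_bias(0.0, 0.0)
# Well-separated arms: the optimum is unique, and the scaled error is centered near zero.
separated = plug_in_max_bias(1.0, 0.0)

print(f"tied arms:      mean scaled error = {tied:.3f}")
print(f"separated arms: mean scaled error = {separated:.3f}")
```

In the tied case the scaled error behaves like the maximum of two independent standard normals, whose mean is 1/sqrt(pi), roughly 0.56, so a normal-theory interval centered on the plug-in estimate would be off-center. With a unique optimum the error is approximately a centered normal. This is the simplest instance of the failure mode that motivates inference procedures designed to remain valid under non-uniqueness.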

Suggested Citation

Haoyu Wei, 2025. "Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness," Papers 2505.13809, arXiv.org, revised Jun 2026.

Handle: RePEc:arx:papers:2505.13809
