
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets

Authors

  • Jaden Zhang
  • Gardenia Liu
  • Oliver Johansson
  • Hileamlak Yitayew
  • Kamryn Ohly
  • Grace Li

Abstract

We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests models in environments where trades execute on actual exchanges (Kalshi and Polymarket), providing objective ground truth that cannot be gamed or overfitted. Each model operates as an independent agent starting with $10,000, making autonomous decisions every 15-45 minutes. Over a 57-day longitudinal evaluation (January 12 to March 9, 2026), we track two cohorts: six frontier models in live trading (Cohort 1, full period) and four next-generation models in paper trading (Cohort 2, 3-day preliminary). For Cohort 1, final Kalshi returns range from -16.0% to -30.8%. Our analysis identifies a clear performance hierarchy: initial prediction accuracy and the ability to capitalize on correct predictions are the main drivers, while research volume shows no correlation with outcomes. A striking cross-platform contrast emerges from parallel Polymarket live trading: Cohort 1 models averaged only -1.1% on Polymarket vs. -22.6% on Kalshi, with grok-4-20-checkpoint achieving a 71.4% settlement win rate - the highest across any platform or cohort. gemini-3.1-pro-preview (Cohort 2), which executed zero trades on Kalshi, achieved +6.02% on Polymarket in 3 days - the best return of any model across either cohort - demonstrating that platform design has a profound effect on which models succeed. Beyond performance, we analyze computational efficiency (token usage, cycle time), settlement accuracy, exit patterns, and market preferences, providing a comprehensive view of how frontier models behave under real financial pressure.

Suggested Citation

  • Jaden Zhang & Gardenia Liu & Oliver Johansson & Hileamlak Yitayew & Kamryn Ohly & Grace Li, 2026. "Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets," Papers 2604.07355, arXiv.org.
  • Handle: RePEc:arx:papers:2604.07355

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2604.07355
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Galanis Spyros & Kotronis Stelios, 2021. "Updating Awareness and Information Aggregation," The B.E. Journal of Theoretical Economics, De Gruyter, vol. 21(2), pages 613-635, June.
    2. Dian Yu & Jianjun Gao & Weiping Wu & Zizhuo Wang, 2022. "Price Interpretability of Prediction Markets: A Convergence Analysis," Papers 2205.08913, arXiv.org, revised Nov 2023.
    3. Spyros Galanis & Christos A Ioannou & Stelios Kotronis, 2024. "Information Aggregation Under Ambiguity: Theory and Experimental Evidence," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 91(6), pages 3423-3467.
    4. Spyros Galanis & Sergei Mikhalishchev, 2024. "Information Aggregation with Costly Information Acquisition," Papers 2406.07186, arXiv.org, revised Apr 2026.
    5. Florian Teschner & David Rothschild & Henner Gimpel, 2017. "Manipulation in Conditional Decision Markets," Group Decision and Negotiation, Springer, vol. 26(5), pages 953-971, September.
    6. Edoardo Gaffeo, 2013. "Using information markets in grantmaking. An assessment of the issues involved and an application to Italian banking foundations," DEM Discussion Papers 2013/08, Department of Economics and Management.
    7. Jianjun Gao & Zizhuo Wang & Weiping Wu & Dian Yu, 2025. "Price Interpretability of Prediction Markets: A Convergence Analysis," Operations Research, INFORMS, vol. 73(1), pages 157-177, January.
    8. Sung, Ming-Chien & McDonald, David C.J. & Johnson, Johnnie E.V. & Tai, Chung-Ching & Cheah, Eng-Tuck, 2019. "Improving prediction market forecasts by detecting and correcting possible over-reaction to price movements," European Journal of Operational Research, Elsevier, vol. 272(1), pages 389-405.
    9. Forsell, Eskil & Viganola, Domenico & Pfeiffer, Thomas & Almenberg, Johan & Wilson, Brad & Chen, Yiling & Nosek, Brian A. & Johannesson, Magnus & Dreber, Anna, 2019. "Predicting replication outcomes in the Many Labs 2 study," Journal of Economic Psychology, Elsevier, vol. 75(PA).
    10. Strijbis, Oliver & Arnesen, Sveinung, 2019. "Explaining variance in the accuracy of prediction markets," International Journal of Forecasting, Elsevier, vol. 35(1), pages 408-419.
    11. Berg, Joyce E. & Rietz, Thomas A., 2019. "Longshots, overconfidence and efficiency on the Iowa Electronic Market," International Journal of Forecasting, Elsevier, vol. 35(1), pages 271-287.
    12. Siemroth, Christoph, 2014. "Why prediction markets work : The role of information acquisition and endogenous weighting," Working Papers 14-02, University of Mannheim, Department of Economics.
    13. Rafael Frongillo, 2022. "Quantum Information Elicitation," Papers 2203.07469, arXiv.org.
    14. Karimi, Majid & Zaerpour, Nima, 2022. "Put your money where your forecast is: Supply chain collaborative forecasting with cost-function-based prediction markets," European Journal of Operational Research, Elsevier, vol. 300(3), pages 1035-1049.
    15. Denter, Philipp & Sisak, Dana, 2015. "Do polls create momentum in political competition?," Journal of Public Economics, Elsevier, vol. 130(C), pages 1-14.
    16. Mikuláš Gangur & Miroslav Plevný, 2014. "Tools for Consumer Rights Protection in the Prediction of Electronic Virtual Market and Technological Changes," The AMFITEATRU ECONOMIC journal, Academy of Economic Studies - Bucharest, Romania, vol. 16(36), pages 578-578, May.
    17. Bergemann, Dirk & Ottaviani, Marco, 2021. "Information Markets and Nonmarkets," CEPR Discussion Papers 16459, Centre for Economic Policy Research.
    18. Przemysław Rola, 2025. "Boltzmann Price: Toward Understanding the Fair Price in High-Frequency Markets," Papers 2507.09734, arXiv.org.
    19. Patrick Buckley & Fergal O’Brien, 0. "The effect of malicious manipulations on prediction market accuracy," Information Systems Frontiers, Springer, vol. 0, pages 1-13.
    20. Khan, Urmee & Lieli, Robert P., 2018. "Information flow between prediction markets, polls and media: Evidence from the 2008 presidential primaries," International Journal of Forecasting, Elsevier, vol. 34(4), pages 696-710.

