Printed from https://ideas.repec.org/p/arx/papers/2510.24990.html

The Economics of AI Training Data: A Research Agenda

Author

Listed:
  • Hamidah Oderinwale
  • Anna Kazlauskas

Abstract

Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties -- nonrivalry, context dependence, and emergent rivalry through contamination -- and trace historical precedents for market formation in commodities such as oil and grain. Second, we present systematic documentation of AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and that most deals exclude original creators from compensation. Third, we propose a formal hierarchy of exchangeable data units (token, record, dataset, corpus, stream) and argue for data's explicit representation in production functions. Building on these foundations, we outline four open research problems foundational to data economics: measuring context-dependent value, balancing governance with privacy, estimating data's contribution to production, and designing mechanisms for heterogeneous, compositional goods.

Suggested Citation

  • Hamidah Oderinwale & Anna Kazlauskas, 2025. "The Economics of AI Training Data: A Research Agenda," Papers 2510.24990, arXiv.org.
  • Handle: RePEc:arx:papers:2510.24990

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2510.24990
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Alessandro Acquisti & Curtis Taylor & Liad Wagman, 2016. "The Economics of Privacy," Journal of Economic Literature, American Economic Association, vol. 54(2), pages 442-492, June.
    2. Charles I. Jones & Christopher Tonetti, 2020. "Nonrivalry and the Economics of Data," American Economic Review, American Economic Association, vol. 110(9), pages 2819-2858, September.
    3. George A. Akerlof, 1970. "The Market for "Lemons": Quality Uncertainty and the Market Mechanism," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 84(3), pages 488-500.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Long Chen & Yadong Huang & Shumiao Ouyang & Wei Xiong, 2021. "The Data Privacy Paradox and Digital Demand," Working Papers 2021-47, Princeton University, Economics Department.
    2. Wang, Huizong & Hao, Yulong & Fu, Qiang, 2024. "Data factor agglomeration and urban green finance: A quasi-natural experiment based on the National Big Data Comprehensive Pilot Zone," International Review of Financial Analysis, Elsevier, vol. 96(PB).
    3. Daron Acemoglu & Ali Makhdoumi & Azarakhsh Malekian & Asu Ozdaglar, 2022. "Too Much Data: Prices and Inefficiencies in Data Markets," American Economic Journal: Microeconomics, American Economic Association, vol. 14(4), pages 218-256, November.
    4. Zhang, Wenkang & Wu, Jing, 2025. "Endogenous growth and data heterogeneity in data economics," Finance Research Letters, Elsevier, vol. 78(C).
    5. Caleb S. Fuller, 2019. "Is the market for digital privacy a failure?," Public Choice, Springer, vol. 180(3), pages 353-381, September.
    6. Catherine E. Tucker, 2023. "The Economics of Privacy: An Agenda," NBER Chapters, in: The Economics of Privacy, pages 5-20, National Bureau of Economic Research, Inc.
    7. Chen, S. & Doerr, S. & Frost, J. & Gambacorta, L. & Shin, H.S., 2023. "The fintech gender gap," Journal of Financial Intermediation, Elsevier, vol. 54(C).
    8. He, Zhiguo & Huang, Jing & Zhou, Jidong, 2023. "Open banking: Credit market competition when borrowers own the data," Journal of Financial Economics, Elsevier, vol. 147(2), pages 449-474.
    9. Jacopo Arpetti & Marco Delmastro, 2021. "The privacy paradox: a challenge to decision theory?," Economia e Politica Industriale: Journal of Industrial and Business Economics, Springer;Associazione Amici di Economia e Politica Industriale, vol. 48(4), pages 505-525, December.
    10. Delbono, Flavio & Reggiani, Carlo & Sandrini, Luca, 2024. "Strategic data sales with partial segment profiling," Information Economics and Policy, Elsevier, vol. 68(C).
    11. Bertin Martens, 2020. "An economic perspective on data and platform market power," JRC Working Papers on Digital Economy 2020-09, Joint Research Centre.
    12. Budzinski, Oliver & Gänßle, Sophia & Lindstädt-Dreusicke, Nadine, 2021. "Data (r)evolution - The economics of algorithmic search and recommender services," Ilmenau Economics Discussion Papers 148, Ilmenau University of Technology, Institute of Economics.
    13. Chu, Zhaopeng & Chen, Xin & Yang, Jun, 2025. "Impact of data factor and data integration on economic development: Empirical insights from China," Telecommunications Policy, Elsevier, vol. 49(8).
    14. Olivier Armantier & Sebastian Doerr & Jon Frost & Andreas Fuster & Kelly Shue, 2024. "Nothing to hide? Gender and age differences in willingness to share data," Swiss Finance Institute Research Paper Series 24-99, Swiss Finance Institute.
    15. Oliver Falck & Johannes Koenen, 2020. "Resource “Data”: Economic Benefits of Data Provision," CESifo Forum, ifo Institute - Leibniz Institute for Economic Research at the University of Munich, vol. 21(03), pages 31-41, September.
    16. Budzinski, Oliver & Gruésevaja, Marina & Noskova, Victoriia, 2020. "The economics of the German investigation of Facebook's data collection," Ilmenau Economics Discussion Papers 139, Ilmenau University of Technology, Institute of Economics.
    17. Xia, Yue & Md Johar, Md Gapar, 2025. "How external factors influence organisational digital innovation: Evidence from China," Technology in Society, Elsevier, vol. 81(C).
    18. Wagner, Dirk Nicolas, 2020. "The nature of the Artificially Intelligent Firm - An economic investigation into changes that AI brings to the firm," Telecommunications Policy, Elsevier, vol. 44(6).
    19. Wendy C.Y. Li & Makoto Nirei & Kazufumi Yamana, 2018. "Value of Data: There’s No Such Thing as a Free Lunch in the Digital Economy," BEA Working Papers 0164, Bureau of Economic Analysis.
    20. Shuilin Liu & Xudong Lin & Xiaoli Huang & Hanyang Luo & Sumin Yu, 2023. "Research on Service-Driven Benign Market with Platform Subsidy Strategy," Mathematics, MDPI, vol. 11(2), pages 1-21, January.

