Printed from https://ideas.repec.org/p/arx/papers/2512.19675.html

Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)

Authors

  • Niclas Griesshaber
  • Jochen Streb

Abstract

We use multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans, using an LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher-quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. Each page contains about 20 to 50 patent entries, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of this primary source material suggests to us that multimodal LLMs represent a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.
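As a quick plausibility check on the figures quoted above, the following sketch computes the average number of patent entries per scan and compares it with the stated range of about 20 to 50 entries per page. It assumes, for illustration only, that every one of the 9,562 scans is a page of patent entries; the abstract does not say this, and the function and variable names are hypothetical.

```python
# Figures quoted in the abstract.
TOTAL_PATENTS = 306_070
TOTAL_SCANS = 9_562

# Approximate range of entries per page stated in the abstract.
MIN_PER_PAGE, MAX_PER_PAGE = 20, 50


def mean_entries_per_page(patents: int, scans: int) -> float:
    """Average number of patent entries per scan.

    Assumption (not from the source): every scan is a patent-list page.
    """
    return patents / scans


mean_entries = mean_entries_per_page(TOTAL_PATENTS, TOTAL_SCANS)
is_consistent = MIN_PER_PAGE <= mean_entries <= MAX_PER_PAGE

print(f"Mean entries per scan: {mean_entries:.1f}")
print(f"Within the stated 20-50 range: {is_consistent}")
```

Under that assumption the average comes to roughly 32 entries per scan, which sits inside the range the abstract reports.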

Suggested Citation

  • Niclas Griesshaber & Jochen Streb, 2025. "Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)," Papers 2512.19675, arXiv.org.
  • Handle: RePEc:arx:papers:2512.19675

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2512.19675
    File Function: Latest version
    Download Restriction: no


