Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)

Authors

Niclas Griesshaber
Jochen Streb

Abstract

We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.

Suggested Citation

Niclas Griesshaber & Jochen Streb, 2025. "Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)," Papers 2512.19675, arXiv.org.

Handle: RePEc:arx:papers:2512.19675

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Griesshaber, Niclas & Streb, Jochen, 2026. "Multimodal LLMs for historical dataset construction from archival image scans: German patents (1877-1918)," SAFE Working Paper Series 466, Leibniz Institute for Financial Research SAFE.
Francesco Cinnirella & Jochen Streb, 2017. "The role of human capital and innovation in economic development: evidence from post-Malthusian Prussia," Journal of Economic Growth, Springer, vol. 22(2), pages 193-227, June.
Diego Battiston & Stephan Maurer & Andrei Potlogea & Jose V. Rodriguez Mora, 2025. "The Short and Long Run Dynamics of the Great Gatsby Curve," Edinburgh School of Economics Discussion Paper Series 324, Edinburgh School of Economics, University of Edinburgh.
- Battiston, Diego & Maurer, Stephan & Potlogea, Andrei & Rodríguez Mora, José Vicente, 2025. "The short and long run dynamics of the Great Gatsby Curve," Working Papers 49, University of Konstanz, Cluster of Excellence "The Politics of Inequality. Perceptions, Participation and Policies".
- Battiston, Diego & Maurer, Stephan & Potlogea, Andrei & Rodríguez Mora, José, 2026. "The Short and Long Run Dynamics of the Great Gatsby Curve," IZA Discussion Papers 18679, IZA Network @ LISER.
Andrea Del Pizzo & Martin Nybom & Jan Stuhler, 2026. "Indirect Estimators of Intergenerational Mobility," RFBerlin Discussion Paper Series 26137, ROCKWOOL Foundation Berlin (RFBerlin).
- Andrea Del Pizzo & Martin Nybom & Jan Stuhler, 2026. "Indirect Estimators of Intergenerational Mobility," Papers 2605.19154, arXiv.org.
- Andrea Del Pizzo & Martin Nybom & Jan Stuhler, 2026. "Indirect Estimators of Intergenerational Mobility," CESifo Working Paper Series 12663, CESifo.
- Del Pizzo, Andrea & Nybom, Martin & Stuhler, Jan, 2026. "Indirect Estimators of Intergenerational Mobility," CEPR Discussion Papers 21516, Centre for Economic Policy Research.
- Del Pizzo, Andrea & Nybom, Martin & Stuhler, Jan, 2026. "Indirect Estimators of Intergenerational Mobility," IZA Discussion Papers 18641, IZA Network @ LISER.
Elisa Jácome & Ilyana Kuziemko & Suresh Naidu, 2021. "Mobility for All: Representative Intergenerational Mobility Estimates over the 20th Century," Working Papers 302, Princeton University, Department of Economics, Center for Economic Policy Studies.
Guido Alfani, 2022. "Epidemics, Inequality, and Poverty in Preindustrial and Early Industrial Times," Journal of Economic Literature, American Economic Association, vol. 60(1), pages 3-40, March.
- Guido Alfani, 2020. "Epidemics, inequality and poverty in preindustrial and early industrial times," Working Papers 2020-16, The George Washington University, Institute for International Economic Policy.
- Alfani, Guido, 2020. "Epidemics, inequality and poverty in preindustrial and early industrial times," CAGE Online Working Paper Series 520, Competitive Advantage in the Global Economy (CAGE).
- Guido Alfani, 2020. "Epidemics, inequality and poverty in preindustrial and early industrial times," Working Papers 0193, European Historical Economics Society (EHES).
- Stone Center & Alfani, Guido, 2020. "Epidemics, Inequality and Poverty in Preindustrial and Early Industrial Times," SocArXiv 36cqf, Center for Open Science.
Hwang, Sam Il Myoung & Squires, Munir, 2024. "Linked samples and measurement error in historical US census data," Explorations in Economic History, Elsevier, vol. 93(C).
Juliana Jaramillo-Echeverri, 2024. "Movilidad social en la educación: el caso de la Universidad de los Andes en Colombia entre 1949 y 2018," Cuadernos de Historia Económica 61, Banco de la Republica de Colombia.
Jakob B. Madsen & Fabrice Murtin, 2017. "British economic growth since 1270: the role of education," Journal of Economic Growth, Springer, vol. 22(3), pages 229-272, September.
Madsen, Jakob B. & Robertson, Peter E. & Ye, Longfeng, 2019. "Malthus was right: Explaining a millennium of stagnation," European Economic Review, Elsevier, vol. 118(C), pages 51-68.
- Jacob B. Madsen & Peter E. Robertson & Longfeng Ye, 2019. "Malthus Was Right: Explaining a Millennium of Stagnation," Economics Discussion / Working Papers 19-16, The University of Western Australia, Department of Economics.
David Andersson & Mounir Karadja & Erik Prawitz, 2022. "Mass Migration and Technological Change," Journal of the European Economic Association, European Economic Association, vol. 20(5), pages 1859-1896.
- Andersson, David & Karadja, Mounir & Prawitz, Erik, 2020. "Mass Migration and Technological Change," SocArXiv 74ub8, Center for Open Science.
Combes, Pierre-Philippe & Gobillon, Laurent & Zylberberg, Yanos, 2022. "Urban economics in a historical perspective: Recovering data with machine learning," Regional Science and Urban Economics, Elsevier, vol. 94(C).
- Gobillon, Laurent & Combes, Pierre-Philippe & Zylberberg, Yanos, 2020. "Urban economics in a historical perspective: Recovering data with machine learning," CEPR Discussion Papers 15308, Centre for Economic Policy Research.
- Pierre-Philippe Combes & Laurent Gobillon & Yanos Zylberberg, 2021. "Urban economics in a historical perspective: Recovering data with machine learning," PSE Working Papers halshs-03231786, HAL.
- Pierre-Philippe Combes & Laurent Gobillon & Yanos Zylberberg, 2022. "Urban Economics in a Historical Perspective: Recovering Data with Machine Learning," PSE-Ecole d'économie de Paris (Postprint) halshs-03673240, HAL.
- Pierre-Philippe Combes & Laurent Gobillon & Yanos Zylberberg, 2021. "Urban economics in a historical perspective: Recovering data with machine learning," Working Papers halshs-03231786, HAL.
- Pierre-Philippe Combes & Laurent Gobillon & Yanos Zylberberg, 2022. "Urban Economics in a Historical Perspective: Recovering Data with Machine Learning," Post-Print halshs-03673240, HAL.
- Combes, Pierre-Philippe & Gobillon, Laurent & Zylberberg, Yanos, 2021. "Urban Economics in a Historical Perspective: Recovering Data with Machine Learning," IZA Discussion Papers 14392, IZA Network @ LISER.
- Pierre-Philippe Combes & Laurent Gobillon & Yanos Zylberberg, 2022. "Urban Economics in a Historical Perspective: Recovering Data with Machine Learning," Sciences Po Economics Publications (main) halshs-03673240, HAL.
Alexandra M. de Pleijt, 2018. "Human capital formation in the long run: evidence from average years of schooling in England, 1300–1900," Cliometrica, Journal of Historical Economics and Econometric History, Association Française de Cliométrie (AFC), vol. 12(1), pages 99-126, January.
- Alexandra M. de Pleijt, 2018. "Human capital formation in the long run: evidence from average years of schooling in England, 1300–1900," Cliometrica, Springer;Cliometric Society (Association Francaise de Cliométrie), vol. 12(1), pages 99-126, January.
Jensen, Peter Sandholt & Pedersen, Maja Uhre & Radu, Cristina Victoria & Sharp, Paul Richard, 2022. "Arresting the Sword of Damocles: The transition to the post-Malthusian era in Denmark," Explorations in Economic History, Elsevier, vol. 84(C).
Lehmann-Hasemeyer, Sibylle H. & Prettner, Klaus & Tscheuschner, Paul, 2020. "The scientific revolution and its role in the transition to sustained economic growth," Hohenheim Discussion Papers in Business, Economics and Social Sciences 06-2020, University of Hohenheim, Faculty of Business, Economics and Social Sciences.
Fiaschi, Davide & Fioroni, Tamara, 2019. "Transition to modern growth in Great Britain: The role of technological progress, adult mortality and factor accumulation," Structural Change and Economic Dynamics, Elsevier, vol. 51(C), pages 472-490.
Madsen, Jakob & Strulik, Holger, 2024. "Inequality and the Industrial Revolution," European Economic Review, Elsevier, vol. 164(C).
James Foreman-Peck & Peng Zhou, 2021. "Correction to: fertility versus productivity: a model of growth with evolutionary equilibria," Journal of Population Economics, Springer;European Society for Population Economics, vol. 34(4), pages 1473-1474, October.
- James Foreman-Peck & Peng Zhou, 2021. "Fertility versus productivity: a model of growth with evolutionary equilibria," Journal of Population Economics, Springer;European Society for Population Economics, vol. 34(3), pages 1073-1104, July.
- Foreman-Peck, James & Zhou, Peng, 2020. "Fertility versus Productivity: A Model of Growth with Evolutionary Equilibria," Cardiff Economics Working Papers E2020/13, Cardiff University, Cardiff Business School, Economics Section.
Matteo Cervellati & Gerrit Meyerheim & Uwe Sunde, 2023. "The empirics of economic growth over time and across nations: a unified growth perspective," Journal of Economic Growth, Springer, vol. 28(2), pages 173-224, June.
- Cervellati, Matteo & Meyerheim, Gerrit & Sunde, Uwe, 2022. "The Empirics of Economic Growth Over Time and Across Nations: A Unified Growth Perspective," Rationality and Competition Discussion Paper Series 339, CRC TRR 190 Rationality and Competition.
- Cervellati, Matteo & Meyerheim, Gerrit & Sunde, Uwe, 2023. "The Empirics of Economic Growth Over Time and Across Nations: A Unified Growth Perspective," CEPR Discussion Papers 18057, Centre for Economic Policy Research.
Alfani, Guido & Gierok, Victoria & Schaff, Felix, 2025. "Poverty in Germany from the Black Death until the Beginning of Industrialization," Explorations in Economic History, Elsevier, vol. 95(C).

More about this item

NEP fields

This paper has been announced in the following NEP Reports:

NEP-AIN-2026-01-19 (Artificial Intelligence)
NEP-BIG-2026-01-19 (Big Data)
NEP-CMP-2026-01-19 (Computational Economics)
NEP-HIS-2026-01-19 (Business, Economic and Financial History)
