Printed from https://ideas.repec.org/p/arx/papers/2102.03239.html

Applications of Machine Learning in Document Digitisation

Author

Listed:
  • Christian M. Dahl
  • Torben S. D. Johansen
  • Emil N. Sørensen
  • Christian E. Westermann
  • Simon F. Wittrock

Abstract

Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that 'large and detailed' usually implies 'costly and difficult', especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. We instead advocate the use of modern machine learning techniques to automate the digitisation process. We give an overview of the potential for applying machine digitisation for data collection through two illustrative applications. The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to construct a treatment indicator. Moreover, it allows an assessment of assignment compliance. The second application uses attention-based neural networks for handwritten text recognition in order to transcribe age and birth and death dates from a large collection of Danish death certificates. We describe each step in the digitisation pipeline and provide implementation insights.

Suggested Citation

  • Christian M. Dahl & Torben S. D. Johansen & Emil N. Sørensen & Christian E. Westermann & Simon F. Wittrock, 2021. "Applications of Machine Learning in Document Digitisation," Papers 2102.03239, arXiv.org.
  • Handle: RePEc:arx:papers:2102.03239

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2102.03239
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Stefan Wager & Susan Athey, 2018. "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 113(523), pages 1228-1242, July.
    2. Susan Athey & Guido Imbens & Jonas Metzger & Evan Munro, 2019. "Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations," Papers 1909.02210, arXiv.org, revised Jul 2020.
    3. Joshua D. Angrist & Alan B. Krueger, 1991. "Does Compulsory School Attendance Affect Schooling and Earnings?," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 106(4), pages 979-1014.
    4. Frank Windmeijer & Helmut Farbmacher & Neil Davies & George Davey Smith, 2019. "On the Use of the Lasso for Instrumental Variables Estimation with Some Invalid Instruments," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 114(527), pages 1339-1350, July.
    5. Ran Abramitzky & Leah Boustan & Katherine Eriksson & James Feigenbaum & Santiago Pérez, 2021. "Automated Linking of Historical Data," Journal of Economic Literature, American Economic Association, vol. 59(3), pages 865-918, September.
    6. Hal R. Varian, 2014. "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, American Economic Association, vol. 28(2), pages 3-28, Spring.

    Citations


    Cited by:

    1. Jovet, Yoann & Lefèvre, Frédéric & Laurent, Alexis & Clausse, Marc, 2022. "Combined energetic, economic and climate change assessment of heat pumps for industrial waste heat recovery," Applied Energy, Elsevier, vol. 313(C).
    2. Dahl, Christian M. & Hansen, Casper W. & Jensen, Peter S. & Karlsson, Martin & Kühnle, Daniel, 2023. "School Closures, Mortality, and Human Capital: Evidence from the Universe of Closures during the 1918 Pandemic in Sweden," IZA Discussion Papers 16592, Institute of Labor Economics (IZA).
    3. Albers, Thilo N. H. & Kappner, Kalle, 2022. "Perks and Pitfalls of City Directories as a Micro-Geographic Data Source," Rationality and Competition Discussion Paper Series 315, CRC TRR 190 Rationality and Competition.
    4. Blomqvist, Christopher & Enflo, Kerstin & Jakobsson, Andreas & Åström, Kalle, 2023. "Reading the ransom: Methodological advancements in extracting the Swedish Wealth Tax of 1571," Explorations in Economic History, Elsevier, vol. 87(C).
    5. Dahl, Christian M. & Johansen, Torben S.D. & Sørensen, Emil N. & Wittrock, Simon, 2023. "HANA: A handwritten name database for offline handwritten text recognition," Explorations in Economic History, Elsevier, vol. 87(C).
    6. Albers, Thilo N.H. & Kappner, Kalle, 2023. "Perks and pitfalls of city directories as a micro-geographic data source," Explorations in Economic History, Elsevier, vol. 87(C).
    7. Caratozzolo, Vincenzo & Misuri, Alessio & Cozzani, Valerio, 2022. "A generalized equipment vulnerability model for the quantitative risk assessment of horizontal vessels involved in Natech scenarios triggered by floods," Reliability Engineering and System Safety, Elsevier, vol. 223(C).
    8. Claudia ANTAL-VAIDA, 2021. "Basic Hyperparameters Tuning Methods for Classification Algorithms," Informatica Economica, Academy of Economic Studies - Bucharest, Romania, vol. 25(2), pages 64-74.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Patrick Krennmair & Timo Schmid, 2022. "Flexible domain prediction using mixed effects random forests," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), pages 1865-1894, November.
    2. Arthur Charpentier & Emmanuel Flachaire & Antoine Ly, 2017. "Économétrie et Machine Learning," Papers 1708.06992, arXiv.org, revised Mar 2018.
    3. Combes, Pierre-Philippe & Gobillon, Laurent & Zylberberg, Yanos, 2022. "Urban economics in a historical perspective: Recovering data with machine learning," Regional Science and Urban Economics, Elsevier, vol. 94(C).
    4. Domenico Giannone & Michele Lenza & Giorgio E. Primiceri, 2021. "Economic Predictions With Big Data: The Illusion of Sparsity," Econometrica, Econometric Society, vol. 89(5), pages 2409-2437, September.
    5. Grodecka, Anna & Hull, Isaiah, 2019. "The Impact of Local Taxes and Public Services on Property Values," Working Paper Series 374, Sveriges Riksbank (Central Bank of Sweden).
    6. Guido W. Imbens, 2022. "Causality in Econometrics: Choice vs Chance," Econometrica, Econometric Society, vol. 90(6), pages 2541-2566, November.
    7. Lenza, Michele & Moutachaker, Inès & Paredes, Joan, 2023. "Density forecasts of inflation: a quantile regression forest approach," Working Paper Series 2830, European Central Bank.
    8. Arthur Charpentier & Emmanuel Flachaire & Antoine Ly, 2018. "Économétrie & Machine Learning," Working Papers hal-01568851, HAL.
    9. Daniele Guariso, 2018. "Terrorist Attacks and Immigration Rhetoric: A Natural Experiment on British MPs," Working Paper Series 1218, Department of Economics, University of Sussex Business School.
    10. Ajit Desai, 2023. "Machine Learning for Economics Research: When What and How?," Papers 2304.00086, arXiv.org, revised Apr 2023.
    11. Isaiah Hull & Anna Grodecka-Messi, 2022. "Measuring the Impact of Taxes and Public Services on Property Values: A Double Machine Learning Approach," Papers 2203.14751, arXiv.org.
    12. Susan Athey & Julie Tibshirani & Stefan Wager, 2016. "Generalized Random Forests," Papers 1610.01271, arXiv.org, revised Apr 2018.
    13. Michael C. Knaus & Michael Lechner & Anthony Strittmatter, 2022. "Heterogeneous Employment Effects of Job Search Programs: A Machine Learning Approach," Journal of Human Resources, University of Wisconsin Press, vol. 57(2), pages 597-636.
    14. Athey, Susan & Imbens, Guido W., 2019. "Machine Learning Methods Economists Should Know About," Research Papers 3776, Stanford University, Graduate School of Business.
    15. Alpino, Matteo & Hauge, Karen Evelyn & Kotsadam, Andreas & Markussen, Simen, 2022. "Effects of dialogue meetings on sickness absence—Evidence from a large field experiment," Journal of Health Economics, Elsevier, vol. 83(C).
    16. Olga Takacs & Janos Vincze, 2019. "Blinder-Oaxaca decomposition with recursive tree-based methods: a technical note," CERS-IE WORKING PAPERS 1923, Institute of Economics, Centre for Economic and Regional Studies.
    17. Mr. Andrew J Tiffin, 2019. "Machine Learning and Causality: The Impact of Financial Crises on Growth," IMF Working Papers 2019/228, International Monetary Fund.
    18. Alena Skolkova, 2023. "Instrumental Variable Estimation with Many Instruments Using Elastic-Net IV," CERGE-EI Working Papers wp759, The Center for Economic Research and Graduate Education - Economics Institute, Prague.
    19. Max H. Farrell & Tengyuan Liang & Sanjog Misra, 2021. "Deep Neural Networks for Estimation and Inference," Econometrica, Econometric Society, vol. 89(1), pages 181-213, January.
    20. Olga Takács & János Vincze, 2020. "The gender-dependent structure of wages in Hungary: results using machine learning techniques," CERS-IE WORKING PAPERS 2044, Institute of Economics, Centre for Economic and Regional Studies.
