Printed from https://ideas.repec.org/p/arx/papers/2305.14672.html

Quantifying Character Similarity with Vision Transformers

Author

Listed:
  • Xinmei Yang
  • Abhishek Arora
  • Shao-Yu Jheng
  • Melissa Dell

Abstract

Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions are more likely, which improve the accuracy of string matching. However, such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR'ed documents by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters (those with similar appearance, such as "O" and "0") have similar vector representations. Using the cosine distance between characters' representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods, as OCR errors tend to be homoglyphic in nature. Homoglyphs can plausibly capture character visual similarity across any script, including low-resource settings. We illustrate this by creating homoglyph sets for 3,000-year-old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT is able to capture relationships, previously noted in the archaeological literature, in how different abstract concepts were conceptualized by ancient societies.
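The matching idea described in the abstract, an edit distance whose substitution cost is the cosine distance between character representations, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the unit insertion/deletion cost, the fallback cost of 1.0 for characters without a vector, and the three-dimensional toy vectors are all assumptions made for illustration. In the paper the vectors come from a contrastively trained ViT, which is not reproduced here.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def weighted_edit_distance(s, t, embeddings, indel_cost=1.0):
    """Edit distance with embedding-based substitution costs.

    `embeddings` maps a character to its vector. Characters without a
    vector fall back to a substitution cost of 1.0 (an assumption of
    this sketch, not something stated in the abstract).
    """

    def sub_cost(a, b):
        if a == b:
            return 0.0
        if a in embeddings and b in embeddings:
            # Guard against tiny negative values from floating-point error.
            return max(0.0, cosine_distance(embeddings[a], embeddings[b]))
        return 1.0

    # Standard dynamic program, keeping only the previous row.
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost,          # delete a from s
                curr[j - 1] + indel_cost,      # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy vectors: "O" and "0" are placed close together,
# "X" far from both. Real vectors would come from the trained model.
TOY_EMBEDDINGS = {
    "O": [1.0, 0.0, 0.1],
    "0": [0.98, 0.05, 0.12],
    "X": [0.0, 1.0, 0.0],
}

homoglyph_error = weighted_edit_distance("R0AD", "ROAD", TOY_EMBEDDINGS)
unrelated_error = weighted_edit_distance("RXAD", "ROAD", TOY_EMBEDDINGS)
print(homoglyph_error < unrelated_error)
```

With these toy vectors, a string containing a homoglyphic OCR error ("0" for "O") is scored as much closer to the clean string than one containing a visually unrelated substitution, which is the behavior the abstract attributes to homoglyph-aware matching.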

Suggested Citation

  • Xinmei Yang & Abhishek Arora & Shao-Yu Jheng & Melissa Dell, 2023. "Quantifying Character Similarity with Vision Transformers," Papers 2305.14672, arXiv.org.
  • Handle: RePEc:arx:papers:2305.14672

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2305.14672
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Abhishek Arora & Xinmei Yang & Shao-Yu Jheng & Melissa Dell, 2023. "Linking Representations with Multimodal Contrastive Learning," Papers 2304.03464, arXiv.org, revised Apr 2023.
    2. Melissa Dell & Benjamin A Olken, 2020. "The Development Effects of the Extractive Colonial Economy: The Dutch Cultivation System in Java," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 87(1), pages 164-203.
    3. Droller, Federico & Fiszbein, Martin, 2021. "Staple Products, Linkages, and Development: Evidence from Argentina," The Journal of Economic History, Cambridge University Press, vol. 81(3), pages 723-762, September.
    4. Shawn Xiaoguang Chen & Yudan Cheng & Liutang Gong & Wenjia Tian, 2023. "A Big Push of Panda from the Ground: Land Subsidy and Structural Transformation in China," Economics Discussion / Working Papers 23-09, The University of Western Australia, Department of Economics.
    5. Thomas Stringham, 2022. "Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 40(4), pages 1509-1522, October.
    6. Bo, Shiyu & Liu, Cong & Zhou, Yan, 2023. "Military investment and the rise of industrial clusters: Evidence from China’s self-strengthening movement," Journal of Development Economics, Elsevier, vol. 161(C).
    7. Stefano Breschi & Francesco Lissoni & Ernest Miguelez, 2017. "Foreign-origin inventors in the USA: testing for diaspora and brain gain effects," Journal of Economic Geography, Oxford University Press, vol. 17(5), pages 1009-1038.
    8. Zheng, Liang & Zhao, Zhong, 2017. "What drives spatial clusters of entrepreneurship in China? Evidence from economic census data," China Economic Review, Elsevier, vol. 46(C), pages 229-248.
    9. Lo Turco, Alessia & Maggioni, Daniela & Zazzaro, Alberto, 2019. "Financial dependence and growth: The role of input-output linkages," Journal of Economic Behavior & Organization, Elsevier, vol. 162(C), pages 308-328.
    10. Vasco M. Carvalho & Alireza Tahbaz-Salehi, 2019. "Production Networks: A Primer," Annual Review of Economics, Annual Reviews, vol. 11(1), pages 635-663, August.
    11. Philippe Martin & Thierry Mayer & Florian Mayneris, 2008. "Spatial Concentration and Firm-Level Productivity in France," Sciences Po publications 6858, Sciences Po.
    12. Emanuela Marrocu & Raffaele Paci & Stefano Usai, 2013. "Productivity Growth In The Old And New Europe: The Role Of Agglomeration Externalities," Journal of Regional Science, Wiley Blackwell, vol. 53(3), pages 418-442, August.
    13. Anna M. Ferragina & Giulia Nunziante, 2018. "Are Italian firms performances influenced by innovation of domestic and foreign firms nearby in space and sectors?," Economia e Politica Industriale: Journal of Industrial and Business Economics, Springer;Associazione Amici di Economia e Politica Industriale, vol. 45(3), pages 335-360, September.
    14. Wang, Liang & Tan, Justin & Li, Wan, 2018. "The impacts of spatial positioning on regional new venture creation and firm mortality over the industry life cycle," Journal of Business Research, Elsevier, vol. 86(C), pages 41-52.
    15. Bahar, Dany & Rosenow, Samuel & Stein, Ernesto & Wagner, Rodrigo, 2019. "Export take-offs and acceleration: Unpacking cross-sector linkages in the evolution of comparative advantage," World Development, Elsevier, vol. 117(C), pages 48-60.
    16. David Rezza Baqaee & Emmanuel Farhi, 2019. "The Macroeconomic Impact of Microeconomic Shocks: Beyond Hulten's Theorem," Econometrica, Econometric Society, vol. 87(4), pages 1155-1203, July.
    17. Ekaterina Aleksandrova & Kristian Behrens & Maria Kuznetsova, 2020. "Manufacturing (co)agglomeration in a transition country: Evidence from Russia," Journal of Regional Science, Wiley Blackwell, vol. 60(1), pages 88-128, January.
    18. Ufuk Akcigit & Harun Alp & André Diegmann & Nicolas Serrano-Velarde, 2023. "Committing to Grow: Privatizations and Firm Dynamics in East Germany," Working Papers 685, IGIER (Innocenzo Gasparini Institute for Economic Research), Bocconi University.
    19. Hyuk-Soo Kwon & Jihong Lee & Sokbae Lee & Ryungha Oh, 2022. "Knowledge spillovers and patent citations: trends in geographic localization, 1976–2015," Economics of Innovation and New Technology, Taylor & Francis Journals, vol. 31(3), pages 123-147, April.
    20. Stephen J. Redding, 2010. "The Empirics Of New Economic Geography," Journal of Regional Science, Wiley Blackwell, vol. 50(1), pages 297-311, February.
