
Quality aspects of annotated data

Author

Listed:
  • Jacob Beck

    (Ludwig-Maximilians-University Munich)

Abstract

The quality of Machine Learning (ML) applications is commonly assessed by quantifying how well an algorithm fits its respective training data. Yet, a perfect model that learns from and reproduces erroneous data will always be flawed in its real-world application. Hence, a comprehensive assessment of ML quality must include an additional data perspective, especially for models trained on human-annotated data. For the collection of human-annotated training data, best practices often do not exist, leaving researchers to make arbitrary decisions when collecting annotations. Decisions about the selection of annotators or label options may affect training data quality and model performance. In this paper, I outline and summarize previous research and approaches to the collection of annotated training data. I look at data annotation and its quality confounders from two perspectives: the set of annotators and the strategy of data collection. The paper highlights the various implementations of text and image annotation collection and stresses the importance of careful task construction. I conclude by illustrating the consequences for future research and applications of data annotation. The paper is intended to give readers a starting point for research on annotated data quality and to stress to researchers and practitioners the necessity of thoughtful consideration of the annotation collection process.

Suggested Citation

  • Jacob Beck, 2023. "Quality aspects of annotated data," AStA Wirtschafts- und Sozialstatistisches Archiv, Springer;Deutsche Statistische Gesellschaft - German Statistical Society, vol. 17(3), pages 331-353, December.
  • Handle: RePEc:spr:astaws:v:17:y:2023:i:3:d:10.1007_s11943-023-00332-y
    DOI: 10.1007/s11943-023-00332-y

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11943-023-00332-y
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



