
Model Diversity Over Model Size: Unanimous LLM Ensembles Correct Over-Classification in Survey Coding

Author

  • Soria, Chris

Abstract

Large language models are increasingly used to classify open-ended survey responses, but they systematically over-classify, assigning categories too liberally on ambiguous cases and producing high sensitivity but low precision. This problem is most severe on subjectively ambiguous categories where models default to "yes" when uncertain. Drawing on the established principle that aggregating multiple noisy annotators outperforms any single annotator, we test whether ensembles of LLMs can correct this problem. Using four open-ended survey questions with human-coded ground truth (3,208 responses, 6 categories per question), we evaluate ensemble configurations across 16 models spanning three cost tiers and six providers. Unanimous voting (requiring all models to agree before assigning a category) directly corrects over-classification by dramatically improving specificity: on the most ambiguous categories, the false positive rate drops from 50% to 3%, and precision triples. This advantage concentrates precisely where over-classification is worst, on subjectively ambiguous categories with fuzzy boundaries, while categories with clear criteria show no benefit. This pattern replicates across three independent datasets. Cross-provider model diversity is the key ingredient: models from different providers make different errors on ambiguous cases, and consensus filters the idiosyncratic false positives. Temperature variation and within-family size scaling contribute nothing. As few as three diverse lower-tier models suffice to reliably exceed GPT-5. For the ambiguous classification problems common in open-ended survey research, the well-established annotation principle of multi-coder agreement transfers directly to LLMs: investing in diverse perspectives is more effective than investing in a single expensive model.
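The unanimous-voting rule described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`unanimous_vote`, `majority_vote`, `binary_metrics`) are hypothetical, the three "models" are hard-coded label lists rather than real LLM calls, and the toy labels are invented, not taken from the paper's data. The sketch only shows how requiring all models to agree filters false positives that individual models make on different responses.

```python
from typing import Dict, List, Sequence


def unanimous_vote(votes: Sequence[Sequence[int]]) -> List[int]:
    """Assign the category (1) only when every model assigns it."""
    return [int(all(row)) for row in zip(*votes)]


def majority_vote(votes: Sequence[Sequence[int]]) -> List[int]:
    """Assign the category when more than half of the models assign it."""
    n_models = len(votes)
    return [int(2 * sum(row) > n_models) for row in zip(*votes)]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, precision and false positive rate for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Invented toy data: 8 responses, one binary category.
# human_labels stands in for the human-coded ground truth.
human_labels = [1, 1, 0, 0, 0, 0, 1, 0]

# Three hypothetical models that each over-classify, but on different responses.
model_a = [1, 1, 1, 0, 1, 0, 1, 0]
model_b = [1, 1, 0, 1, 0, 0, 1, 1]
model_c = [1, 1, 0, 0, 1, 1, 1, 0]
all_votes = [model_a, model_b, model_c]

single = binary_metrics(human_labels, model_a)
majority = binary_metrics(human_labels, majority_vote(all_votes))
unanimous = binary_metrics(human_labels, unanimous_vote(all_votes))

for name, result in [("model_a", single), ("majority", majority), ("unanimous", unanimous)]:
    print(name, {k: round(v, 2) for k, v in result.items()})
```

In this toy setup each model's false positives fall on different responses, so the unanimity requirement removes them, while majority voting still lets through the one response on which two models err together. Unanimous voting can only remove positive assignments, never add them, so in general it trades some sensitivity for specificity; the toy data happen to keep sensitivity intact.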

Suggested Citation

  • Soria, Chris, 2026. "Model Diversity Over Model Size: Unanimous LLM Ensembles Correct Over-Classification in Survey Coding," SocArXiv er6mz_v1, Center for Open Science.
  • Handle: RePEc:osf:socarx:er6mz_v1
    DOI: 10.31219/osf.io/er6mz_v1

    Download full text from publisher

    File URL: https://osf.io/download/6a2212a1a57bb922fdb08428/
    Download Restriction: no


    References listed on IDEAS

    1. A. P. Dawid & A. M. Skene, 1979. "Maximum Likelihood Estimation of Observer Error‐Rates Using the EM Algorithm," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 28(1), pages 20-28, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yeganeh Alimohammadi & Karissa Huang & Christian Borgs & Jennifer Chayes, 2026. "Auditing the Auditors: Does Community-based Moderation Get It Right?," Papers 2603.18053, arXiv.org, revised May 2026.
    2. Xiaoxiao Yang & Jing Zhang & Jun Peng & Lihong Lei, 2021. "Incentive mechanism based on Stackelberg game under reputation constraint for mobile crowdsensing," International Journal of Distributed Sensor Networks, vol. 17(6), pages 15501477211, June.
    3. Junming Yin & Jerry Luo & Susan A. Brown, 2021. "Learning from Crowdsourced Multi-labeling: A Variational Bayesian Approach," Information Systems Research, INFORMS, vol. 32(3), pages 752-773, September.
    4. Aksoy, Cevat Giray & Dolls, Mathias & Klejdysz, Justyna & Peichl, Andreas & Windsteiger, Lisa, 2025. "Speaking of Debt: Framing, Guilt, and Economic Choices," CEPR Discussion Papers 20588, Centre for Economic Policy Research.
    5. Wanxue Dong & Maytal Saar-Tsechansky & Tomer Geva, 2025. "A Machine Learning Framework for Assessing Experts’ Decision Quality," Management Science, INFORMS, vol. 71(7), pages 5696-5721, July.
    6. Dustin Wright & Isabelle Augenstein, 2025. "Aggregating soft labels from crowd annotations improves uncertainty estimation under distribution shift," PLOS ONE, Public Library of Science, vol. 20(6), pages 1-28, June.
    7. Yuqing Kong, 2021. "Information Elicitation Meets Clustering," Papers 2110.00952, arXiv.org.
    8. Tomer Geva & Maytal Saar‐Tsechansky, 2021. "Who Is a Better Decision Maker? Data‐Driven Expert Ranking Under Unobserved Quality," Production and Operations Management, Production and Operations Management Society, vol. 30(1), pages 127-144, January.
    9. Jesus Cerquides & Mehmet Oğuz Mülâyim & Jerónimo Hernández-González & Amudha Ravi Shankar & Jose Luis Fernandez-Marquez, 2021. "A Conceptual Probabilistic Framework for Annotation Aggregation of Citizen Science Data," Mathematics, MDPI, vol. 9(8), pages 1-15, April.
    10. Shi, Hui & Zhu, Lida & Wang, Guofa & Lv, Chen & Ji, Yonghui & Yang, Lie & Yang, Jianyu & Pan, Ji'an & Yi, Ran, 2025. "Human-inspired risk level annotation for autonomous driving in open-pit mine environments," Energy, Elsevier, vol. 341(C).
    11. Ahfock, Daniel & McLachlan, Geoffrey J., 2021. "Harmless label noise and informative soft-labels in supervised classification," Computational Statistics & Data Analysis, Elsevier, vol. 161(C).
    12. Xiu Fang & Suxin Si & Guohao Sun & Quan Z. Sheng & Wenjun Wu & Kang Wang & Hang Lv, 2022. "Selecting Workers Wisely for Crowdsourcing When Copiers and Domain Experts Co-exist," Future Internet, MDPI, vol. 14(2), pages 1-22, January.
    13. Alaa Ghanaiem & Evgeny Kagan & Parteek Kumar & Tal Raviv & Peter Glynn & Irad Ben-Gal, 2023. "Unsupervised Classification under Uncertainty: The Distance-Based Algorithm," Mathematics, MDPI, vol. 11(23), pages 1-19, November.
    14. Jing Wang & Panagiotis G. Ipeirotis & Foster Provost, 2017. "Cost-Effective Quality Assurance in Crowd Labeling," Information Systems Research, INFORMS, vol. 28(1), pages 137-158, March.
