Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons

Author

Listed:

Gignac, Gilles E.
Ilić, David

Abstract

Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (r ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.
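To make the short-form idea concrete, here is a minimal sketch in Python of one way such a form could be derived and checked. It is not the authors' procedure: item selection by corrected item-total correlation is an assumption made for illustration, Cronbach's alpha is used as a simpler stand-in for coefficient omega, and the 0/1 score matrix is simulated from a Rasch-like model rather than taken from real LLM results. All function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sample_variance(x):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(x)
    m = sum(x) / n
    return sum((a - m) ** 2 for a in x) / (n - 1)


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of rows (models) by columns (items).

    Assumption: alpha stands in for coefficient omega, which needs a factor model.
    """
    k = len(scores[0])
    item_vars = [sample_variance([row[j] for row in scores]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(scores):
    """Correlate each item with the total score computed without that item."""
    totals = [sum(row) for row in scores]
    result = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        rest = [t - i for t, i in zip(totals, item)]
        result.append(pearson(item, rest))
    return result


def select_short_form(scores, n_items):
    """Return the column indices of the n_items most discriminating items."""
    r = corrected_item_total(scores)
    ranked = sorted(range(len(r)), key=lambda j: r[j], reverse=True)
    return sorted(ranked[:n_items])


def simulate_scores(n_models, n_items, seed=0):
    """Simulated 0/1 responses from a Rasch-like model (illustrative data only)."""
    rng = random.Random(seed)
    difficulty = [rng.uniform(-2.0, 2.0) for _ in range(n_items)]
    data = []
    for _ in range(n_models):
        ability = rng.gauss(0.0, 1.5)
        data.append(
            [int(rng.random() < 1 / (1 + math.exp(b - ability))) for b in difficulty]
        )
    return data


if __name__ == "__main__":
    long_form = simulate_scores(n_models=200, n_items=300)
    keep = select_short_form(long_form, 60)
    short_form = [[row[j] for j in keep] for row in long_form]
    short_totals = [sum(row) for row in short_form]
    long_totals = [sum(row) for row in long_form]
    print("alpha (short form):", round(cronbach_alpha(short_form), 3))
    print("short-long correlation:", round(pearson(short_totals, long_totals), 3))
```

In this sketch the short-long correlation is inflated somewhat because the 60 selected items are also part of the long form; correlating the short form with the remaining items would give a more conservative estimate.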

Suggested Citation

Gignac, Gilles E. & Ilić, David, 2025. "Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons," Intelligence, Elsevier, vol. 110(C).

Handle: RePEc:eee:intell:v:110:y:2025:i:c:s016028962500025x
DOI: 10.1016/j.intell.2025.101922

References listed on IDEAS

E. Cureton, 1966. "Corrected item-test correlations," Psychometrika, Springer;The Psychometric Society, vol. 31(1), pages 93-96, March.
Lee Cronbach, 1951. "Coefficient alpha and the internal structure of tests," Psychometrika, Springer;The Psychometric Society, vol. 16(3), pages 297-334, September.
Walter Kristof, 1969. "Estimation of true score and error variance for tests under various equivalence assumptions," Psychometrika, Springer;The Psychometric Society, vol. 34(4), pages 489-507, December.
Gignac, Gilles E., 2024. "Rethinking the Dunning-Kruger effect: Negligible influence on a limited segment of the population," Intelligence, Elsevier, vol. 104(C).
Gignac, Gilles E. & Szodorai, Eva T., 2024. "Defining intelligence: Bridging the gap between human and artificial perspectives," Intelligence, Elsevier, vol. 104(C).
Chalmers, R. Philip, 2012. "mirt: A Multidimensional Item Response Theory Package for the R Environment," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 48(i06).
Myszkowski, Nils & Storme, Martin, 2018. "A snapshot of g? Binary and polytomous item-response theory investigations of the last series of the Standard Progressive Matrices (SPM-LS)," Intelligence, Elsevier, vol. 68(C), pages 109-116.
Ilić, David & Gignac, Gilles E., 2024. "Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement?," Intelligence, Elsevier, vol. 106(C).

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Klaas Sijtsma & Jules L. Ellis & Denny Borsboom, 2024. "Recognize the Value of the Sum Score, Psychometrics’ Greatest Accomplishment," Psychometrika, Springer;The Psychometric Society, vol. 89(1), pages 84-117, March.
Abhijeet Singh & Mauricio Romero & Karthik Muralidharan, 2022. "Covid-19 Learning Loss and Recovery: Panel Data Evidence from India," NBER Working Papers 30552, National Bureau of Economic Research, Inc.
- Abhijeet Singh & Mauricio Romero & Karthik Muralidharan, 2022. "Covid-19 Learning Loss and Recovery: Panel Data Evidence from India," CESifo Working Paper Series 10031, CESifo.
Anthony C Waddimba & David C Mohr & Howard B Beckman & Mark M Meterko, 2020. "Physicians’ perceptions of autonomy support during transition to value-based reimbursement: A multi-center psychometric evaluation of six-item and three-item measures," PLOS ONE, Public Library of Science, vol. 15(4), pages 1-29, April.
Walter Kristof, 1971. "On the theory of a set of tests which differ only in length," Psychometrika, Springer;The Psychometric Society, vol. 36(3), pages 207-225, September.
Gignac, Gilles E., 2024. "Rethinking the Dunning-Kruger effect: Negligible influence on a limited segment of the population," Intelligence, Elsevier, vol. 104(C).
Sara Fernandes & Guillaume Fond & Xavier Zendjidjian & Pierre Michel & Karine Baumstarck & Christophe Lançon & Ludovic Samalin & Pierre-Michel Llorca & Magali Coldefy & Pascal Auquier & Laurent Boyer, 2022. "Development and Calibration of the PREMIUM Item Bank for Measuring Respect and Dignity for Patients with Severe Mental Illness," Post-Print hal-03649277, HAL.
Jari Metsämuuronen, 2012. "Challenges of the Fennema-Sherman Test in the International Comparisons," International Journal of Psychological Studies, Canadian Center of Science and Education, vol. 4(3), pages 1-1, September.
Ron D. Hays & Karen L. Spritzer & Steven P. Reise, 2021. "Using Item Response Theory to Identify Responders to Treatment: Examples with the Patient-Reported Outcomes Measurement Information System (PROMIS®) Physical Function Scale and Emotional Distress Comp," Psychometrika, Springer;The Psychometric Society, vol. 86(3), pages 781-792, September.
Anrafel de Souza Barbosa & Maria Cristina Crispim & Luiz Bueno da Silva & Jonhatan Magno Norte da Silva & Aglaucibelly Maciel Barbosa & Sandra Naomi Morioka, 2024. "How can organizations measure the integration of environmental, social, and governance (ESG) criteria? Validation of an instrument using item response theory to capture workers' perception," Business Strategy and the Environment, Wiley Blackwell, vol. 33(4), pages 3607-3634, May.
Francisco Liébana-Cabanillas & Nidhi Singh & Zoran Kalinic & Elena Carvajal-Trujillo, 2021. "Examining the determinants of continuance intention to use and the moderating effect of the gender and age of users of NFC mobile payments: a multi-analytical approach," Information Technology and Management, Springer, vol. 22(2), pages 133-161, June.
Yoon, Junghyun & Lee, Hee Yong & Dinwoodie, John, 2015. "Competitiveness of container terminal operating companies in South Korea and the industry–university–government network," Transportation Research Part A: Policy and Practice, Elsevier, vol. 80(C), pages 1-14.
Izolda Pristojkovic Suko & Magdalena Holter & Erwin Stolz & Elfriede Renate Greimel & Wolfgang Freidl, 2022. "Acculturation, Adaptation, and Health among Croatian Migrants in Austria and Ireland: A Cross-Sectional Study," IJERPH, MDPI, vol. 19(24), pages 1-15, December.
Md. Mominur Rahman & Bilkis Akhter, 2021. "The impact of investment in human capital on bank performance: evidence from Bangladesh," Future Business Journal, Springer, vol. 7(1), pages 1-13, December.
Usunier, Jean-Claude, 1998. "Oral pleasure and expatriate satisfaction: an empirical approach," International Business Review, Elsevier, vol. 7(1), pages 89-110, February.
Abdul Kadar Muhammad Masum & Md Abul Kalam Azad & Loo-See Beh, 2015. "Determinants of Academics' Job Satisfaction: Empirical Evidence from Private Universities in Bangladesh," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-15, February.
Amolo Elvis Juma Amolo & Charles Mallans Rambo & Charles Misiko Wafula, 2024. "Hedging Derivatives and Performance of Renewable Energy Projects in Kenya," International Journal of Research and Scientific Innovation, International Journal of Research and Scientific Innovation (IJRSI), vol. 11(8), pages 619-630, August.
Sharma, Vivek & Bhat, Dada Ab Rouf, 2020. "An empirical study exploring the relationship among human capital innovation, service innovation, competitive advantage and employee productivity in hospitality services," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 9(2), pages 1-14.
Deepak, 2016. "Antecedent Value of Professional Commitment and Job Involvement in Determining Job Satisfaction," Management and Labour Studies, XLRI Jamshedpur, School of Business Management & Human Resources, vol. 41(2), pages 154-164, May.
Abernethy, Margaret A. & Vagnoni, Emidia, 2004. "Power, organization design and managerial behaviour," Accounting, Organizations and Society, Elsevier, vol. 29(3-4), pages 207-225.
Marianela Denegri & María Baeza & Natalia Salinas-Oñate & Verónica Peñaloza & Horacio Miranda & Ligia Orellana, 2014. "Materialism in Pedagogy Students in Chile," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 117(2), pages 505-521, June.
