
Evaluating Reasoning in Large Language Models with a Modified Think-a-Number Game: Case Study

Author

Listed:
  • Petr Hoza

Abstract

Background: Large language models (LLMs) excel at many tasks but often struggle when extended reasoning requires maintaining a consistent internal state. Identifying the threshold at which these systems fail as task complexity increases is essential for reliable deployment.

Objective: The primary objective was to examine whether four LLMs (GPT-3.5, GPT-4, GPT-4o-mini and GPT-4o) could preserve a hidden number and its arithmetic transformation across multiple yes/no queries, and to determine whether a specific point of reasoning breakdown exists.

Methods: A modified "Think-a-Number" game was employed, with complexity defined by the number of sequential yes/no queries (ranging from 1 to 9 or 11). Seven prompting strategies, including chain-of-thought variants, counterfactual prompts and few-shot examples, were evaluated. An outcome was counted as correct if the model's revealed number and transformation remained consistent with its prior answers.

Results: Analysis of tens of thousands of trials showed no distinct performance cliff up to 9-11 queries, indicating that modern LLMs are more capable of consecutive reasoning than previously assumed. Counterfactual and certain chain-of-thought prompts outperformed simpler baselines. GPT-4o and GPT-4o-mini attained higher overall correctness, whereas GPT-3.5 and GPT-4 more often produced contradictory answers or premature disclosures.

Conclusion: In a controlled, scalable reasoning scenario, these LLMs demonstrated notable resilience to multi-step prompts. Both prompt design and model selection significantly influenced performance. Further research involving more intricate tasks and higher query counts is recommended to delineate the upper bounds of LLM internal consistency.
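
As a compact illustration of the evaluation described in the abstract, the following Python sketch scores a single trial of the modified game: the model answers k sequential yes/no queries, then reveals its hidden number and transformation, and the trial counts as correct only if the reveal is consistent with every earlier answer. This is a minimal sketch, not the paper's actual harness; the function ask_model, the reveal format and the specific query set are assumptions for illustration.

    from typing import Callable, List, Tuple

    # A yes/no question about the (hidden number n, transformed number t) pair.
    Query = Tuple[str, Callable[[int, int], bool]]

    # Hypothetical query set; the paper's actual queries are not given in the abstract.
    QUERIES: List[Query] = [
        ("Is the hidden number even?",                lambda n, t: n % 2 == 0),
        ("Is the hidden number greater than 50?",     lambda n, t: n > 50),
        ("Is the transformed number even?",           lambda n, t: t % 2 == 0),
        ("Is the transformed number greater than 100?", lambda n, t: t > 100),
    ]

    def consistent(revealed_n: int, revealed_t: int,
                   answers: List[bool], queries: List[Query]) -> bool:
        # A trial is correct iff every earlier yes/no answer matches what the
        # revealed (number, transformed number) pair would imply.
        return all(pred(revealed_n, revealed_t) == ans
                   for (_, pred), ans in zip(queries, answers))

    def run_trial(ask_model: Callable[[str], str], k: int) -> bool:
        # Complexity is the number of sequential yes/no queries, k.
        queries = QUERIES[:k]
        answers = [ask_model(q).strip().lower().startswith("yes")
                   for q, _ in queries]
        # Hypothetical reveal format: "number=<n> transformed=<t>".
        reveal = ask_model("Reveal your number and its transformation "
                           "as 'number=<n> transformed=<t>'.")
        fields = dict(part.split("=") for part in reveal.split())
        return consistent(int(fields["number"]), int(fields["transformed"]),
                          answers, queries)

    if __name__ == "__main__":
        # Stand-in for an LLM that truthfully holds n=42 with transformation n*2.
        def fake_model(prompt: str) -> str:
            n, t = 42, 84
            if prompt.startswith("Reveal"):
                return f"number={n} transformed={t}"
            return "yes" if dict(QUERIES)[prompt](n, t) else "no"

        print(run_trial(fake_model, k=4))  # True: the reveal matches all answers

Scaling complexity then amounts to increasing k, mirroring how the study varies the query count from 1 up to 9 or 11.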

Suggested Citation

  • Petr Hoza. "Evaluating Reasoning in Large Language Models with a Modified Think-a-Number Game: Case Study," Acta Informatica Pragensia, Prague University of Economics and Business, preprint.
  • Handle: RePEc:prg:jnlaip:v:preprint:id:273
    DOI: 10.18267/j.aip.273

    Download full text from publisher

    File URL: http://aip.vse.cz/doi/10.18267/j.aip.273.html
    Download Restriction: free of charge

    File URL: https://libkey.io/10.18267/j.aip.273?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    As access to this document is restricted, you may want to search for a different version of it.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:prg:jnlaip:v:preprint:id:273. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help add them by using this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Stanislav Vojir (email available below). General contact details of provider: https://edirc.repec.org/data/uevsecz.html.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.