{"id":7462,"date":"2024-07-11T19:50:00","date_gmt":"2024-07-11T19:50:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2024\/07\/11\/reasoning-skills-of-large-language-models-are-often-overestimated\/"},"modified":"2024-07-11T19:50:00","modified_gmt":"2024-07-11T19:50:00","slug":"reasoning-skills-of-large-language-models-are-often-overestimated","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2024\/07\/11\/reasoning-skills-of-large-language-models-are-often-overestimated\/","title":{"rendered":"Reasoning skills of large language models are often overestimated"},"content":{"rendered":"<p>Author: Rachel Gordon | MIT CSAIL<\/p>\n<div>\n<p>When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.<\/p>\n<p>MIT&#8217;s Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers recently peered into the proverbial magnifying glass to examine how LLMs fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.<\/p>\n<p>The study compared \u201cdefault tasks,\u201d the common tasks a model is trained and tested on, with \u201ccounterfactual scenarios,\u201d hypothetical situations deviating from default conditions \u2014 which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed some tests outside the models\u2019 comfort zones by tweaking existing tasks instead of creating entirely new ones. 
They used a variety of datasets and benchmarks specifically tailored to different aspects of the models&#8217; capabilities, such as arithmetic, chess, code evaluation, and answering logical questions.<\/p>\n<p>When users interact with language models, any arithmetic is usually in base-10, the number base most familiar to the models. But observing that they do well on base-10 could give a false impression that they are strongly competent at addition. Logically, if they truly possess good addition skills, you\u2019d expect reliably high performance across all number bases, similar to calculators or computers. Indeed, the research showed that these models are not as robust as many initially think. Their high performance is limited to common task variants, and they suffer consistent and severe performance drops in unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.<\/p>\n<p>The pattern held true for many other tasks like musical chord fingering, spatial reasoning, and even chess problems where the starting positions of pieces were slightly altered. While human players are expected to still be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and couldn\u2019t perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. And much of their performance on the standard tasks is likely due not to general task abilities but to overfitting to, or directly memorizing, what they have seen in their training data.<\/p>\n<p>\u201cWe\u2019ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. 
This insight is crucial as we strive to enhance these models\u2019 adaptability and broaden their application horizons,\u201d says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new <a href=\"https:\/\/arxiv.org\/abs\/2307.02477\" target=\"_blank\" rel=\"noopener\">paper<\/a> about the research. \u201cAs AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.\u201d<\/p>\n<p>Despite the insights gained, there are, of course, limitations. The study\u2019s focus on specific tasks and settings didn\u2019t capture the full range of challenges the models could potentially encounter in real-world applications, signaling the need for more diverse testing environments. Future work could involve expanding the range of tasks and counterfactual conditions to uncover more potential weaknesses. This could mean looking at more complex and less common scenarios. The team also wants to improve interpretability by creating methods to better comprehend the rationale behind the models\u2019 decision-making processes.<\/p>\n<p>\u201cAs language models scale up, understanding their training data becomes increasingly challenging even for open models, let alone proprietary ones,\u201d says Hao Peng, assistant professor at the University of Illinois at Urbana-Champaign. \u201cThe community remains puzzled about whether these models genuinely generalize to unseen tasks, or seemingly succeed by memorizing the training data. This paper makes important strides in addressing this question. It constructs a suite of carefully designed counterfactual evaluations, providing fresh insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is perhaps far more limited than anticipated by many. 
It has the potential to inspire future research towards identifying the failure modes of today\u2019s models and developing better ones.\u201d<\/p>\n<p>Additional authors include Najoung Kim, who is a Boston University assistant professor and Google visiting researcher, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Aky\u00fcrek SM \u201921, and Boyuan Chen; former postdoc and Apple AI\/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.\u00a0<\/p>\n<p>The team\u2019s study was supported, in part, by the MIT\u2013IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.<\/p>\n<\/div>\n<p><a href=\"https:\/\/news.mit.edu\/2024\/reasoning-skills-large-language-models-often-overestimated-0711\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Rachel Gordon | MIT CSAIL When it comes to artificial intelligence, appearances can be deceiving. 
The mystery surrounding the inner workings of large language [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2024\/07\/11\/reasoning-skills-of-large-language-models-are-often-overestimated\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":459,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/7462"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=7462"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/7462\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/464"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=7462"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=7462"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=7462"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}