{"id":7506,"date":"2024-07-31T04:00:00","date_gmt":"2024-07-31T04:00:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2024\/07\/31\/method-prevents-an-ai-model-from-being-overconfident-about-wrong-answers\/"},"modified":"2024-07-31T04:00:00","modified_gmt":"2024-07-31T04:00:00","slug":"method-prevents-an-ai-model-from-being-overconfident-about-wrong-answers","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2024\/07\/31\/method-prevents-an-ai-model-from-being-overconfident-about-wrong-answers\/","title":{"rendered":"Method prevents an AI model from being overconfident about wrong answers"},"content":{"rendered":"<p>Author: Adam Zewe | MIT News<\/p>\n<div>\n<p>People use large language models for a huge array of tasks, from translating an article to identifying financial fraud. However, despite the incredible capabilities and versatility of these models, they sometimes generate\u00a0inaccurate responses.<\/p>\n<p>On top of that problem, the models can be overconfident about wrong answers or underconfident about correct ones, making it tough for a user to know when a model can be trusted.<\/p>\n<p>Researchers typically calibrate a machine-learning model to ensure its level of confidence lines up with its accuracy. A well-calibrated model should have less confidence about an incorrect prediction, and vice-versa. But because large language models (LLMs) can be applied to a seemingly endless collection of\u00a0diverse tasks, traditional calibration methods are ineffective.<\/p>\n<p>Now, researchers from MIT and the MIT-IBM Watson AI Lab have introduced a calibration method tailored to large language models. 
Their method, called <a href=\"https:\/\/arxiv.org\/pdf\/2403.08819\" target=\"_blank\" rel=\"noopener\">Thermometer<\/a>, involves building a smaller, auxiliary model that runs on top of a large language model to calibrate it.<\/p>\n<p>Thermometer is more efficient than other approaches, requiring less power-hungry computation, while preserving the accuracy of the model and enabling it to produce better-calibrated responses on tasks it has not seen before.<\/p>\n<p>By enabling efficient calibration of an LLM for a variety of tasks, Thermometer could help users pinpoint situations where a model is overconfident about false predictions, ultimately preventing them from deploying that model in a situation where it may fail.<\/p>\n<p>\u201cWith Thermometer, we want to provide the user with a clear signal to tell them whether a model\u2019s response is accurate or inaccurate, in a way that reflects the model\u2019s uncertainty, so they know if that model is reliable,\u201d says Maohao Shen, an electrical engineering and computer science (EECS) graduate student and lead author of a <a href=\"https:\/\/arxiv.org\/pdf\/2403.08819\" target=\"_blank\" rel=\"noopener\">paper on Thermometer<\/a>.<\/p>\n<p>Shen is joined on the paper by Gregory Wornell, the Sumitomo Professor of Engineering who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics, and is a member of the MIT-IBM Watson AI Lab; senior author Soumya Ghosh, a research staff member in the MIT-IBM Watson AI Lab; as well as others at MIT and the MIT-IBM Watson AI Lab. The research was recently presented at the International Conference on Machine Learning.<\/p>\n<p><strong>Universal calibration<\/strong><\/p>\n<p>Since traditional machine-learning models are typically designed to perform a\u00a0single task, calibrating them usually involves one task-specific method. 
On the other hand, since LLMs have the flexibility to perform many tasks, using a traditional method to calibrate that model for one task might hurt its performance on another task.<\/p>\n<p>Calibrating an LLM often involves\u00a0sampling from the model multiple times\u00a0to obtain different predictions and then aggregating these predictions to obtain better-calibrated confidence. However, because these models have billions of parameters, the computational costs of\u00a0such\u00a0approaches rapidly add up.<\/p>\n<p>\u201cIn a sense, large language models are universal because they can handle\u00a0various tasks. So, we need a universal calibration method that can also handle many different tasks,\u201d says Shen.<\/p>\n<p>With Thermometer, the researchers developed a versatile technique that leverages a classical calibration method called temperature scaling to efficiently calibrate an LLM for a new task.<\/p>\n<p>In this context, a \u201ctemperature\u201d is a\u00a0scaling parameter used to\u00a0adjust a\u00a0model\u2019s confidence\u00a0to be aligned with its prediction accuracy. Traditionally, one determines the right temperature using a labeled validation dataset of task-specific examples.<\/p>\n<p>Since LLMs are often applied to new tasks, labeled datasets can be nearly impossible to\u00a0acquire. 
For instance, a user who wants to deploy an LLM to answer customer questions about a new product likely does not have a dataset containing such questions and answers.<\/p>\n<p>Instead of using a labeled dataset, the researchers train an auxiliary model that runs on top of an LLM to automatically predict the temperature needed to calibrate it for this new task.<\/p>\n<p>They use labeled datasets of a few representative tasks to train the Thermometer model, but once it has been trained, it can generalize to new tasks in\u00a0a similar category without the need for\u00a0additional labeled data.<\/p>\n<p>A Thermometer model trained on a\u00a0collection of multiple-choice question datasets, perhaps including one with algebra questions and one with\u00a0medical questions, could be used to calibrate an LLM that will answer questions about geometry or\u00a0biology, for instance.<\/p>\n<p>\u201cThe aspirational goal is for it to work on any task, but we are not quite there yet,\u201d Ghosh says.<\/p>\n<p>The Thermometer model only needs to access a small part of the LLM\u2019s inner workings to predict the right temperature that will calibrate its prediction for data points of a specific task.<\/p>\n<p><strong>An efficient approach<\/strong><\/p>\n<p>Importantly, the technique does not require multiple training runs and only slightly slows the LLM. 
Plus, since temperature scaling does not alter a model\u2019s predictions, Thermometer preserves its accuracy.<\/p>\n<p>When they compared Thermometer to several baselines on multiple tasks, it consistently produced better-calibrated uncertainty measures while requiring much less computation.<\/p>\n<p>\u201cAs long as we train a Thermometer model on a sufficiently large number of tasks, it should be able to generalize well across any new task, just like a large language model, it is also a universal model,\u201d Shen adds.<\/p>\n<p>The researchers also found that if they train a Thermometer model for a smaller LLM, it can be\u00a0directly\u00a0applied to\u00a0calibrate\u00a0a larger LLM within the same family.<\/p>\n<p>In the future, they want to adapt Thermometer for more complex text-generation tasks and apply the technique to even larger LLMs. The researchers also hope to quantify the\u00a0diversity and\u00a0number of labeled datasets one would need to train a Thermometer model so it can generalize to a new task.<\/p>\n<p>This research was funded, in part, by the MIT-IBM Watson AI Lab.<\/p>\n<\/div>\n<p><a href=\"https:\/\/news.mit.edu\/2024\/thermometer-prevents-ai-model-overconfidence-about-wrong-answers-0731\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Adam Zewe | MIT News People use large language models for a huge array of tasks, from translating an article to identifying financial fraud. 
[&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2024\/07\/31\/method-prevents-an-ai-model-from-being-overconfident-about-wrong-answers\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":467,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/7506"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=7506"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/7506\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/467"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=7506"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=7506"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=7506"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}