{"id":9155,"date":"2026-06-26T13:00:00","date_gmt":"2026-06-26T13:00:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2026\/06\/26\/llms-help-robots-understand-vague-instructions-and-focus-on-key-details\/"},"modified":"2026-06-26T13:00:00","modified_gmt":"2026-06-26T13:00:00","slug":"llms-help-robots-understand-vague-instructions-and-focus-on-key-details","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2026\/06\/26\/llms-help-robots-understand-vague-instructions-and-focus-on-key-details\/","title":{"rendered":"LLMs help robots understand vague instructions and focus on key details"},"content":{"rendered":"<p>Author: Alex Shipps | MIT CSAIL<\/p>\n<div>\n<p dir=\"ltr\">Imagine working at a warehouse or office sometime in the near future, and you\u2019re asked to help a new trainee learn the basics of their job. The catch: It\u2019s a robot. To teach them, you might want to play a game of \u201cshow and tell\u201d \u2014 that is, physically showing how to do something a few different ways, while also explaining what you\u2019re doing.<\/p>\n<p dir=\"ltr\">Let\u2019s say you ask the robot to place some coffee on your desk without disturbing you during a Zoom call. You\u2019d prefer that the robot not get too close to you or your laptop, so that it doesn\u2019t interrupt your meeting. To enable this behavior, the robot should be trained on data that clearly demonstrates the full task. Computer scientists have tried to explain manipulation tasks to robots by recording lots of physical demonstrations or writing extensive directions. 
But without both, the machine is likely to misunderstand what it needs to do.<\/p>\n<p>It\u2019s laborious for humans to do all that showing and telling, so researchers at MIT\u2019s Computer Science and Artificial Intelligence Laboratory (CSAIL) have automated the process of teaching a robot, clarifying instructions along the way and using nearly five times less demonstration data. Their \u201cMasked Inverse Reinforcement Learning\u201d (Masked IRL) approach uses a large language model (LLM) to elaborate on ambiguous prompts based on the data collected from a user\u2019s demo. Another LLM then narrows down which details an algorithm should incorporate into a motion plan, so that a robot can safely complete chores in homes, offices, and factories.<\/p>\n<p>\u201cOur approach could come in handy when a human interacts with a robot but doesn\u2019t want to spell out all the details of a task,\u201d says MIT PhD student and CSAIL researcher Minyoung Hwang, who is a lead author on a <a href=\"https:\/\/arxiv.org\/abs\/2511.14565\">paper<\/a> presenting the project. \u201cWe\u2019re minimizing human effort by enabling machines to get to the bottom of what users really want.\u201d<\/p>\n<p dir=\"ltr\">According to Hwang, Masked IRL can help robots safely maneuver in settings that contain elements a human might not mention in a prompt but that are nonetheless crucial. For example, a machine grabbing you a snack from the kitchen may not know to avoid bumping into your laptop. Likewise, a factory robot placing items into different boxes must carefully navigate around shelves.<\/p>\n<p dir=\"ltr\">To learn new tasks in these situations, Masked IRL uses the robot\u2019s sensors to capture information about its surroundings. The same sensors also log each movement of a kinesthetic demonstration, a training approach in which a human physically guides a robot through a specific action. 
It\u2019s sort of like being the machine\u2019s physical therapist, bending joints in a particular direction to show a robot how to grab, move, and place objects.<\/p>\n<p dir=\"ltr\">MIT\u2019s system then calls on an LLM to compare this sequence of motions (called a trajectory) to the shortest possible path. The model also spells out anything ambiguous in the prompt, turning a request like \u201cstay close\u201d into \u201cstay close to the surface of the table.\u201d From the trajectory comparison and the clarified directions, the LLM works out why the demonstrated motions matter for the task.<\/p>\n<p>A second LLM then evaluates details of the environment, such as the position of obstacles and the shape of the robot\u2019s target object. It scores each element as either a \u201c1\u201d (relevant to the task at hand) or a \u201c0\u201d (irrelevant) and \u201cmasks,\u201d or ignores, everything scored \u201c0.\u201d For example, whether a user was leaning on a table during a demonstration has no bearing on the task, so it would score a \u201c0.\u201d An algorithm then incorporates every detail scored \u201c1\u201d into the final action plan.<\/p>\n<p>These masks gave Masked IRL a key advantage over comparable baselines in both 3D simulations and real-world demos, because they taught the robot which information to prioritize. With the researchers\u2019 system, virtual and real robots alike skillfully maneuvered objects around obstacles, such as moving a coffee mug around a laptop to different spots on a table. In these tasks, Masked IRL correctly identified preferences that users hadn\u2019t explicitly stated in their prompts up to 15 percent more often than the baselines.<\/p>\n<p>During simulation experiments, CSAIL researchers also found that Masked IRL was a fast learner: It needed fewer demos than its baselines to learn how to move the mug. 
They also found that robots performed better when an LLM clarified the instructions than when they had to follow a vague request directly.<\/p>\n<p>This more focused approach also transferred well to a real robotic arm, which executed prompts the system hadn\u2019t seen during training. After being trained on 50 kinesthetic demonstrations, the robot carefully moved a cup toward a human without colliding with a user\u2019s computer, an obstacle it learned to avoid by elaborating on the more general request to \u201cstay away.\u201d It also wiped down a table while \u201cstaying close\u201d to it, and handed a user a bag of chips while \u201cstaying away\u201d from both a human and a table.<\/p>\n<p dir=\"ltr\">Masked IRL senses and explains what users leave unsaid, but soon, it might \u201csee\u201d it too. CSAIL researchers plan to make their approach more dynamic by equipping it with cameras, so that a robot can capture images of its surroundings and then highlight and focus on specific nearby elements. For example, if you asked the machine to pick up a toy, it might notice some bananas nearby and ignore them before handling its target object.<\/p>\n<p dir=\"ltr\">Hwang wrote the paper with three CSAIL colleagues: PhD student Alexandra Forsey-Smerek \u201920, SM \u201922; postdoc Nathaniel Dennler; and MIT Assistant Professor Andreea Bobu, who is a member of the Department of Aeronautics and Astronautics and CSAIL. Their work was supported in part by the Tata Group via the MIT Generative AI Impact Consortium Award and by the Department of Defense. 
They\u2019ll present the project at the 2026 IEEE International Conference on Robotics and Automation in June.<\/p>\n<\/div>\n<p><a href=\"https:\/\/news.mit.edu\/2026\/llms-help-robots-understand-vague-instructions-and-focus-key-details-0626\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Alex Shipps | MIT CSAIL Imagine working at a warehouse or office sometime in the near future, and you\u2019re asked to help a new [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2026\/06\/26\/llms-help-robots-understand-vague-instructions-and-focus-on-key-details\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":471,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/9155"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=9155"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/9155\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/469"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=9155"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=9155"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=9155"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}