{"id":9096,"date":"2026-06-03T21:00:00","date_gmt":"2026-06-03T21:00:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2026\/06\/03\/teaching-ai-agents-to-ask-better-questions-by-playing-battleship\/"},"modified":"2026-06-03T21:00:00","modified_gmt":"2026-06-03T21:00:00","slug":"teaching-ai-agents-to-ask-better-questions-by-playing-battleship","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2026\/06\/03\/teaching-ai-agents-to-ask-better-questions-by-playing-battleship\/","title":{"rendered":"Teaching AI agents to ask better questions by playing \u201cBattleship\u201d"},"content":{"rendered":"<p>Author: Alex Shipps | MIT CSAIL<\/p>\n<div>\n<p dir=\"ltr\">In 2026, the hype for artificial intelligence agents is louder than ever before. These semi-autonomous programs can \u201cthink\u201d and execute well-defined tasks in areas like customer service and software development, typically using language models (LMs). But fields like medical diagnosis and scientific discovery require them to inquire about a vast range of solutions in uncertain environments, which LMs struggle with.<\/p>\n<p>Researchers at MIT\u2019s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University\u2019s School of Engineering and Applied Sciences (SEAS) peered deeper into LMs to understand their main issues in high-stakes settings. Their test: \u201cBattleship,\u201d a classic guessing game that\u2019s helped cognitive scientists study how humans seek information.\u00a0<\/p>\n<p dir=\"ltr\">CSAIL and SEAS scholars added a twist by reframing the game around asking and answering natural language questions. 
In their \u201cCollaborative Battleship\u201d game, one participant is a \u201ccaptain\u201d who inquires about where hidden ships are, while their teammate plays the \u201cspotter\u201d by responding to those questions in real time.<\/p>\n<p dir=\"ltr\">The researchers first had over 40 humans play the game together, collecting their questions and yes-no answers to build the \u201cBattleshipQA\u201d dataset. These results were a helpful point of comparison when the team tested state-of-the-art LMs (like GPT-5) and smaller models (like Llama 4 Scout) on their game. Without training the models beforehand, they found that top LMs can \u201cbeat\u201d humans at \u201cBattleship\u201d \u2014 that is, complete the game in fewer turns \u2014 but smaller systems are far less rational.<\/p>\n<p dir=\"ltr\">The chief issue was that many models are simply not adept at coming up with useful questions. To get LMs to inquire in ways that reveal more information about hidden ships, the researchers gave each model a Monte Carlo inference strategy, which carefully measures the likelihood of different options being correct with each response. The result: AI models that can beat regular players at \u201cBattleship,\u201d regardless of scale.<\/p>\n<p dir=\"ltr\">Perhaps the most striking results were Llama 4 Scout\u2019s gains. As a relatively small LM, it beat humans only 8 percent of the time. But with refinements to its inference strategy, the model reached a \u201cBattleship\u201d win rate of 82 percent versus humans. This careful and efficient style of asking questions also enabled the model to outpace a frontier model (GPT-5), while operating at around 1 percent of its cost.<\/p>\n<p dir=\"ltr\">On top of this improvement, the researchers shrank the gap between humans and LMs in answering questions. While GPT-5 was a reliable spotter that helped models finish games faster, smaller systems had a bad habit of giving the wrong answers about where ships were hidden. 
The models saw an accuracy boost of 15 percent on average when they began converting questions into code that explicitly tells them how to verify their answers (for example, having the model run a quick search of an area when asked if a ship was there).\u00a0<\/p>\n<p dir=\"ltr\">\u201cToday\u2019s language models are primarily optimized to answer complex queries, but it\u2019s less clear whether they learn to ask good questions for themselves,\u201d says MIT PhD student and CSAIL researcher Gabriel Grand SM \u201923, who is a lead author on a\u00a0<a href=\"https:\/\/openreview.net\/forum?id=EQhUvWH78U\">paper<\/a> about the work. \u201cOur work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a \u2018world model,\u2019 they ask better questions and make discoveries more efficiently.\u201d<\/p>\n<p><strong>A sea change for LMs<\/strong><\/p>\n<p dir=\"ltr\">The team\u2019s first focus was getting LMs to ask better questions. By implementing Monte Carlo inference strategies, the LMs reason about potential guesses as individual particles. The ones that appear more valid with each answer from the spotter would be weighted more heavily, sort of like game balls that inflate or deflate each turn. With this more calculated, adaptive approach, the captain could make inquiries that extracted considerably more info from the spotter.<\/p>\n<p dir=\"ltr\">The scientists then turned to the widely used programming language Python to help out AI spotters. Each question the captain asked was automatically converted into an encoded command. For example, a question like, \u201cIs there a ship in column one that spans two rows?\u201d turns into instructions for the spotter LM to search the area in question and assess how wide the digital game piece is. By giving the model clear directions in a language it understands particularly well, each system gave correct answers considerably more often. 
The lightweight system GPT-4o-mini saw a nearly 30 percent performance bump, for instance, and even the large model Claude 4 Opus jumped about eight points.<\/p>\n<p dir=\"ltr\">\u201cThe field has seen a lot of success from \u2018auto-formalization\u2019 strategies, in which LMs generate code to verify their solutions,\u201d says senior author Jacob Andreas, an MIT electrical engineering and computer science associate professor and CSAIL principal investigator. \u201cWhat I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs\u2019 exploration and information gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving.\u201d<\/p>\n<p dir=\"ltr\"><strong>Let\u2019s play something else<\/strong><\/p>\n<p dir=\"ltr\">But how would this approach fare in other board games? The team tested their newly equipped LMs at \u201cGuess Who?\u201d, where large and small models skillfully whittled down 100 options to correctly guess which hidden character had been chosen. Llama 4 Scout was successful 30 percent of the time, but after Grand and his colleagues\u2019 tweaks, it completed the task on over 72 percent of its runs. Meanwhile, GPT-4o leapt from 62 percent to 90 percent. GPT-5 was the spotter in each game to ensure questions were answered as accurately as possible.<\/p>\n<p dir=\"ltr\">While LMs have made promising progress in both games, there\u2019s room for improvement. For instance, the models still struggle to answer complex questions, compared to humans. OpenAI researcher, recent Harvard graduate, and coauthor Valerio Pepe adds that \u201cGPT-5 can beat your average \u2018Battleship\u2019 player, and gets a hair better with our methods. 
However, expert players are still hard to beat for all models, unlike in chess, where even top players don\u2019t succeed against AI systems.\u201d<\/p>\n<p dir=\"ltr\">The researchers\u2019 findings show that AI agents have untapped potential in \u201cneedle-in-a-haystack\u201d discovery \u2014 navigating a massive space of options to find a rare solution to scientific challenges. While improved information-seeking skills would make them excellent research assistants for tasks like identifying a compound\u2019s molecular structure, the researchers caution that \u201cCollaborative Battleship\u201d is a somewhat simple test bed. They\u2019d like to test LMs in more complex settings, where the systems have to consider far more options.<\/p>\n<p dir=\"ltr\">Grand also plans to have humans and AI models collaborate to study whether they work better together. The models might also benefit from a bit of fine-tuning on game simulations, and with more computing power, LMs would have more advanced inference capabilities to predict how a game will evolve.<\/p>\n<p>\u201cAs AI systems become more agentic, the hardest problems turn out to be social ones: tracking common ground, resolving misunderstandings, and adapting to different partners over time,\u201d says Robert Hawkins, assistant professor of linguistics at Stanford University, who wasn\u2019t involved in the paper. \u201cThis work elegantly captures these phenomena in a controlled collaborative setting, and makes a compelling case that the real bottleneck for AI agents isn\u2019t just the calculation of optimal questions, but the pragmatic reasoning needed to make the most of their answers.\u201d<\/p>\n<p dir=\"ltr\">Grand and Pepe wrote the paper with two CSAIL principal investigators: MIT Associate Professor Jacob Andreas and MIT Professor Joshua Tenenbaum. 
Their work was supported, in part, by the MIT Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the FinTechAI@CSAIL initiative, a Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation. They showcased their paper as an oral presentation at the International Conference on Learning Representations (ICLR) in April.<\/p>\n<\/div>\n<p><a href=\"https:\/\/news.mit.edu\/2026\/teaching-ai-agents-ask-better-questions-playing-battleship-0603\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Alex Shipps | MIT CSAIL In 2026, the hype for artificial intelligence agents is louder than ever before. These semi-autonomous programs can \u201cthink\u201d and [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2026\/06\/03\/teaching-ai-agents-to-ask-better-questions-by-playing-battleship\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":469,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/9096"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=9096"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/9096\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/468"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=9096"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=9096"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=9096"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}