{"id":8326,"date":"2025-07-21T19:00:00","date_gmt":"2025-07-21T19:00:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2025\/07\/21\/a-new-way-to-edit-or-generate-images\/"},"modified":"2025-07-21T19:00:00","modified_gmt":"2025-07-21T19:00:00","slug":"a-new-way-to-edit-or-generate-images","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2025\/07\/21\/a-new-way-to-edit-or-generate-images\/","title":{"rendered":"A new way to edit or generate images"},"content":{"rendered":"<p>Author: Steve Nadis | MIT CSAIL | Laboratory for Information and Decision Systems<\/p>\n<div>\n<p>AI image generation \u2014 which relies on neural networks to create new images from a variety of inputs, including text prompts \u2014 is projected to become a billion-dollar industry by the end of this decade. Even with today\u2019s technology, if you wanted to make a fanciful picture of, say, a friend planting a flag on Mars or heedlessly flying into a black hole, it could take less than a second. However, before they can perform tasks like that, image generators are commonly trained on massive datasets containing millions of images that are often paired with associated text. Training these generative models can be an arduous chore that takes weeks or months, consuming vast computational resources in the process.<\/p>\n<p>But what if it were possible to generate images through AI methods without using a generator at all? That real possibility, along with other intriguing ideas, was described in a <a href=\"https:\/\/arxiv.org\/pdf\/2506.08257\">research paper<\/a> presented at the International Conference on Machine Learning (ICML 2025), which was held in Vancouver, British Columbia, earlier this summer. 
The paper, describing novel techniques for manipulating and generating images, was written by Lukas Lao Beyer, a graduate student researcher in MIT\u2019s Laboratory for Information and Decision Systems (LIDS); Tianhong Li, a postdoc at MIT\u2019s Computer Science and Artificial Intelligence Laboratory (CSAIL); Xinlei Chen of Facebook AI Research; Sertac Karaman, an MIT professor of aeronautics and astronautics and the director of LIDS; and Kaiming He, an MIT associate professor of electrical engineering and computer science.<\/p>\n<p>This group effort had its origins in a class project for a graduate seminar on deep generative models that Lao Beyer took last fall. In conversations during the semester, it became apparent to both Lao Beyer and He, who taught the seminar, that this research had real potential, which went far beyond the confines of a typical homework assignment. Other collaborators were soon brought into the endeavor.<\/p>\n<p>The starting point for Lao Beyer\u2019s inquiry was a June 2024 paper, written by researchers from the Technical University of Munich and the Chinese company ByteDance, which introduced a new way of representing visual information called a one-dimensional tokenizer. With this device, which is also a kind of neural network, a 256&#215;256-pixel image can be translated into a sequence of just 32 numbers, called tokens. \u201cI wanted to understand how such a high level of compression could be achieved, and what the tokens themselves actually represented,\u201d says Lao Beyer.<\/p>\n<p>The previous generation of tokenizers would typically break up the same image into an array of 16&#215;16 tokens \u2014 with each token encapsulating information, in highly condensed form, that corresponds to a specific portion of the original image. The new 1D tokenizers can encode an image more efficiently, using far fewer tokens overall, and these tokens are able to capture information about the entire image, not just a single quadrant. 
Each of these tokens, moreover, is a 12-digit number consisting of 1s and 0s, allowing for 2<sup>12<\/sup> (or about 4,000) possibilities altogether. \u201cIt\u2019s like a vocabulary of 4,000 words that makes up an abstract, hidden language spoken by the computer,\u201d He explains. \u201cIt\u2019s not like a human language, but we can still try to find out what it means.\u201d<\/p>\n<p>That\u2019s exactly what Lao Beyer had initially set out to explore \u2014 work that provided the seed for the ICML 2025 paper. The approach he took was pretty straightforward. If you want to find out what a particular token does, Lao Beyer says, \u201cyou can just take it out, swap in some random value, and see if there is a recognizable change in the output.\u201d Replacing one token, he found, changes the image quality, turning a low-resolution image into a high-resolution image or vice versa. Another token affected the blurriness in the background, while another still influenced the brightness. He also found a token that\u2019s related to the \u201cpose,\u201d meaning that, in the image of a robin, for instance, the bird\u2019s head might shift from right to left.<\/p>\n<p>\u201cThis was a never-before-seen result, as no one had observed visually identifiable changes from manipulating tokens,\u201d Lao Beyer says. The finding raised the possibility of a new approach to editing images. And the MIT group has shown, in fact, how this process can be streamlined and automated, so that tokens don\u2019t have to be modified by hand, one at a time.<\/p>\n<p>He and his colleagues achieved an even more consequential result involving image generation. A system capable of generating images normally requires a tokenizer, which compresses and encodes visual data, along with a generator that can combine and arrange these compact representations in order to create novel images. The MIT researchers found a way to create images without using a generator at all. 
Their new approach makes use of a 1D tokenizer and a so-called detokenizer (also known as a decoder), which can reconstruct an image from a string of tokens. With guidance provided by an off-the-shelf neural network called CLIP \u2014 which cannot generate images on its own, but can measure how well a given image matches a certain text prompt \u2014 the team was able to convert an image of a red panda, for example, into a tiger. In addition, they could create images of a tiger, or any other desired form, starting completely from scratch \u2014 from a situation in which all the tokens are initially assigned random values (and then iteratively tweaked so that the reconstructed image increasingly matches the desired text prompt).<\/p>\n<p>The group demonstrated that with this same setup \u2014 relying on a tokenizer and detokenizer, but no generator \u2014 they could also do \u201cinpainting,\u201d which means filling in parts of images that had somehow been blotted out. Avoiding the use of a generator for certain tasks could lead to a significant reduction in computational costs because generators, as mentioned, normally require extensive training.<\/p>\n<p>What might seem odd about this team\u2019s contributions, He explains, \u201cis that we didn\u2019t invent anything new. We didn\u2019t invent a 1D tokenizer, and we didn\u2019t invent the CLIP model, either. But we did discover that new capabilities can arise when you put all these pieces together.\u201d<\/p>\n<p>\u201cThis work redefines the role of tokenizers,\u201d comments Saining Xie, a computer scientist at New York University. \u201cIt shows that image tokenizers \u2014 tools usually used just to compress images \u2014 can actually do a lot more. 
The fact that a simple (but highly compressed) 1D tokenizer can handle tasks like inpainting or text-guided editing, without needing to train a full-blown generative model, is pretty surprising.\u201d<\/p>\n<p>Zhuang Liu of Princeton University agrees, saying that the work of the MIT group\u00a0\u201cshows that we can generate and manipulate the images in a way that is much easier than we previously thought. Basically, it demonstrates that image generation can be a byproduct of a very effective image compressor, potentially reducing the cost of generating images several-fold.\u201d<\/p>\n<p>There could be many applications outside the field of computer vision, Karaman suggests. \u201cFor instance,\u00a0we could consider tokenizing the actions of robots or self-driving cars in the same way, which may rapidly broaden the impact of this work.\u201d<\/p>\n<p>Lao Beyer is thinking along similar lines,\u00a0noting that the\u00a0extreme amount of compression afforded by 1D tokenizers allows you to do \u201csome amazing things,\u201d which could be applied to other fields. For example, in the area of self-driving cars, which is one of his research interests, the tokens could represent, instead of images, the different routes that a vehicle might take.<\/p>\n<p>Xie is also intrigued by the applications that may come from these innovative ideas. 
\u201cThere are some really cool use cases this could unlock,\u201d he says.\u00a0<\/p>\n<\/div>\n<p><a href=\"https:\/\/news.mit.edu\/2025\/new-way-edit-or-generate-images-0721\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Steve Nadis | MIT CSAIL | Laboratory for Information and Decision Systems AI image generation \u2014 which relies on neural networks to create new [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2025\/07\/21\/a-new-way-to-edit-or-generate-images\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":471,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/8326"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=8326"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/8326\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/463"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=8326"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/ca
tegories?post=8326"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=8326"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}