<p><strong>Training LLMs to self-detoxify their language</strong></p>
<p>Author: Lauren Hinkel | MIT-IBM Watson AI Lab</p>
<p>Published: April 14, 2025</p>
<div>
<p>As we mature from childhood, our vocabulary — as well as the ways we use it — grows, and our experiences become richer, allowing us to think, reason, and interact with others with specificity and intention. Accordingly, our word choices evolve to align with our personal values, ethics, cultural norms, and views. Over time, most of us develop an internal “guide” that enables us to learn the context behind a conversation; it also frequently directs us away from sharing information and sentiments that are, or could be, harmful or inappropriate. As it turns out, large language models (LLMs) — which are trained on extensive, public datasets and therefore often have biases and toxic language baked in — can gain a similar capacity to moderate their own language.</p>
<p>A new method from MIT, the MIT-IBM Watson AI Lab, and IBM Research, called self-disciplined autoregressive sampling (SASA), allows LLMs to detoxify their own outputs without sacrificing fluency.</p>
<p>Unlike other detoxifying methods, this decoding algorithm learns a boundary between toxic and nontoxic subspaces within the LLM’s own internal representation, without altering the model’s parameters, retraining it, or using an external reward model.
Then, during inference, the algorithm assesses the toxicity of the partially generated phrase — the tokens (words) already generated and accepted, along with each potential new token that could reasonably be chosen — by its proximity to the classifier boundary. Next, it selects a word option that places the phrase in the nontoxic space, ultimately offering a fast and efficient way to generate less-toxic language.</p>
<p>“We wanted to find out a way with any existing language model [that], during the generation process, the decoding can be subject to some human values; the example here we are taking is toxicity,” says the study’s lead author Ching-Yun “Irene” Ko PhD ’24, a former graduate intern with the MIT-IBM Watson AI Lab and a current research scientist at IBM’s Thomas J. Watson Research Center in New York.</p>
<p>Ko’s co-authors include Luca Daniel, professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and Ko’s graduate advisor; and several members of the MIT-IBM Watson AI Lab and/or IBM Research — Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, and Tejaswini Pedapati. The work will be presented at the International Conference on Learning Representations.</p>
<p><strong>Finding the “guardrails”</strong></p>
<p>The training resources behind LLMs almost always include content collected from public spaces like the internet and other readily available datasets. As such, curse words and bullying or unpalatable language are a component, although some of it appears in the context of literary works. It then follows that LLMs can innately produce — or be tricked into generating — dangerous and/or biased content, which often contains disagreeable words or hateful language, even from innocuous prompts.
Further, it’s been found that they can learn and amplify language that’s not preferred, or is even detrimental, for many applications and downstream tasks — leading to the need for mitigation or correction strategies.</p>
<p>There are many ways to achieve robust language generation that’s fair and value-aligned. Some methods retrain the LLM with a sanitized dataset, which is costly, takes time, and may alter the LLM’s performance; others employ external reward models to guide decoding, as in guided sampling or beam search, which take longer to run and require more memory. In the case of SASA, Ko, Daniel, and the IBM Research team developed a method that leverages the autoregressive nature of LLMs and, using a decoding-based strategy during the LLM’s inference, gradually steers the generation — one token at a time — away from unsavory or undesired outputs and toward better language.</p>
<p>The research group achieved this by building a linear classifier that operates on the learned subspace from the LLM’s embedding. When LLMs are trained, words with similar meanings are placed close together in vector space and farther from dissimilar words; the researchers hypothesized that an LLM’s embedding would therefore also capture contextual information, which could be used for detoxification. The researchers used datasets that contained sets of a prompt (the first half of a sentence or thought), a response (the completion of that sentence), and a human-attributed annotation, such as toxic or nontoxic, or preferred or not preferred, with continuous labels from 0 to 1 denoting increasing toxicity.</p>
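As a rough, hypothetical sketch of learning such a boundary (not the authors' implementation — the embeddings, labels, training loop, and all numbers below are synthetic stand-ins; the actual work fits a Bayes-optimal classifier on real LLM sentence embeddings), a plain logistic-regression classifier over labeled embeddings might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for sentence embeddings: in the real setting these would
# come from the LLM, paired with human toxicity annotations.
dim = 16
w_true = rng.normal(size=dim)
X = rng.normal(size=(200, dim))
y = (X @ w_true > 0).astype(float)  # hypothetical labels: 1 = nontoxic, 0 = toxic

# Fit a plain logistic-regression classifier by full-batch gradient descent.
w, b, lr = np.zeros(dim), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of "nontoxic"
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

# The signed score X @ w + b now separates the two subspaces: positive values
# fall on the nontoxic side, negative values on the toxic side.
scores = X @ w + b
accuracy = float(np.mean((scores > 0) == (y == 1)))
```

The sign convention here matches the article's description: positive classifier values correspond to the nontoxic space, negative to the toxic space.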
<p>A Bayes-optimal classifier was then applied to learn and, figuratively, draw a line between the binary subspaces within the sentence embeddings, represented by positive values (nontoxic space) and negative values (toxic space).</p>
<p>The SASA system then works by re-weighting the sampling probability of each newest potential token, based on its value and the generated phrase’s distance to the classifier boundary, with the goal of remaining close to the original sampling distribution.</p>
<p>To illustrate: if a user is generating potential token #12 in a sentence, the LLM will look over its full vocabulary for a reasonable word, based on the 11 words that came before it, and, using top-k or top-p filtering, it will produce roughly 10 tokens to select from. SASA then evaluates each of those tokens in the partially completed sentence for its proximity to the classifier boundary (i.e., the value of tokens 1-11, plus each potential token 12). Tokens that produce sentences in the positive space are encouraged, while those in the negative space are penalized. Additionally, the farther the phrase lands from the classifier boundary, the stronger the impact.</p>
<p>“The goal is to change the autoregressive sampling process by re-weighting the probability of good tokens. If the next token is likely to be toxic given the context, then we are going to reduce the sampling probability for those prone to be toxic tokens,” says Ko. The researchers chose to do it this way “because the things we say, whether it’s benign or not, is subject to the context.”</p>
<p><strong>Tamping down toxicity for value matching</strong></p>
<p>The researchers evaluated their method against several baseline interventions with three LLMs of increasing size, all of them autoregressive transformers: GPT2-Large, Llama2-7b, and Llama 3.1-8b-Instruct, with 762 million, 7 billion, and 8 billion parameters, respectively.</p>
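The re-weighting step illustrated above can be sketched in a few lines (a schematic only, with made-up logits, classifier values, and an assumed steering-strength knob; SASA's actual weighting follows the paper's formulation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decoding step: ~10 candidate next tokens survive top-k/top-p
# filtering, each with a model logit and a classifier value for the phrase
# extended by that candidate (positive = nontoxic side, negative = toxic side).
logits = rng.normal(size=10)
classifier_values = rng.normal(size=10)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

original_probs = softmax(logits)

# Re-weight: nudge each candidate's logit by its classifier value, scaled by a
# steering strength. Candidates on the nontoxic side gain probability, those on
# the toxic side lose it, and the farther from the boundary, the bigger the shift.
beta = 2.0  # assumed knob: beta = 0 recovers the original sampling distribution
reweighted_probs = softmax(logits + beta * classifier_values)

# Sample the next token from the re-weighted distribution.
next_token = int(rng.choice(len(reweighted_probs), p=reweighted_probs))
```

Because the adjustment only shifts logits before the softmax, the re-weighted distribution stays anchored to the model's original preferences, which is the stated goal of remaining close to the original sampling distribution.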
<p>For each prompt, the LLM was tasked with completing the sentence/phrase 25 times, and PerspectiveAPI scored the completions from 0 to 1, with anything over 0.5 counted as toxic. The team looked at two metrics: the average maximum toxicity score over the 25 generations for all the prompts, and the toxic rate, the probability of producing at least one toxic phrase in 25 generations. Reduced fluency (and therefore increased perplexity) was also analyzed. SASA was tested on completing prompts from the RealToxicityPrompts (RTP), BOLD, and AttaQ datasets, which contain naturally occurring English sentence prompts.</p>
<p>The researchers ramped up the complexity of their detoxification trials for SASA, beginning with nontoxic prompts from the RTP dataset and looking for harmful sentence completions. Then they escalated to more challenging RTP prompts that were more likely to produce concerning results, and also applied SASA to the instruction-tuned model to assess whether their technique could further reduce unwanted outputs. They also used the BOLD and AttaQ benchmarks to examine the general applicability of SASA for detoxification. With the BOLD dataset, the researchers further looked for gender bias in language generations and tried to achieve a balanced toxic rate between the genders. Lastly, the team looked at runtime, memory usage, and how SASA could be combined with word filtering to achieve healthy and/or helpful language generation.</p>
<p>“If we think about how human beings think and react in the world, we do see bad things, so it’s not about allowing the language model to see only the good things. It’s about understanding the full spectrum — both good and bad,” says Ko, “and choosing to uphold our values when we speak and act.”</p>
<p>Overall, SASA achieved significant reductions in toxic language generation, performing on par with RAD, a state-of-the-art external reward model technique.</p>
<p>However, it was universally observed that stronger detoxification came with a decrease in fluency. Before the intervention, the LLMs produced more toxic responses for female-labeled prompts than for male-labeled ones; SASA, however, significantly cut down the harmful responses as well, making them more equalized. Similarly, word filtering on top of SASA did markedly lower toxicity levels, but it also hindered the LLM’s ability to respond coherently.</p>
<p>A great aspect of this work is that it’s a well-defined, constrained optimization problem, says Ko, meaning that the balance between open language generation that sounds natural and the need to reduce unwanted language can be achieved and tuned.</p>
<p>Further, Ko says, SASA could work well for multiple attributes in the future: “For human beings, we have multiple human values. We don’t want to say toxic things, but we also want to be truthful, helpful, and loyal … If you were to fine-tune a model for all of these values, it would require more computational resources and, of course, additional training.” On account of the lightweight manner of SASA, it could easily be applied in these circumstances: “If you want to work with multiple values, it’s simply checking the generation’s position in multiple subspaces.
It only adds marginal overhead in terms of the compute and parameters,” says Ko, leading to more positive, fair, and principle-aligned language.</p>
<p>This work was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.</p>
</div>
<p><a href="https://news.mit.edu/2025/training-llms-self-detoxify-their-language-0414">Go to Source</a></p>