The Squeezing Effect: Why Your Aligned AI Model Gets Worse

Source: Dev.to

You've spent weeks optimizing your language model. You've applied Direct Preference Optimization (DPO), run the alignment pipeline, and expected performance gains. Instead, your model's confidence drops across the board, even on the outputs you wanted to improve. This isn't a bug. It's the squeezing effect, and understanding it changes how you approach AI implementation.

## What's Actually Happening

When you finetune an LLM with off-policy preference optimization, the algorithm applies two competing gradients:

- A positive push on preferred responses ("get better at this").
- A negative push on rejected responses ("get worse at this").

In theory, this should widen the margin between good and bad outputs. In practice, something counterintuitive occurs. The negative gradient doesn't distribute the lost probability mass evenly across all other tokens. Instead, it concentrates almost all of it onto whichever token already had the highest confidence. This is the squeezing effect.

## Why This Matters

Imagine a balloon filled with air. Press down on one side (the rejected response) and the air doesn't spread out evenly; it gets squeezed toward the point of least resistance, the highest part of the balloon.

After pretraining, a typical LLM's output distribution is highly peaked. The model is already very confident about certain tokens; they represent the core knowledge it learned. Rejected responses typically sit in low-probability regions. When you apply a large negative gradient to something that is already unlikely, you don't encourage the model to explore better alternatives. Instead, you force the displaced probability mass toward the peak.

The result? The model's outputs become increasingly repetitive, stereotypical, and hallucination-prone. It isn't actually learning to prefer better outputs. It's just becoming more extreme: more peaked, narrower, less diverse.

Research published at ICLR 2025 demonstrates this effect across multiple model sizes and datasets. The longer you train this way, the worse it gets: models finetuned for longer before preference optimization degrade more strongly, because their initial distribution is already peakier.
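To see the mechanism concretely, here is a small numerical sketch (illustrative, made-up logits, not the paper's code). It takes a peaked softmax distribution, applies one gradient step that lowers the log-probability of a low-probability "rejected" token, and prints where the probability mass ends up.

```python
# Minimal sketch of the squeezing effect with toy numbers (assumed illustration).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# A peaked distribution, like a well-pretrained LLM's next-token distribution.
logits = np.array([5.0, 2.0, 1.0, 0.0, -2.0])
rejected = 4                          # the rejected token sits in a low-probability valley
p = softmax(logits)

# d log p[rejected] / d logits = one_hot(rejected) - p.
# A "negative push" on the rejected token means stepping against this gradient.
lr = 1.0
grad = -p.copy()
grad[rejected] += 1.0                 # one_hot(rejected) - p
logits_after = logits - lr * grad     # step that lowers log p[rejected]

p_after = softmax(logits_after)
for i, (before, after) in enumerate(zip(p, p_after)):
    tag = " <- rejected" if i == rejected else (" <- peak" if i == 0 else "")
    print(f"token {i}: {before:.4f} -> {after:.4f}{tag}")
```

In this toy run the rejected token loses less than a thousandth of a point of probability (about 0.0008 to 0.0001), yet the peak token jumps from roughly 0.93 to 0.97: the mass squeezed out of every non-peak token piles onto the token that was already most confident.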
## The Fix: Lift the Valley

There's a remarkably simple solution: train on both the preferred and the rejected responses during the supervised finetuning (SFT) phase, before running DPO (a minimal sketch of this recipe appears at the end of this post).

Doing so lifts rejected responses out of the probability valley and into a moderate-confidence region. When DPO later applies its negative gradient, it's working on a flatter, gentler slope. The squeezing effect still occurs (it's built into how softmax works), but the damage is minimal because you're no longer pushing from an extreme low.

The results speak for themselves. Models trained with this approach show:

- 8-15% improvement in win-rate evaluations (ChatGPT and Claude 3 paired comparisons)
- Fewer degenerative responses (repetitive phrases, hallucinations)
- Better performance across multiple model sizes
- Consistent gains without additional computational overhead

## Why This Matters for Your AI Implementation

If you're deploying LLMs in production, whether for customer service, content generation, document analysis, or retail automation, understanding these dynamics is critical. Most off-the-shelf finetuning approaches use standard DPO and assume the algorithm works as theoretically intended. But real-world model geometry introduces complications.

This isn't about choosing a better training algorithm. It's about recognizing that the geometry of your model's confidence landscape influences alignment effectiveness. A small preprocessing step (including rejected responses during SFT) costs nothing but pays dividends.

The takeaway: alignment isn't just mathematics on paper. It's substrate-specific. Your model's actual learned representations matter. The way probability mass distributes through the softmax layer matters.

When building AI systems, these details compound. Small geometric insights become large performance gaps. The squeezing effect shows why on-policy methods consistently outperform off-policy approaches. It also explains why some implementations hit unexpected walls despite theoretically sound training pipelines.

Next time you're finetuning an LLM, remember: it's not just about the loss function. It's about where you're applying pressure and what geometry you're pushing against.
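To make the fix concrete, here is a minimal sketch of the two-phase recipe. The data layout (`prompt` / `chosen` / `rejected` dicts) and the helper name are illustrative assumptions, not a specific library's API; any SFT and DPO trainer can consume the resulting datasets.

```python
# Sketch of "lifting the valley" (assumed data layout, not the paper's exact pipeline):
# before running DPO, include BOTH the preferred and the rejected responses in the SFT data.

def build_sft_examples(preference_data):
    """preference_data: iterable of dicts with 'prompt', 'chosen', 'rejected' keys."""
    sft_examples = []
    for row in preference_data:
        # Standard SFT would keep only the chosen response.
        sft_examples.append({"prompt": row["prompt"], "completion": row["chosen"]})
        # The fix: also train on the rejected response during SFT, so DPO's negative
        # gradient later pushes against a moderate-confidence region, not a deep valley.
        sft_examples.append({"prompt": row["prompt"], "completion": row["rejected"]})
    return sft_examples

preference_data = [
    {"prompt": "Summarize the report.",
     "chosen": "The report finds revenue grew 12% year over year.",
     "rejected": "Revenue revenue revenue grew grew..."},
]

sft_examples = build_sft_examples(preference_data)
# Phase 1: run ordinary SFT on sft_examples (chosen + rejected completions).
# Phase 2: run standard DPO on the original (prompt, chosen, rejected) triples.
print(f"built {len(sft_examples)} SFT examples from {len(preference_data)} preference pair(s)")
```

The point isn't the code itself; it's that phase 1 deliberately gives rejected responses some probability mass, so phase 2's negative gradient has somewhere gentler to push.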