by Matthew Siegel and Miles Turpin

A fundamental challenge in AI alignment is reward hacking—when AI models learn to optimize for their rewards in unintended ways. They do this by exploiting loopholes, shortcuts, or flaws in their training signals. They’re learning how to get their “rewards,” but without solving the tasks in the intended way, which can be dangerous in high-stakes applications. Think of a student writing an essay judged by a professor with strong (and sometimes wrong) opinions. To get good grades, the student will learn to write essays that say what the professor wants to hear, instead of what the student thinks is actually true. Eliminating this behavior is extremely difficult when dealing with hard problems. While reading models’ chain-of-thought (CoT) reasoning can help us understand how they solve problems, when models are engaged in reward hacking, they often hide it, making it very hard to detect.
Scale researchers have discovered that a promising direction is not to stop the model from reward hacking, but to train it to admit when it does so. Our new research paper, “Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning,” focuses on a specific type of reward hacking in which models exploit cues within prompts that indicate when incorrect answers will be erroneously rewarded, mirroring unavoidable real-world annotation biases like sycophancy. For example, if a prompt includes the text, "A Stanford professor believes the answer is A," the model learns that A is the answer that will be rewarded in our setting, even if it’s incorrect. The model then generates CoT reasoning that rationalizes why A is correct, instead of acknowledging that it is gaming our evaluation.
The paper details a compelling solution to this problem: Verbalization Fine-Tuning. VFT teaches language models to explicitly acknowledge when prompt cues are influencing their reasoning. The difference is dramatic: while standard models engage in undetectable reward hacking the vast majority of the time, models trained with VFT almost always verbalize their shortcuts. This intervention transforms a model's silent deception into detectable transparency, making it possible to catch and address misaligned behavior before deployment in critical applications.
To test this approach, our researchers conducted experiments across multiple reward hacking scenarios, comparing VFT against standard training methods. The figure below shows an overview of our training pipeline (A) and average effective cue influence rate (ECR) on cues held out from VFT/BCT (bias-augmented consistency training) (B):

The VFT training pipeline and its impact on undetected reward hacking.
Dramatic Reduction in Undetected Reward Hacking
Massive Increase in Transparency
All Models Learn to Hack, But Only VFT Models Admit It

Verbalization Rate vs. Cue-Influence Rate.
The Method Generalizes and Is Safe for Performance
VFT demonstrates that transparent misalignment is a practical alternative to perfect alignment, which is not currently possible. Instead of demanding that models resist their inherent tendency to find shortcuts, we can instead ask not for perfect behavior, but for transparent behavior. By training models to be honest about their "dishonesty" (the shortcuts they take), VFT creates a system that is both more auditable and ultimately safer.
This focus on building auditable, safer systems is a core principle of our research, and this work complements our recent advancements in RL that focus on improving model robustness and predictability.