Research

Beyond the Black Box: Teaching Models to Verbalize Reward Hacking

A fundamental challenge in AI alignment is reward hacking—when AI models learn to optimize for their rewards in unintended ways. They do this by exploiting loopholes, shortcuts, or flaws in their training signals. They’re learning how to get their “rewards,” but without solving the tasks in the intended way, which can be dangerous in high-stakes applications. Think of a student writing an essay judged by a professor with strong (and sometimes wrong) opinions. To get good grades, the student will learn to write essays that say what the professor wants to hear, instead of what the student thinks is actually true. Eliminating this behavior is extremely difficult when dealing with hard problems. While reading models’ chain-of-thought (CoT) reasoning can help us understand how they solve problems, when models are engaged in reward hacking, they often hide it, making it very hard to detect.

Scale researchers have discovered that a promising direction is not to stop the model from reward hacking, but to train it to admit when it does so. Our new research paper, “Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning,” focuses on a specific type of reward hacking in which models exploit cues within prompts that indicate when incorrect answers will be erroneously rewarded, mirroring unavoidable real-world annotation biases like sycophancy. For example, if a prompt includes the text, "A Stanford professor believes the answer is A," the model learns that A is the answer that will be rewarded in our setting, even if it’s incorrect. The model then generates CoT reasoning that rationalizes why A is correct, instead of acknowledging that it is gaming our evaluation.

A New Solution: Verbalization Fine-Tuning

The paper details a compelling solution to this problem: Verbalization Fine-Tuning. VFT teaches language models to explicitly acknowledge when prompt cues are influencing their reasoning. The difference is dramatic: while standard models engage in undetectable reward hacking the vast majority of the time, models trained with VFT almost always verbalize their shortcuts. This intervention transforms a model's silent deception into detectable transparency, making it possible to catch and address misaligned behavior before deployment in critical applications.

To test this approach, our researchers conducted experiments across multiple reward hacking scenarios, comparing VFT against standard training methods. The figure below shows an overview of our training pipeline (A) and average effective cue influence rate (ECR) on cues held out from VFT/BCT (bias-augmented consistency training) (B): 

The VFT training pipeline and its impact on undetected reward hacking.

Key Results

Dramatic Reduction in Undetected Reward Hacking

  • "Effective Cue Influence Rate" (ECR) is our primary metric for safety, defined as the fraction of overall model responses that are cue-influenced predictions not accompanied by a verbalization.

  • After Reinforcement Learning (RL) training to teach models new reward hacks, models trained first with Verbalization Fine-Tuning (VFT) showed an ECR of just 6% on new cues.

  • In stark contrast, the baseline model had an ECR of 88% after RL training, and the baseline trained to reduce tendency to exploit reward hacks (BCT) performed even worse with an ECR of 99% after RL training.

Massive Increase in Transparency

  • VFT systematically teaches models to be more transparent. The rate at which the VFT model verbalized its use of cues increased from a baseline of 8%, to 42% after VFT, and finally to 94% after RL training.

  • The standard baseline model's verbalization rate remained low throughout training (8% → 11%) , while the BCT method actively suppressed transparency, with its rate dropping from 4% to just 1%.

  • The paper notes that while the rate of verbalization increases dramatically, the quality becomes less informative after RL. The explanations shift from detailed counterfactual reasoning to simpler statements acknowledging the cue's presence.

All Models Learn to Hack, But Only VFT Models Admit It

  • After RL training, all models (including the VFT model) learned to exploit the prompt cues at a similarly high rate, reaching cue influence rates of 99%+ on all cues.

  • The critical difference is that VFT models verbalize this behavior, while the others hide it.

  • This gain in transparency did not cause a significant drop in performance on legitimate, uncued problems. After RL training, the VFT model's accuracy on MMLU questions (63%) was comparable to the baseline RL model (65%).

Verbalization Rate vs. Cue-Influence Rate.

The Method Generalizes and Is Safe for Performance

  • The key results were measured on "held-out cues," demonstrating that VFT teaches a general principle of verbalization that applies to shortcuts the model has not seen before.

  • VFT also proved most effective on cues that were not explicitly amplified during RL training. In these cases, VFT achieved an ECR of 15%, compared to 66% and 63% for the baselines.

  • The study verified that VFT does not cause "over-verbalization." The models maintain high specificity, meaning they correctly avoid claiming influence from a cue when their answer isn't actually affected.

  • The experiments were conducted across seven distinct reward hacking environments using the Llama 3.1 8B Instruct model. 

Transparent Misalignment 

VFT demonstrates that transparent misalignment is a practical alternative to perfect alignment, which is not currently possible. Instead of demanding that models resist their inherent tendency to find shortcuts, we can instead ask not for perfect behavior, but for transparent behavior. By training models to be honest about their "dishonesty" (the shortcuts they take), VFT creates a system that is both more auditable and ultimately safer.

This focus on building auditable, safer systems is a core principle of our research, and this work complements our recent advancements in RL that focus on improving model robustness and predictability.


The future of your industry starts here