Miles Turpin | Scale AI Blog

Beyond the Black Box: Teaching Models to Verbalize Reward Hacking

One of AI's biggest challenges is "reward hacking," where a model learns to game the system to reach a rewarded answer instead of actually reasoning its way there. When the shortcut stays hidden, the model's stated reasoning cannot be trusted. Scale research has found a promising solution: instead of trying to stop the hacking, get the model to admit to it in its chain-of-thought reasoning. This new paper details how Verbalization Fine-Tuning (VFT) trains models to announce their shortcuts, raising transparency (how often the model discloses a shortcut it relied on) from 11% to 94% and making AI systems safer.