
Beyond the Black Box: Teaching Models to Verbalize Reward Hacking

by Matthew Siegel and Miles Turpin

Reading time: 6 min