by Matthew Siegel and Sean Hendryx

What will the learning environments that train artificial super-intelligence look like? We can find clues in the organizations that consistently produce the world's greatest human talent and breakthroughs. Two key traits are always present, whether in frontier labs, top-tier universities, or leading tech companies: (1) an environment with rich, unhackable rewards that signal true success, and (2) a collaborative, multi-agent structure where interaction produces multiplicative returns.
While recent AI research has shown consistent progress from reinforcement learning with verifiable rewards (RLVR) and separate advancements in the multi-agent space, we believe the next frontier lies in their integration. The most advanced AI systems of the future will not learn in isolation but in rich interactive ecosystems that combine objective, verifiable rewards with multi-agent learning.
In two recent papers, Scale researchers take foundational steps in this direction. We demonstrate novel methods that integrate dynamic natural language feedback, a core component of collaborative learning, with verifiable rewards in a multi-agent, student-teacher setup. While in our initial experiments the student is trained while the teacher is frozen, we expect that in the future this process will be iterated, or that these models will co-evolve together.

In the first paper, we investigate the fundamental mechanics of how models learn via RLVR and then use those insights to accelerate progress.
In the second paper, we apply the student-teacher concept to the complex, multi-step domain of software engineering, where reward signals are incredibly sparse.
At its core, this interactive student-teacher framework is effective because it directly addresses the critical bottleneck in agentic learning: sparse rewards. In a complex task like coding, an agent can make thousands of changes without ever stumbling upon a solution that passes the unit tests, and thus never receive a learning signal. As the length and complexity of tasks increase, the likelihood of an agent reaching a goal state on its own decreases exponentially. By offering a targeted hint in natural language, the teacher collapses an impossibly vast search space into a manageable one, allowing the student to discover a successful trajectory far more efficiently. This mimics how humans learn complex skills through a feedback loop with mentors who provide crucial, timely guidance, rather than through pure, unguided trial and error.
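The effect of a hint on a sparse-reward search can be illustrated with a toy simulation. The sketch below is not Scale's actual training method; it is a minimal, hypothetical model in which a student samples candidate solutions from a discrete space, a verifiable reward fires only on the single passing candidate, and a teacher hint is idealized as shrinking the space to a small subset that still contains the solution.

```python
import random

def success_rate(space_size: int, budget: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which a student finds the one rewarded solution.

    Each trial: a single 'passing' solution is hidden in a space of
    `space_size` candidates; the student samples uniformly at random up to
    `budget` times (sparse, verifiable reward: 1 iff the sample matches).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        solution = rng.randrange(space_size)
        if any(rng.randrange(space_size) == solution for _ in range(budget)):
            wins += 1
    return wins / trials

# Unguided: the student searches a vast space and almost never sees reward.
unguided = success_rate(space_size=10_000, budget=50, trials=500)

# With a teacher hint, modeled as collapsing the space to 20 candidates
# that still contain the solution, reward becomes reachable within budget.
guided = success_rate(space_size=20, budget=50, trials=500)
```

Under these assumed numbers, the unguided student succeeds in well under 1% of trials while the guided student succeeds in the vast majority, which is the exponential gap the hint is meant to close.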
Ultimately, this work is a step toward building AI systems that learn from a broader and more sophisticated diversity of signals. The learning environments of the future will be defined by this rich interplay between objective, verifiable feedback from the environment and collaborative, natural language feedback from other agents. As we scale these methods to even more complex domains and larger models, we anticipate this collaborative learning paradigm will become essential infrastructure for training the next generation of AI systems.
Core Research Contributors: Jeff Da, Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Clinton Wang, Nikhil Barhate, Xiang Deng, Tommy Ma, Sean Hendryx