Enterprise Reinforcement Learning with Rubrics as Rewards

Many enterprise problems do not have simple black-and-white solutions. This gray area is precisely where common verifiable-reward post-training methods fall short: they rely on simple yes/no reward signals that often fail to capture the multi-faceted nature of enterprise needs. To solve this problem, Scale researchers developed a new method called Rubrics as Rewards (RaR) that extends Reinforcement Learning with Verifiable Rewards (RLVR), a recent post-training method.
RLVR is strongest with objective problems because it uses a simple, programmatic check (like matching a final answer or passing a test case) to provide a clear "correct" or "incorrect" reward signal. RaR adapts this approach for more subjective areas by breaking a complex problem down into a checklist of smaller, verifiable sub-questions in the form of a rubric. This rubric replaces the single objective answer with a detailed, multi-faceted evaluation, improving accuracy on tasks that lack a single ground truth and enabling us to build agents that learn, adapt, and discover novel solutions to challenging problems.
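To make the distinction concrete, the sketch below contrasts a binary RLVR-style check with a rubric-style reward. The function names and the explicit weighted sum are purely illustrative; as described later, our experiments actually let the judge aggregate rubric scores implicitly rather than summing points.

```python
# Illustrative contrast between a binary RLVR-style reward and a
# rubric-based reward. All names here are hypothetical.

def rlvr_reward(model_answer: str, ground_truth: str) -> float:
    """RLVR-style check: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def rubric_reward(criterion_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Rubric-style reward: a weighted blend of per-criterion judge scores
    (each in [0, 1]), so partial credit and reasoning quality both count."""
    total_weight = sum(weights.values())
    return sum(weights[c] * criterion_scores.get(c, 0.0) for c in weights) / total_weight

# Example: a response with the right conclusion but weak statute coverage
# still earns a graded reward instead of a flat 0 or 1.
scores = {"correct_outcome": 1.0, "statute_application": 0.4, "fact_coverage": 0.7}
weights = {"correct_outcome": 0.5, "statute_application": 0.3, "fact_coverage": 0.2}
print(rubric_reward(scores, weights))  # 0.76
```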
How RaR Works
At the core of RaR is a two-model reinforcement learning loop powered by Group Relative Policy Optimization (GRPO). Here’s how it works (a rough code sketch follows the list):
- A student model receives the full case background or problem input and produces both a reasoning trace and a conclusion in structured form.
- A judge model evaluates the student’s output against the rubric and ground truth, returning both an overall numeric score and sub-scores for each criterion.
- The numeric score serves as the reward signal for GRPO, allowing the student to continually refine its performance.
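The sketch below traces this loop in Python. The `student` and `judge` objects and their methods are hypothetical stand-ins for the actual training stack, and the group-relative advantage calculation reflects the general GRPO recipe rather than our exact implementation.

```python
# Minimal sketch of the student-judge reward loop described above.
# `student`, `judge`, and their methods are hypothetical stand-ins.
import statistics

def grpo_step(student, judge, case_prompt: str, rubric: list[dict], group_size: int = 8):
    # 1. The student samples a group of candidate analyses for the same case.
    candidates = [student.generate(case_prompt) for _ in range(group_size)]

    # 2. The judge scores each candidate against the rubric (and ground truth),
    #    returning an overall score plus per-criterion sub-scores.
    rewards = [judge.score(candidate, rubric)["overall"] for candidate in candidates]

    # 3. GRPO-style group-relative advantages: each reward is compared with
    #    the mean (and spread) of its own group, so no value model is needed.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]

    # 4. The policy update nudges the student toward higher-advantage outputs.
    student.update(case_prompt, candidates, advantages)
    return rewards
```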
The rubrics themselves are structured and interpretable. Each rubric contains 7 to 20 items, organized into four categories:
- Essential (critical for correctness)
- Important (key reasoning and completeness)
- Optional (helpful but not required)
- Pitfall (common mistakes to avoid)
This design makes evaluation consistent, transparent, and auditable. Unlike opaque preference-based methods (where models learn from simple "A is better than B" labels), rubric-based training lets every reward signal be traced back to explicit, human-understandable criteria. Training is further guided by tuning key hyperparameters (the system’s technical "dials") to balance raw predictive accuracy with high-quality reasoning. The goal is not only to predict the right answer but also to explain it in a way that aligns with expert standards.
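As an illustration, a rubric in this format might be encoded as follows. The `RubricItem` structure and the per-item weights are assumptions made for the sketch; only the four category labels come from the list above.

```python
# Hypothetical encoding of a rubric using the four categories described above.
# The weights (including a negative weight for pitfalls) are illustrative only.
from dataclasses import dataclass

@dataclass
class RubricItem:
    category: str   # "Essential" | "Important" | "Optional" | "Pitfall"
    criterion: str  # the verifiable sub-question the judge checks
    weight: float   # relative importance; illustrative, not from the experiment

rubric = [
    RubricItem("Essential", "Does the conclusion identify the correct outcome?", 1.0),
    RubricItem("Important", "Is the governing rule stated and applied to the facts?", 0.6),
    RubricItem("Optional", "Does the analysis note relevant secondary authority?", 0.2),
    RubricItem("Pitfall", "Conflates the parties' burdens of proof.", -0.5),
]
```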
Enterprise Use Case: Training an AI Legal Analyst
To test this method, we trained a small, open-weight language model to generate case analyses with rubric-aligned reasoning and conclusions on a small, focused dataset of 51 training and 41 test cases. The goal was to demonstrate that we could build a model with greater transparency, control, and adaptability than a closed-source frontier LLM, at a fraction of the cost.
For instance, the student model might receive a prompt detailing the specifics of a case:
The central issue in this case is whether inventory services provided by the Plaintiff fall under an exemption for 'electronic data processing services' and related software, as set forth by West Virginia law. The Plaintiff, RGIS Inventory Specialists, argues that its services, which involve using proprietary equipment and software to conduct physical inventories for retail stores, should be classified as exempt. The State Tax Commissioner contends that these services are not primarily 'electronic data processing' but rather a taxable service of counting physical inventory, making them subject to sales tax.
The judge model would then evaluate the student's output against a detailed rubric containing criteria like the following (a sketch of how such a rubric might be encoded as data appears after the list):
- Correctness: Did the reasoning correctly identify that the court affirmed the Tax Commissioner's assessment, ruling against the taxpayer?
- Statute Application: Does the reasoning correctly apply West Virginia Code § 11-15-9(a)(7) and its definition of "electronic data processing services"?
- Core Argument: Does the reasoning explain that the "true object" of the service was determined to be the counting of inventory, not the processing of data?
- Completeness: Did the reasoning mention the court's analysis of the taxpayer's proprietary equipment and software?
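Encoded as data for the judge, that rubric might look something like the sketch below. The criterion names and questions are taken from the example above; the category assignments and dictionary structure are illustrative assumptions.

```python
# The example criteria above as rubric items a judge model could check.
# Category assignments and the dict layout are illustrative, not the exact schema.
legal_rubric = [
    {"name": "Correctness", "category": "Essential",
     "criterion": ("Did the reasoning correctly identify that the court affirmed "
                   "the Tax Commissioner's assessment, ruling against the taxpayer?")},
    {"name": "Statute Application", "category": "Essential",
     "criterion": ("Does the reasoning correctly apply West Virginia Code "
                   "§ 11-15-9(a)(7) and its definition of 'electronic data "
                   "processing services'?")},
    {"name": "Core Argument", "category": "Important",
     "criterion": ("Does the reasoning explain that the 'true object' of the service "
                   "was determined to be the counting of inventory, not the "
                   "processing of data?")},
    {"name": "Completeness", "category": "Important",
     "criterion": ("Does the reasoning mention the court's analysis of the "
                   "taxpayer's proprietary equipment and software?")},
]
```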
We tailored the student–judge loop for the legal domain. The student model was prompted to reason using an IRAC (Issue, Rule, Application, Conclusion) structure to mimic expert legal analysis, outputting its findings in a structured JSON format. While the original RaR research used domain-agnostic rubrics, here we designed a highly specific, task-aligned rubric for law. The judge model used this rubric to score outputs on criteria such as correctness against the ground truth, coverage of facts, citation accuracy, and alignment with the provided statutes.
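For illustration, a student response in that structured format might look like the following. The field names are hypothetical rather than the exact schema used in the experiment, and the content paraphrases the example case above.

```python
# Hypothetical shape of the student's IRAC-structured JSON output.
import json

student_output = {
    "issue": "Whether the Plaintiff's inventory services qualify as exempt "
             "'electronic data processing services' under West Virginia law.",
    "rule": "West Virginia Code § 11-15-9(a)(7) exempts electronic data "
            "processing services and related software from sales tax.",
    "application": "The 'true object' of the service is the physical counting "
                   "of inventory; the proprietary equipment and software are "
                   "incidental to that service.",
    "conclusion": "The court affirmed the Tax Commissioner's assessment; the "
                  "services are taxable.",
}
print(json.dumps(student_output, indent=2))
```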
The final score was determined through implicit aggregation, where the judge holistically evaluates the entire response against the rubric, a method proven more effective than simple point-summing.
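A minimal sketch of such an implicit-aggregation judge prompt is below; the wording, scoring scale, and helper function are illustrative assumptions. The rubric items it consumes are dicts shaped like the legal rubric sketched earlier. The key design choice is that the judge weighs the rubric holistically and emits the overall score itself, rather than the trainer summing points per item.

```python
# Sketch of an implicit-aggregation judge prompt: the judge reads the whole
# response against the whole rubric and returns one holistic score plus
# per-criterion sub-scores. Prompt wording and scale are illustrative.

JUDGE_PROMPT_TEMPLATE = """You are an expert legal evaluator.

Case background:
{case_background}

Candidate analysis:
{student_response}

Rubric:
{rubric_items}

Considering the rubric as a whole, judge how well the analysis satisfies it.
Weigh essential criteria most heavily and penalize listed pitfalls.
Return JSON: {{"overall": <0-10>, "sub_scores": {{"<criterion name>": <0-1>, ...}}}}
"""

def build_judge_prompt(case_background: str, student_response: str, rubric_items: list[dict]) -> str:
    # Render each rubric item as a bullet the judge can check.
    bullet_list = "\n".join(
        f"- [{item['category']}] {item['name']}: {item['criterion']}" for item in rubric_items
    )
    return JUDGE_PROMPT_TEMPLATE.format(
        case_background=case_background,
        student_response=student_response,
        rubric_items=bullet_list,
    )
```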
Small Model, Superior Performance
Our findings show that a small model, fine-tuned with a domain-specific rubric, can match or outperform a much larger, general-purpose model on a specialized task. The student model was optimized to produce IRAC-structured reasoning with correct statute use, coverage of the relevant facts, and avoidance of common pitfalls, with outcome (case-winner) correctness as just one criterion among many. On our 41-case test set, this holistic training translated directly into stronger outcome accuracy. The results are below:
| Run Name | Manual Score | Setting |
| --- | --- | --- |
| Qwen3-4B-Instruct-2507 | 71.79 | zero-shot |
| GPT-4.1 | 76.92 | zero-shot |
| Qwen3-4b-instruct-rubric | 79.49 | fine-tuned |
The small Qwen3-4B-Instruct-2507 model, when trained with our rubric-based method, not only improved its accuracy but also surpassed the zero-shot performance of the much larger GPT-4.1.
This outcome matches what we found in our large-scale study. In that research, RaR achieved up to a 28 percent relative improvement on the HealthBench-1k medical benchmark and strong gains on the GPQA-Diamond science benchmark, proving the method scales across domains. Beyond surpassing binary correctness rewards, RaR consistently outperformed preference-based baselines such as Simple-Likert and Reference-Likert scoring, showing that structured rubrics provide a more reliable and interpretable signal.
This powerful result is not unique to legal analysis. In a separate experiment focused on taxability determination for business transactions, a similarly trained small model achieved an impressive ~98% accuracy, again significantly outperforming large, general-purpose models on the same task. This further validates the effectiveness of using specialized, rubric-trained models for high-stakes enterprise workflows.
Why This Matters for Enterprise
Many critical enterprise tasks do not have a simple yes/no answer, and there may be insufficient data to train a model effectively with traditional methods. However, if you have examples of what a successful reasoning chain looks like (what your experts consider a high-quality output), then you can construct a rubric. This allows you to train smaller, more efficient models on the complex, subjective tasks that drive your business.
This efficiency extends to the evaluation process itself. Rubrics allow smaller, more cost-effective judge models to score responses with accuracy that rivals much larger models. Additionally, our process allows for very small datasets to produce outsized results. For enterprises, this means scalable and affordable AI training with control over the criteria that matter most. RaR is especially valuable because smaller open-weight models, trained with rubrics, can rival and even outperform much larger closed-source models. That means lower costs, more transparency, and tighter control, advantages that directly translate into enterprise readiness.
Small Models Can Punch Above Their Weight
Our method enables us to train smaller, open-source models to match or outperform larger frontier models on enterprise tasks by directly steering the reasoning process with reinforcement learning. Scale is working with leading enterprises to implement these techniques, enabling our customers to deploy agents that learn and adapt to their specific processes, delivering superior performance while reducing implementation time.
Looking ahead, our research points to exciting directions: using rubrics as implicit curricula (where models gradually master harder criteria), extending RaR to open-ended agentic tasks, and evaluating robustness against reward hacking. These directions highlight RaR not only as a present solution but also as a foundation for future enterprise AI training.
To learn more about how Scale can help you apply reinforcement learning to your workflows, book a meeting below.