
Precog: Scale's platform for data quality post-training experiments

By Edward Gan on June 9, 2025

At Scale, operations, engineering, and research teams work together to ensure the quality of our data. To do this, we rely on a combination of human review, automated linters, data distribution analyses, and model training experiments. In this post, we will focus on the last category and introduce Precog, our platform for running data quality experiments by training models on our own datasets. 

Training models on our data allows us to evaluate data in the context of how the data will ultimately be used. These experiments let us catch specific bugs and make strategic improvements to our datasets: we can confidently make changes by examining how they affect the performance of downstream models. Thus, our data experimentation platform is a key component in delivering data that drives the frontier of LLM development.

In the remainder of this post, we will introduce the general workflow for these experiments, discuss a number of case studies of their impact, and then explore how the design of our platform enables fast experimentation.

Workflow

Our model training data quality experiments are similar to those in DataComp-LM and to the post-training data ablations performed for models such as Tulu 3. Starting with a baseline open-weights LLM, we perform post-training with the dataset we are interested in evaluating. After training is complete, we select the best model checkpoint and evaluate it on model benchmarks, comparing results with the baseline. By inspecting the final benchmark results as well as individual model responses, we can assess data quality and decide on appropriate fixes or changes to the dataset.

Precog experiment workflow. This workflow can be extended to multiple versions or subsets of a dataset to get finer-grained signals on the effectiveness of the data.

Data Subset                                        GSM8K    MMLU
Control (baseline model)                             60       50
RLHF (PPO) on dataset with prompting strategy A      70       55
RLHF (PPO) on dataset with prompting strategy B      55       50

Example Analysis Report. In this table, training a model on data generated with prompting strategy A yields stronger benchmark performance than strategy B or the baseline, a clear signal that prompting strategy A produces better data.
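To make the comparison step concrete, the Python sketch below compiles scores like the ones above into a report and computes deltas against the baseline. The subset names and scores are taken from the example table; everything else (and the idea that you would do this in pandas) is illustrative only, since in practice these numbers come from the Precog eval jobs described later in this post.

import pandas as pd

# Illustrative benchmark scores keyed by data subset; in practice these come
# from Precog eval jobs rather than hard-coded values.
results = {
    "Control (baseline model)": {"GSM8K": 60, "MMLU": 50},
    "RLHF (PPO), prompting strategy A": {"GSM8K": 70, "MMLU": 55},
    "RLHF (PPO), prompting strategy B": {"GSM8K": 55, "MMLU": 50},
}

report = pd.DataFrame(results).T
baseline = report.loc["Control (baseline model)"]

# Positive deltas mean the dataset variant improved the trained model
# relative to the baseline.
deltas = report - baseline
print(report)
print(deltas)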

Case Studies

Recently, our researchers have been running hundreds of experiments every month on the Precog platform. In many cases, we were able to confirm the quality of our datasets with model eval improvements on benchmarks like Global MMLU and MMMU. Sometimes these experiments have also inspired improvements in our dataset design. We present three case studies below.

Simplifying math formatting

In this dataset, we had human oracle responses to math questions. We compared LLMs fine-tuned on this data with baseline LLMs and found that the human responses led to worse performance on math benchmarks (MGSM). Inspecting the outputs of the fine-tuned LLMs revealed that overly rigorous math formatting in the human responses was degrading the models' answers, and we adjusted the human contributor instructions to address this.

Eliminating human response biases

In this dataset, we had both model response preference data and human expert-written responses. We trained reward models on different subsets of both the model responses and the human responses. Results on established reward model benchmarks (PPE, RewardBench) as well as our own internal validation sets showed that training on these human responses led to reward hacking. Inspecting the reward model outputs revealed biases in the human responses, including fewer pleasantries at the beginning of a response than LLMs typically produce, and markdown formatting artifacts such as the spacing after bullet points. Adjusting the instructions to our contributors, changing model response strategies, and adding new checks for these artifacts dramatically improved the dataset.
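As a rough illustration of the kind of automated artifact checks this motivated, here is a minimal Python sketch. The heuristics, pleasantry list, and function names are hypothetical and far simpler than our production linters.

import re

# Hypothetical heuristics inspired by the biases described above.
PLEASANTRY_PREFIXES = ("sure", "certainly", "of course", "great question")

def starts_with_pleasantry(response: str) -> bool:
    # LLM responses often open with a pleasantry; the human responses rarely did.
    return response.strip().lower().startswith(PLEASANTRY_PREFIXES)

def has_bullet_spacing_artifact(response: str) -> bool:
    # Rough heuristic: flag markdown bullets with no space or extra spaces after
    # the bullet marker, e.g. "-item" or "-  item" instead of "- item".
    return bool(re.search(r"^\s*[-*](\S|\s{2,})", response, flags=re.MULTILINE))

def flag_response(response: str) -> list[str]:
    flags = []
    if starts_with_pleasantry(response):
        flags.append("pleasantry_prefix")
    if has_bullet_spacing_artifact(response):
        flags.append("bullet_spacing")
    return flags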

Removing spurious correlations in model response preference data

These datasets contained preference data between different model responses, and we identified distribution shifts between the responses that led to model overfitting. To detect this, we trained reward models on artificial variants of our data: in some experiments we randomly swapped the chosen and rejected responses, and in others we omitted the prompts during training. These modifications should remove most of the true preference signal we are interested in. So when models trained on these artificial datasets still achieved high in-distribution validation scores while scoring poorly on external benchmarks, we could infer that they were overfitting to spurious correlations present in the data.
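A minimal sketch of how such artificial variants might be constructed, assuming a simple record format with prompt, chosen, and rejected fields; the field names and helper functions are illustrative, not our actual data pipeline.

import random

def swap_labels_randomly(records, p=0.5, seed=0):
    # Randomly swap chosen/rejected so any true preference signal is destroyed.
    rng = random.Random(seed)
    swapped = []
    for record in records:
        record = dict(record)
        if rng.random() < p:
            record["chosen"], record["rejected"] = record["rejected"], record["chosen"]
        swapped.append(record)
    return swapped

def drop_prompts(records):
    # Remove the prompt so a reward model can only rely on response-only cues.
    return [{**record, "prompt": ""} for record in records]

# A reward model that still scores well on in-distribution validation data after
# these transformations is likely keying on spurious correlations (e.g. model
# style or formatting) rather than genuine preference signal.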

With these experiments we identified and addressed issues in a variety of datasets. We discuss two of them below. 

In the first, we found that rejected responses were being consistently sourced from older models, leading to datasets that reflected unrelated changes in model style. After updating the data collection process to source both the chosen and rejected responses from the same (randomly selected) model, the trained reward models had better generalization.

In the second, we found that certain model sampling parameters led to low writing quality. After introducing checks for poor writing quality, adjusting the sampling parameters, and filtering out bad responses, the trained reward models again had better generalization.

Platform Design

We built the Precog platform to allow researchers to run data quality experiments reliably and efficiently. The core design principle has been to allow flexibility in data processing and launching multiple experiments, while standardizing the training and model benchmarking pipelines.

Each Precog experiment is defined in a self-contained notebook that performs any necessary data processing and parameter configuration. The notebook makes API calls to the Precog platform for training, model evals, job management, and compiling results. This makes each experiment reproducible and allows the Precog platform to standardize job hardware configurations as well as the training and benchmarking logic. It also gives us visibility into all of the current and previous experiments across the company.
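A notebook cell of this shape might look like the following sketch, which filters a candidate dataset and snapshots it to S3 before submitting jobs via the APIs described below. The file names, bucket, and filtering logic are illustrative only.

import pandas as pd
import boto3

# Illustrative data processing: load a candidate dataset, drop duplicate
# prompts, and keep only rows that passed human review.
df = pd.read_json("math_dataset_raw.jsonl", lines=True)
df = df.drop_duplicates(subset="prompt")
df = df[df["review_status"] == "approved"]

# Snapshot the processed dataset to S3 (illustrative bucket and key); the
# resulting S3 path is what the training API below takes as its dataset input.
df.to_json("data.jsonl", orient="records", lines=True)
boto3.client("s3").upload_file("data.jsonl", "precog-datasets", "experiments/math-v2/data.jsonl")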

The most important APIs are the ones that enable standardized post-training and LLM eval jobs, and we discuss these below. Both training and eval jobs are executed on our Kubernetes cluster and scheduled using Kueue, with users customizing their job priority and hardware resource requirements as needed.

Training

API Sketch: 

POST /training_job
 training_config:
   base_model
   dataset: s3_path
   training_alg: SFT | REWARD | DPO | PPO | GRPO
   hyperparams:
     ...
 priority: LOW | MED | HIGH
 hardware_config
 rlxf_version

We expose a simplified training pipeline for reproducible experiments in Precog. Users can specify a base model (usually an instruction-tuned open-weights model from the Llama family), a training dataset (snapshotted in S3), a training algorithm, and other parameters.
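For illustration, a request matching the sketch above might look like the following; the base URL, model name, dataset path, hyperparameters, hardware config, and image tag are placeholders rather than the production API.

import requests

PRECOG_URL = "https://precog.internal.example"  # placeholder endpoint

job = requests.post(
    f"{PRECOG_URL}/training_job",
    json={
        "training_config": {
            "base_model": "llama-3.1-8b-instruct",  # illustrative instruction-tuned base
            "dataset": "s3://precog-datasets/experiments/math-v2/data.jsonl",
            "training_alg": "SFT",
            "hyperparams": {"learning_rate": 1e-5, "epochs": 2},  # illustrative values
        },
        "priority": "MED",
        "hardware_config": {"gpus": 8},      # illustrative
        "rlxf_version": "rlxf:2025.05",      # pinned RLXF image tag (illustrative)
    },
    timeout=30,
).json()
print(job)  # e.g. a job id that can be polled via the job management APIs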

For the training algorithms, we use our in-house post-training platform, RLXF. We run experiments on model sizes ranging from 3B to 400B+ parameters, with context lengths up to 1M tokens. To enable this, RLXF is built on PyTorch native APIs and leverages the Ray framework and vLLM for efficient, scalable, and robust post-training across a variety of internal hardware configurations (from single GPUs to many nodes with 5D parallelism strategies).

Our training algorithm implementations are under continuous development, so we adopt a simple versioning scheme: RLXF releases are snapshotted as tagged Docker images for reproducibility.

Evals

API Sketch: 

POST /eval_job
 model_checkpoint_path: s3_path
 benchmark_name: GSM8K | MMLU | LIVECODEBENCH | ...
 harness_version
 harness_params:
   temperature
   max_tokens
   ...
 priority: LOW | MED | HIGH
 hardware_config

To support experiments on the wide variety of datasets at Scale, our evals aim to provide broad domain coverage, currently spanning coding, reasoning, instruction following, visual reasoning, and several other domains. We build on top of open-source harnesses such as lm-eval-harness, adding extensions and orchestration layers that standardize eval parameter configuration and save both the aggregate metrics and the per-testcase responses to cloud storage for later inspection.

One key challenge for running evals is having stable baselines: small differences in prompting strategies or even Python library versions can have a significant impact on the final eval results. Thus, in Precog every LLM benchmark is implemented as a versioned and parameterized Docker image, with standardized hardware configurations for executing it on different sizes of models (8B, 30B, 70B, etc.). At times we have found limitations and bugs in open-source implementations, so we use regression tests to align our results with published papers and leaderboards.
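For illustration, an eval job request and a regression-style sanity check might look like the sketch below; the endpoint, checkpoint path, harness tag, reference score, and response fields are all assumptions for the example.

import requests

PRECOG_URL = "https://precog.internal.example"  # placeholder endpoint

result = requests.post(
    f"{PRECOG_URL}/eval_job",
    json={
        "model_checkpoint_path": "s3://precog-checkpoints/exp-123/best",  # illustrative path
        "benchmark_name": "GSM8K",
        "harness_version": "lm-eval-harness:0.4.3",  # pinned harness image (illustrative tag)
        "harness_params": {"temperature": 0.0, "max_tokens": 1024},
        "priority": "LOW",
        "hardware_config": {"gpus": 1},
    },
    timeout=30,
).json()

# Regression-style check: a baseline model's score should stay within a small
# tolerance of its published reference before we trust experiment deltas.
PUBLISHED_GSM8K_SCORE = 0.79  # illustrative reference for the baseline model
assert abs(result["score"] - PUBLISHED_GSM8K_SCORE) < 0.02, "eval harness drifted from reference"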

Conclusion

By investing in a platform for reproducible model training and eval experiments, we have been able to drive improvements in data quality across multiple projects at Scale. Keeping up with the latest model training improvements and new types of evals is a recurring challenge, but these investments give researchers easy access to feedback on the quality of their data, closing the loop between the design of our data production pipelines and how that data is ultimately used downstream.

Acknowledgements:

Core Eng Contributors: 
Edward Gan*, Anant Marur, Ian Macleod, Meher Mankikar, Yi Xu

Core Research Contributors:
Lifeng Jin, Swarnashree Mysore Sathyendra, Rakshith Sharma Srinivasa, Robert Vacareanu, Zihao Wang, Bing Liu, Summer Yue

Special thanks to Qin Lyu, Zijian Hu, and Nikhil Barhate for their work on RLXF, and Will Song and Tiffany Zhao for their work on our Kubernetes cluster.

*Blog Post Author

