
Aligning Chatbots Before It's Too Late

October 24, 2024

Scale’s paper, Learning Goal-Conditioned Representations for Language Reward Models, has been accepted to the NeurIPS 2024 main track. This post overviews the research presented in that paper.
Paper Authors: Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx

For large language models (LLMs) to power useful tools, from coding assistants to autonomous browser agents, their behavior must be aligned with the instructions users give. Traditional alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), evaluate only the completed output. As a result, missteps made during generation often go uncaught until it is too late, compounding reasoning errors.

A new approach from researchers at Scale AI, inspired by goal-conditioned reinforcement learning, allows a reward model to provide early feedback throughout the generation process. By giving language models an ongoing sense of whether they're on the right track, this conditioning can improve performance on reasoning tasks.

The approach is designed to detect missteps during generation by extracting richer signals from standard human preference labels. It identifies points where a trajectory diverges from the desired outcome, allowing for resampling or other corrections without direct intervention or additional fine-grained labels.

Goal-Conditioned Representations: The Value of Early Feedback

Goal-conditioned representations provide ongoing feedback during text generation, letting the model adjust based on whether it is moving toward the desired outcome. Instead of evaluating only the final output, the reward model checks progress throughout generation, so errors can be corrected early.

Q-function in Goal-Conditioned Representations

Mathematically, a Q-function, denoted Q(s, a), represents the expected cumulative reward obtained by starting from state s, taking action a, and following a specific policy thereafter. In goal-conditioned reinforcement learning, the Q-function estimates the future reward of a partially generated sequence, helping the model decide on the best course of action in real time.
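
For concreteness, a generic formulation is sketched below in LaTeX; the notation is standard goal-conditioned RL, and the paper's exact conditioning and symbols may differ. With a sparse reward that is 1 only when the goal g (for example, a correct final answer) is reached and no discounting, the Q-value of a partial sequence reduces to its probability of eventually succeeding:

    % Standard Q-function: expected discounted return from state s and action a under policy \pi
    Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a \right]

    % Goal-conditioned variant with sparse binary reward and \gamma = 1:
    % the value of the prefix s_{1:t} extended by token a is its success probability.
    Q^{\pi}(s_{1:t}, a \mid g) = \Pr_{\pi}\!\left[ \text{goal } g \text{ is reached} \,\middle|\, s_{1:t},\ a \right]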

For example, in the GSM8k-style problem below, the model is expected to generate Python code to compute a solution. The Q-function evaluates the expected correctness of the solution after each token in the generated code, implicitly revealing the point at which an error is introduced:

Figure: Excerpt from Appendix D showing an example of the method evaluating the correctness of a math problem as the solution is being given. At the point where a clear error is introduced (multiplying the per-dozen price of a box of donuts by 12, in the fifth line of the response), the Q-function's estimate of whether the answer is correct quickly turns negative.
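
A minimal sketch of what this per-token monitoring could look like at inference time is shown below. The reward model interface here (a score_prefix method returning the estimated chance that the partial solution still leads to a correct answer) is an assumption for illustration, not the paper's actual API; any goal-conditioned reward model exposing per-prefix scores would slot in the same way.

    import torch

    def monitor_generation(policy_model, reward_model, tokenizer, prompt,
                           threshold=0.5, max_new_tokens=256):
        """Generate greedily, token by token, and flag the first step where the
        estimated probability of reaching a correct answer drops below `threshold`."""
        generated = tokenizer(prompt, return_tensors="pt").input_ids
        for step in range(max_new_tokens):
            with torch.no_grad():
                logits = policy_model(generated).logits[:, -1, :]
            next_token = logits.argmax(dim=-1, keepdim=True)  # greedy for simplicity
            generated = torch.cat([generated, next_token], dim=-1)

            # Hypothetical goal-conditioned scorer: how likely is this prefix to
            # still end in a correct solution?
            q_value = reward_model.score_prefix(generated)
            if q_value < threshold:
                print(f"Possible misstep at step {step}: Q = {q_value:.2f}")
                break  # a caller could instead resample from an earlier point
            if next_token.item() == tokenizer.eos_token_id:
                break
        return tokenizer.decode(generated[0], skip_special_tokens=True)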

Evaluating Partial Solutions with Goal-Conditioned Representations

Goal-conditioned representations rely on a Q-function to assess partially generated sequences and evaluate their likelihood of achieving the desired outcome.

A contrastive objective refines these representations: it encourages representations of partial sequences to be similar to those of successful completions and dissimilar to those of unsuccessful ones, allowing for real-time adjustments during inference. This extracts fine-grained signals from pairwise preference annotations of completed outputs, improving the model's ability to learn from coarse feedback.
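
The sketch below illustrates one way such a contrastive objective could be written, assuming we already have hidden states for intermediate tokens of a preferred (successful) response and a dispreferred (unsuccessful) one, plus a representation of the goal (for example, the final hidden state of the preferred response). The function name, similarity measure, and InfoNCE-style form are illustrative choices, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def goal_contrastive_loss(pos_states, neg_states, goal_state, temperature=0.1):
        """InfoNCE-style loss: intermediate states of the preferred trajectory are
        pulled toward the goal representation; states of the dispreferred one are
        pushed away.

        pos_states: (T_pos, d) hidden states from the preferred response
        neg_states: (T_neg, d) hidden states from the dispreferred response
        goal_state: (d,)       representation of the desired outcome
        """
        pos_sim = F.cosine_similarity(pos_states, goal_state.unsqueeze(0), dim=-1) / temperature
        neg_sim = F.cosine_similarity(neg_states, goal_state.unsqueeze(0), dim=-1) / temperature

        # Score each positive state against all negative states; the "correct class"
        # (index 0) is always the positive similarity.
        logits = torch.cat([pos_sim.unsqueeze(1),
                            neg_sim.unsqueeze(0).expand(pos_sim.size(0), -1)], dim=1)
        labels = torch.zeros(pos_sim.size(0), dtype=torch.long)
        return F.cross_entropy(logits, labels)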

Benefits of Goal-Conditioned Representations

Goal-conditioned representations offer two main benefits: improved accuracy and fine-grained control. Continuous feedback enables better performance across many reasoning tasks.

Benchmark evaluations with CodeLlama 7B showed significant improvements when goal-conditioned representations were used to guide inference. On mathematical reasoning tasks such as GSM8k and MATH, the approach improved AUROC by up to 0.09 over standard preference-ranking training. On out-of-distribution (OOD) datasets, the average improvement was 0.04, with a maximum of roughly 0.08 on the SVAMP dataset. The approach also improved accuracy on the Helpful-Harmless alignment dataset by 2.3%.

Goal-conditioned representations also provide fine-grained control, allowing the model's outputs to be steered toward desired qualities such as helpfulness or complexity. Attributes like correctness and coherence can be represented as goals that guide the model toward better responses.
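
As a concrete illustration of how such steering could be used at inference time, the sketch below performs chunk-level guided decoding: sample a few candidate continuations, score each against the chosen goal, and keep the best one. The generate_text and score_against_goal methods and the goal argument are hypothetical stand-ins for a sampling policy and a goal-conditioned scorer, not APIs from the paper.

    def guided_decode(policy, scorer, prompt, goal,
                      num_candidates=4, num_chunks=8, chunk_tokens=32):
        """Extend the response chunk by chunk, keeping the candidate continuation
        that the goal-conditioned scorer rates as most aligned with `goal`."""
        text = prompt
        for _ in range(num_chunks):
            # Sample several possible continuations of the current prefix.
            candidates = [policy.generate_text(text, max_new_tokens=chunk_tokens,
                                               do_sample=True)
                          for _ in range(num_candidates)]
            # Hypothetical scorer: higher means the partial response better matches
            # the goal attribute (e.g., correctness, helpfulness, or coherence).
            scores = [scorer.score_against_goal(text + c, goal) for c in candidates]
            best = max(range(num_candidates), key=lambda i: scores[i])
            text += candidates[best]
        return text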

Key Takeaway

Goal-conditioned representations provide early feedback that can help align language models during inference. The method extracts richer signals from standard preference-ranking datasets, leveraging intermediate representations without requiring additional human labels or fine-grained annotations. Because these signals come from existing data, they enable better alignment with less annotation effort.

