
Using GANs to Align Human Metrics

by Prachi Bodas on July 18, 2022

At first glance, human-in-the-loop processes seem like a relic of older times. Before modern computers, people performed many of the functions that machines handle today. For example, NASA hired large teams of arithmetic processors: men and women whose only job was to evaluate numerical expressions quickly and accurately. These human calculators were pivotal to enabling space exploration.

At Scale, our labeling teams are just as critical to achieving our mission. We’ve found we’re able to effectively apply the principles behind generative adversarial networks to our human-in-the-loop processes to deliver the highest quality annotations to customers.

How does this work, exactly?

A GAN, or generative adversarial network, is given an unlabeled dataset and learns to produce new examples that are not already in the dataset but are indistinguishable from the items in it. For example, a net meant to produce new images of flowers would take images of real flowers as input. To reach this goal, the net has to satisfy two requirements:

  1. The generated flowers must not be distinguishable from real flowers. That is to say, if given a set of both generated and real flower images, one should not be able to tell which is which.
  2. In achieving the above, the GAN must not simply produce copies of the real flower images.

Therefore, the net must be trained to generate new flowers, discriminate between real and fake flowers, and avoid simply replicating training data. The training loop jointly optimizes the generator, which produces new images that are not copies of the training set, and the discriminator, which attempts to tell training data apart from generated images. The net is successful when the generator becomes good enough to fool the discriminator, i.e., the discriminator cannot outperform random choice when classifying generated vs. non-generated examples.
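For reference, this joint optimization is usually written as a two-player minimax game. The formula below is the standard objective from the original GAN formulation (Goodfellow et al., 2014), not something specific to this post. The symbols are the conventional ones: $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the distribution of real examples, and $p_z$ is the noise distribution the generator samples from.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

When the generator’s output distribution matches the data distribution, the best possible discriminator outputs $D(x) = 1/2$ for every input, which is the “cannot outperform random choice” condition.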

In the context of human-in-the-loop processes, we can apply this same principle to produce quality labels on subjective tasks. Let’s say the pipeline we’ve chosen has a minimum of two annotators seeing each task: one attempter and one reviewer.

Example Task:

Instructions: Please describe the image as accurately and descriptively as possible with a short phrase.

Possible attempter responses:

  • Attempter 1: ‘asdf’
  • Attempter 2: ‘trees’
  • Attempter 3: ‘green hills’
  • Attempter 4: ‘grassy hills with mountains, trees, houses’
  • Attempter 5: ‘grassy hills with trees and 3 houses, mountains and clouds in background’
  • Attempter 6: ‘tall trees on green grassy hill, small red and gray houses, blue mountains in background’

The attempters act as generators, and any attempter could generate any one of the above responses for this task. Attempters 1-3 have written low-quality responses. On a regular Rapid project, these attempters would be disabled via evaluation tasks. However, on a generative task like the above, exact-match evaluation tasks would not guarantee quality, because there is no single correct answer: for this image, both Attempters 5 and 6 provided accurate and descriptive responses, yet their captions differ. Therefore, we turn instead to the discriminator to select the better responses.
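To see why exact-match grading breaks down here, consider a minimal Python sketch. The function `exact_match_score` and the choice of Attempter 5’s caption as the reference are assumptions made for illustration; they are not part of any Scale API.

```python
def exact_match_score(response: str, reference: str) -> bool:
    """Exact-match grading: correct only if the text equals the reference
    (ignoring case and surrounding whitespace)."""
    return response.strip().lower() == reference.strip().lower()


# Hypothetical reference answer: one of the two good captions, picked arbitrarily.
reference = "grassy hills with trees and 3 houses, mountains and clouds in background"

responses = {
    "Attempter 1": "asdf",
    "Attempter 5": "grassy hills with trees and 3 houses, mountains and clouds in background",
    "Attempter 6": "tall trees on green grassy hill, small red and gray houses, "
                   "blue mountains in background",
}

# Grade every response against the single reference caption.
results = {name: exact_match_score(text, reference) for name, text in responses.items()}
```

Under this grading, Attempter 6’s accurate caption receives the same failing result as Attempter 1’s ‘asdf’, so exact match cannot separate good responses from bad ones on a generative task.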

Our discriminators are reviewers. A reviewer is the final person to look at a task before it is returned to the customer, and is able to see the attempters’ responses. If we allowed reviewers to edit an attempter’s response directly or write a caption themselves, we would run into the same problem as with attempter evaluation tasks: with no single correct caption, there is nothing to grade the reviewer’s own writing against.

However, once we think of the reviewer as a GAN discriminator, the solution becomes clear. Reviewers must discriminate between good and bad responses, but they need not write any captions themselves. Although there is no 1:1 mapping between a task and a single correct answer, we do know upon encountering a caption whether it is acceptable or unacceptable. So we give the reviewers the choice of either accepting or rejecting an attempter’s response, but we do not give them the option to change it. Now the process of evaluating their quality is simple: create evaluation tasks with pre-generated responses that are expected to be accepted or rejected appropriately.
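A reviewer evaluation of this kind could be sketched as follows. This is an illustrative Python sketch, not Scale’s implementation: the `EvalTask` type, the `reviewer_accuracy` function, the task IDs, and the expected labels are all hypothetical, with the captions borrowed from the example task above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    caption: str           # pre-generated response shown to the reviewer
    expected_accept: bool  # True if the caption should be accepted


def reviewer_accuracy(decisions: dict[str, bool],
                      eval_tasks: list[EvalTask]) -> Optional[float]:
    """Fraction of answered evaluation tasks where the reviewer's
    accept/reject decision matches the expected one."""
    graded = [t for t in eval_tasks if t.task_id in decisions]
    if not graded:
        return None  # the reviewer has not seen any evaluation task yet
    correct = sum(decisions[t.task_id] == t.expected_accept for t in graded)
    return correct / len(graded)


EVAL_TASKS = [
    EvalTask("e1", "asdf", False),
    EvalTask("e2", "trees", False),
    EvalTask("e3", "grassy hills with trees and 3 houses, "
                   "mountains and clouds in background", True),
    EvalTask("e4", "tall trees on green grassy hill, small red and gray houses, "
                   "blue mountains in background", True),
]

# Decisions map task_id -> True (accept) or False (reject).
careful_reviewer = {"e1": False, "e2": False, "e3": True, "e4": True}
lenient_reviewer = {"e1": True, "e2": True, "e3": True, "e4": True}

careful_score = reviewer_accuracy(careful_reviewer, EVAL_TASKS)
lenient_score = reviewer_accuracy(lenient_reviewer, EVAL_TASKS)
```

Because each evaluation task has a binary expected outcome, the reviewer can be graded by simple agreement, even though the underlying captioning task has no single correct answer. A reviewer who accepts everything agrees on only half of these tasks.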

The two parts of a GAN, generator and discriminator, work together to pass information between them. In our human network, reviewers may provide feedback upon rejecting an attempt, to help the generator learn. The discriminator, composed of both the reviewers and the reviewer benchmarks managed by the customer, learns in turn from the generator. From a variety of generated data, the customer and in turn the reviewers are able to converge on a similar understanding of the desired result.

Now that we are sure the discriminator is accurately discriminating between low- and high-quality responses, we are able to train the generator. What this means is that attempters are now disabled based on how many of their responses are rejected. Over time, this leads to a state where the remaining attempters are effective, high-quality generators: essentially a trained GAN.
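One way to picture this filtering step is the Python sketch below. The post does not give an exact rule, so everything here is an assumption for illustration: the `attempters_to_disable` function, the 50% rejection threshold, and the minimum of three reviewed responses.

```python
from collections import Counter
from typing import Iterable


def attempters_to_disable(reviews: Iterable[tuple[str, bool]],
                          max_rejection_rate: float = 0.5,
                          min_reviewed: int = 3) -> list[str]:
    """Return attempters whose rejection rate exceeds the threshold.

    `reviews` holds (attempter_id, accepted) pairs, one per reviewed response.
    Attempters with fewer than `min_reviewed` responses are left alone,
    since there is too little evidence to judge them.
    """
    total: Counter = Counter()
    rejected: Counter = Counter()
    for attempter_id, accepted in reviews:
        total[attempter_id] += 1
        if not accepted:
            rejected[attempter_id] += 1
    return sorted(
        attempter_id
        for attempter_id, count in total.items()
        if count >= min_reviewed and rejected[attempter_id] / count > max_rejection_rate
    )


reviews = [
    ("a1", False), ("a1", False), ("a1", True),   # 2 of 3 rejected
    ("a5", True), ("a5", True), ("a5", False),    # 1 of 3 rejected
    ("a6", False),                                # only 1 response reviewed
]
disabled = attempters_to_disable(reviews)
```

Here only `a1` is disabled: `a5` stays under the threshold, and `a6` has not yet been reviewed often enough to judge. The minimum-review guard is a design choice that keeps one unlucky rejection from removing a new attempter.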

This pipeline is available via Rapid’s text collection and categorization projects. Get started with Scale Rapid today!
