Scale Rapid is Scale AI’s latest offering, which makes fast, high-quality annotation accessible to both individual developers and larger teams. In this post, we put Rapid to the test by annotating the public IMDB movie review dataset and building a model that can accurately determine the sentiment of new movie reviews.
To do so, we used Scale Rapid to label the data fed into the model. The ask for Rapid was: look through a set of movie reviews and categorize each review’s sentiment as positive, negative, or neutral. Of the ground-truth tasks, Scale Rapid correctly marked 93% of the positive ones and 96% of the negative ones.
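As a rough illustration of how per-class figures like these can be computed, here is a minimal sketch that compares returned labels against ground truth. The function name and the toy data are our own assumptions for illustration. They are not part of Rapid or of the actual experiment.

```python
from collections import defaultdict


def per_class_agreement(ground_truth, labels):
    """Return {class: fraction of ground-truth items of that class labeled correctly}."""
    if len(ground_truth) != len(labels):
        raise ValueError("ground_truth and labels must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, label in zip(ground_truth, labels):
        totals[truth] += 1
        if truth == label:
            correct[truth] += 1
    return {cls: correct[cls] / totals[cls] for cls in totals}


# Toy data invented for illustration only (not from the IMDB experiment).
truth = ["positive", "positive", "negative", "negative"]
returned = ["positive", "neutral", "negative", "negative"]
rates = per_class_agreement(truth, returned)
```

Reporting the rate per class, rather than one overall number, shows whether labelers struggle more with one sentiment than the other.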
The dataset we used is a public scrape of movie reviews on the IMDB site. To give you an idea of the varied standards of language, here are a few (abridged) excerpts from the dataset:
The reviews run the gamut from basic and straightforward, to misspelled and confusing, to advanced vocabulary and language, which adds complexity to the data. There are other quirks of language, too: labelers must identify sarcasm and hyperbole (for example, ‘this was the BEST movie I’ve EVER seen, especially because I wasted three hours of my life on it’).
We used categorical cross-entropy loss with the Adam optimizer.
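To make the loss concrete, here is a minimal sketch of categorical cross-entropy for a single three-class prediction, written in plain Python. The class order, the example scores, and the helper names are assumptions for illustration. The model itself and the Adam update step are not shown.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities, eps=1e-12):
    """Loss for one example: -sum(target_i * log(prob_i))."""
    return -sum(
        t * math.log(max(p, eps))  # clamp to avoid log(0)
        for t, p in zip(one_hot_target, probabilities)
    )


# Assumed class order: negative, neutral, positive.
logits = [0.2, 0.1, 2.3]   # hypothetical model scores for one review
probs = softmax(logits)
target = [0.0, 0.0, 1.0]   # the review is labeled positive
loss = categorical_cross_entropy(target, probs)
```

The loss shrinks toward zero as the model puts more probability on the labeled class. During training, an optimizer such as Adam adjusts the weights to reduce it.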
In the original dataset, each review is marked as having either “negative” or “positive” sentiment. But to make our model a little smarter, we’ll add a neutral option.
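One simple way to represent the three-way label for a model is a one-hot vector. The sketch below assumes a particular class order and a helper name of our own choosing. Neither comes from the original dataset or from Rapid.

```python
# Assumed class order; any fixed order works as long as it is used consistently.
CLASSES = ("negative", "neutral", "positive")
CLASS_INDEX = {name: i for i, name in enumerate(CLASSES)}


def one_hot(label):
    """Encode a sentiment label as a one-hot vector over CLASSES."""
    key = label.strip().lower()
    if key not in CLASS_INDEX:
        raise ValueError(f"unknown label: {label!r}")
    vector = [0.0] * len(CLASSES)
    vector[CLASS_INDEX[key]] = 1.0
    return vector


encoded = one_hot("Neutral")
```

Adding the neutral class only lengthens the vector from two entries to three. The rest of a classification pipeline can stay the same.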
There are two major mechanisms for ensuring quality on Rapid: the first is instructions, and the second is training and evaluation tasks. Rapid’s Taskers are trained for each specific task and are incredibly adept. To ensure the Taskers label data correctly, detailed and thorough instructions are critical. In addition to the initial training, Rapid’s automated quality systems continue to monitor each Tasker to ensure instructions are being followed.
To write the instructions, we can either start from scratch, or follow a pre-filled template. The following is an example of a filled-out Rapid template:
The purpose of this task is to describe sentiment. The tasks are reviews of movies. If the writer enjoyed the movie, mark "Positive". If not, mark "Negative". If the review seemed neutral, neither positive nor negative, mark "Neutral."
Label: Positive Example: I loved Toy Story! The visuals were compelling and my daughter didn't cry once, even though she cries during most movies.
Label: Negative Example: Toy Story was not very good. The characters were flat and I felt I overpaid for my movie ticket.
Label: Neutral Example: Toy Story was all right. It's a generic children's movie, not much there for the average viewer.
This was enough to get the first batch (a calibration batch) labeled. Calibration batches are small batches of a few tasks. Labelers can submit feedback on the instructions and flag points of confusion in these batches, so we can adjust our quality tasks and instructions to better teach what a good label looks like.
In this case, we had to reject one label, but all the other results were good, so we proceeded to upload a production batch.
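The calibration-then-production flow can be sketched as a simple split of the dataset: a small random sample goes out first, and the rest follows once the instructions look solid. The function name, the batch size of 10, and the placeholder review IDs are assumptions for illustration. The sketch does not call the Rapid API.

```python
import random


def split_calibration(items, calibration_size, seed=0):
    """Shuffle items and split them into (calibration batch, production batch)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:calibration_size], shuffled[calibration_size:]


# Placeholder IDs standing in for real review texts.
reviews = [f"review-{i}" for i in range(100)]
calibration, production = split_calibration(reviews, calibration_size=10)
```

Sampling at random, rather than taking the first few rows, makes it more likely that the calibration batch surfaces the same kinds of confusing reviews the production batch will contain.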
The neutral results were also great (although we haven’t audited all 800). Examples:
The labelers have not only done well identifying neutrality, but they’ve also done a great job of inferring that reviews that don’t address the movie (for example, the one above) are inherently neutral.
The model, using labeled data for both the training and testing sets, achieved an accuracy of 99.18% on the test set.
The following are a few examples of the correctly predicted neutral labels from the test set:
Conclusion: Scale Rapid is a quick, effective way to label complex tasks for machine learning.