When setting up a new machine learning (ML) system, collecting and annotating data can be a time-consuming process that requires a significant upfront investment. Since data quality is strongly correlated with ML model performance, it’s crucial to invest in high-quality data annotation from the outset of your project. The process of annotating data for an ML project can be surprisingly complicated: there are a variety of factors to consider, from how you’ll manage your annotation process to what tooling and automation approaches you’ll use.
How to Build Your Data Annotation Pipeline
Annotating data at high quality requires significant upfront and continued investment over the lifespan of your ML system. While there are automated solutions to speed up and streamline the data annotation process, keeping humans in the loop is key to producing high-quality results. When setting up a data annotation pipeline, your company will need to decide how you’ll find and train annotators, as well as how you’ll measure their performance. To ensure annotators produce high-quality annotations, you’ll need to provide them with performance incentives that align with your most important annotation metrics, whether that’s throughput, quality, or a combination of both. You’ll also need to provide annotation tools for your workers and design a pipeline that encourages your annotators to output consistent, high-quality annotations.
Hiring Annotators While Minimizing Overhead
Finding annotators to complete your task is one of the first challenges you’ll face in the data annotation process. You’ll need to identify workers who are well-suited to annotate your data, whether that involves selecting people who speak a certain language, are from a specific location, or have a particular type of expertise. You’ll also need to decide how to approach the hiring process. Your choice of workforce influences whether you have access to a large selection of annotators who work fewer hours, or a small selection of annotators who work longer hours and generate more consistent results. If you expect a continuous flow of production-quality data, you may also need to hire workers across time zones to meet the demands of your project.
You should decide whether to hire a larger pool of diverse annotators versus a smaller set of consistent annotators based on your use case. If you anticipate that you’ll need annotations for small, irregular projects, for example, you’ll likely want to rely on a larger selection of annotators that can accommodate short-term surges in need. Alternatively, if you’re pursuing a long-term production project with consistent annotation needs, a smaller team of annotators with regular annotation hours may deliver better results.
Once you’ve hired annotators, you’ll also need to devote resources to communication and expectation setting to ensure that your annotators provide you with high-quality results. In short, just finding workers to annotate your data and keeping them on your platform can be a time-consuming process with significant logistical overhead! These challenges are why many ML teams opt to outsource the entire process to an external company.
Training Annotators and Monitoring Performance
Once you’ve hired your annotators, your work is not yet over: you need to train them to accurately annotate your data across different use cases, task types, and annotation types. To ensure high-quality annotations, you’ll need to provide workers with clear and comprehensive instructions so that different workers can annotate data consistently. You may also find over time that there are edge cases within your data that your instructions do not cover, as it can be difficult to write unambiguous, comprehensive instructions that anticipate every edge case.
As the administrator of the data pipeline, you’ll need to create a “golden dataset” that has accurate annotations for a subset of your data. You can use this dataset to check that workers understand the instructions, as well as to periodically re-evaluate their performance against a gold standard. You should also regularly update this dataset with new examples to provide annotators with up-to-date examples that reflect real-world data.
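As a rough sketch of how this kind of check might work in practice (the data shapes, field names, and quality threshold below are illustrative assumptions, not from any specific platform), you could periodically score each annotator against the golden dataset and flag anyone who falls below your quality bar:

```python
from collections import defaultdict

def score_against_golden(annotations, golden_labels):
    """Compute each annotator's accuracy on tasks that have a golden label.

    annotations: list of dicts like {"annotator": "a1", "task_id": 7, "label": "cat"}
    golden_labels: dict mapping task_id -> trusted label
    """
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for ann in annotations:
        gold = golden_labels.get(ann["task_id"])
        if gold is None:
            continue  # task is not part of the golden dataset
        attempted[ann["annotator"]] += 1
        if ann["label"] == gold:
            correct[ann["annotator"]] += 1
    return {a: correct[a] / attempted[a] for a in attempted}

# Example: flag annotators whose golden-set accuracy drops below 95%
scores = score_against_golden(
    annotations=[
        {"annotator": "a1", "task_id": 1, "label": "cat"},
        {"annotator": "a1", "task_id": 2, "label": "dog"},
        {"annotator": "a2", "task_id": 1, "label": "dog"},
    ],
    golden_labels={1: "cat", 2: "dog"},
)
needs_retraining = [a for a, acc in scores.items() if acc < 0.95]
```

Running this kind of check on a rolling window of recent tasks, rather than once at onboarding, is what lets you catch annotators whose quality drifts over time.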
Suppose your goal is to achieve 95%+ quality in your overall dataset. If all your annotators consistently achieve 95% quality, you should be able to achieve this goal. However, if there is a particular edge case that only appears 1% of the time, your annotators can get 100% of that 1% wrong and still average 95% quality overall. While you are still hitting your goal of 95%+ quality, this is not ideal, as your model will not be able to account for this edge case if trained on this dataset.
To mitigate this challenge, you will need to add examples to your golden dataset to ensure that it remains well-balanced across all classes and edge cases. By updating and adding new examples to your dataset, you can effectively assess annotator performance over time, and achieve a holistic view of their performance across various classes and edge cases.
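A minimal way to see this in practice, again with made-up class names and the same illustrative data shapes as above, is to break golden-set accuracy down per class instead of relying on a single overall number:

```python
from collections import defaultdict

def accuracy_by_class(annotations, golden_labels):
    """Group golden-set accuracy by the true class so rare edge cases stand out."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ann in annotations:
        gold = golden_labels.get(ann["task_id"])
        if gold is None:
            continue  # not part of the golden dataset
        total[gold] += 1
        correct[gold] += int(ann["label"] == gold)
    return {cls: correct[cls] / total[cls] for cls in total}

# An annotator who misses every "occluded_pedestrian" example can still report
# ~99% overall accuracy if that class is only ~1% of the data, but the
# per-class breakdown makes that 0% slice immediately visible.
```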
Reviewing Annotations for Quality
After your annotators complete tasks, you’ll need to evaluate the final output. There are two main methods for evaluating annotation quality. For classification or text-based tasks where there is more subjectivity, you can implement a consensus algorithm. Suppose you want to develop a sentiment analysis model: you can have multiple people determine whether a person in an image looks happy or sad, and the option the majority selects wins. If there isn’t enough confidence in the voted result, you can have a higher-level reviewer determine the final result. In some cases, you may not want to weigh every person’s vote equally. If you have higher confidence in some people’s votes than others, you may want to implement a weighted voting system that biases the final result toward the votes of the people you trust more.
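One way such a weighted consensus could be implemented, sketched here with hypothetical trust weights and an arbitrary confidence threshold rather than any particular production algorithm, is to sum the weights behind each option and escalate to a reviewer when the winning margin is too small:

```python
from collections import defaultdict

def weighted_consensus(votes, weights, confidence_threshold=0.7):
    """Aggregate votes into a single label, or escalate when confidence is low.

    votes: dict mapping annotator id -> chosen label, e.g. {"a1": "happy", "a2": "sad"}
    weights: dict mapping annotator id -> trust weight (defaults to 1.0 if missing)
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        totals[label] += weights.get(annotator, 1.0)

    winner = max(totals, key=totals.get)
    confidence = totals[winner] / sum(totals.values())
    if confidence < confidence_threshold:
        return {"label": None, "escalate_to_reviewer": True, "confidence": confidence}
    return {"label": winner, "escalate_to_reviewer": False, "confidence": confidence}

# Example: a trusted annotator's vote counts for more than a newer annotator's,
# but here the winning margin is still small (~0.56), so the task escalates.
result = weighted_consensus(
    votes={"a1": "happy", "a2": "sad", "a3": "sad"},
    weights={"a1": 2.5, "a2": 1.0, "a3": 1.0},
)
```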
For image-based or other computer vision tasks, you can have one preliminary attempter annotate the task, then have a higher-level reviewer either accept the results or fix the annotations. You can select reviewers from past annotators who consistently produce high-quality results. Selecting reviewers, however, can be complicated, since someone’s annotation performance can fluctuate over time. You’ll need to continually identify reviewers who are no longer performing well so that you can demote them to attempters, while also identifying well-performing annotators so that you can promote them to reviewers. Since this approach removes high-quality annotators from the labeling pool, it will naturally reduce your throughput, so you’ll want to find a balance between quality and throughput.
Whatever approach you follow for evaluating label quality, you’ll need to be on the lookout for people who consistently underperform and label data poorly. You should automatically identify underperforming annotators, and either retrain them so they produce higher-quality labels, or remove them from the project entirely.
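As an illustration of how promotion, demotion, retraining, and removal decisions might be automated (the thresholds below are arbitrary examples, not recommendations), you could bucket annotators by their recent golden-set accuracy:

```python
def triage_annotators(recent_accuracy, promote_at=0.98, demote_below=0.95, remove_below=0.85):
    """Bucket annotators by rolling golden-set accuracy.

    recent_accuracy: dict mapping annotator id -> accuracy over a recent window,
    e.g. the output of score_against_golden() computed on the last N tasks.
    """
    decisions = {}
    for annotator, accuracy in recent_accuracy.items():
        if accuracy >= promote_at:
            decisions[annotator] = "candidate_for_reviewer"
        elif accuracy >= demote_below:
            decisions[annotator] = "keep_as_attempter"
        elif accuracy >= remove_below:
            decisions[annotator] = "retrain"
        else:
            decisions[annotator] = "remove_from_project"
    return decisions

# Example: {"a1": 0.99, "a2": 0.93, "a3": 0.80}
# -> {"a1": "candidate_for_reviewer", "a2": "retrain", "a3": "remove_from_project"}
```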
Selecting the Best Annotation Tool For Your Task
Another big decision you’ll need to make is what labeling tools your workers will use. You can either build your own tools or use available commercial tools. While in-house tools can be specialized for your specific task, they can be difficult and time-consuming to build, especially if they need to support different annotation types. Image, audio, and text data all require different annotation approaches. For image annotation, you’ll need a system to draw boxes, polygons, or other geometries around identified objects, whereas for audio annotation, your system will need to support transcription and associating timestamps with event attributes. For classification tasks, your system will need to permit multiple-choice selections and potentially large taxonomies. Video annotation adds its own challenges: video files can be very large, making them technically difficult to support, especially if your workers have less powerful laptops or slow internet connections.
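To make the differences between modalities concrete, here is an illustrative, deliberately simplified set of record shapes an annotation tool would need to support; the field names are assumptions for the sketch, not a standard schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BoundingBox:
    """Image annotation: a rectangle around an identified object."""
    label: str
    x: float       # top-left corner, in pixels
    y: float
    width: float
    height: float

@dataclass
class AudioSegment:
    """Audio annotation: a transcription tied to a time range."""
    transcript: str
    start_seconds: float
    end_seconds: float
    event_attributes: List[str]  # e.g. ["speaker_change", "background_noise"]

@dataclass
class ClassificationLabel:
    """Classification annotation: a choice from a (possibly large) taxonomy."""
    taxonomy_path: List[str]     # e.g. ["vehicle", "truck", "pickup"]

# A single project can require several of these shapes at once, which is part
# of what makes building a general-purpose in-house tool so time-consuming.
image_annotations = [BoundingBox(label="pedestrian", x=14, y=30, width=42, height=120)]
```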
Compared to in-house tooling, commercial data tools make it easier to jump into labeling your data, although they may be less specialized for your use case. Another consideration is ML-assisted tooling: some tools, like our Autosemseg tool, use ML assistance to improve annotation efficiency and speed, while others do not. Building or selecting the appropriate labeling tools is vital for optimizing labeler efficiency, so make sure the tools you choose are well-suited to your use case.
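ML-assisted tooling generally follows a pre-labeling pattern: a model proposes annotations and humans confirm or correct them. The sketch below shows only that generic pattern; it is not a depiction of how Autosemseg or any specific commercial tool works, and `model.predict` is a placeholder for whatever pre-labeling model you might have:

```python
def pre_annotate(tasks, model, confidence_threshold=0.9):
    """Generic ML-assisted labeling pattern: the model proposes labels, and
    low-confidence proposals are routed to human annotators for correction.

    `model.predict(task)` is assumed to return a (label, confidence) pair.
    """
    auto_accepted, needs_human_review = [], []
    for task in tasks:
        label, confidence = model.predict(task)
        proposal = {"task": task, "proposed_label": label, "confidence": confidence}
        if confidence >= confidence_threshold:
            auto_accepted.append(proposal)       # still worth spot-checking
        else:
            needs_human_review.append(proposal)  # annotator corrects the proposal
    return auto_accepted, needs_human_review
```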
What Next?
Needless to say, getting data annotation right can be a challenging and time-consuming process. You’ll need to consider a wide range of factors, from how you’ll manage your labeling workforce to what automation approaches you’ll use. Your data labeling pipeline is incredibly important to get right, since ML system performance is strongly correlated with data quality.
Instead of building a data pipeline from scratch, you can leverage a product like Scale Rapid or Scale Studio to quickly create production-quality labels. These products are designed so that you can set up your projects in minutes and get high-quality labels within hours. Trusted by AI teams like OpenAI, Square, and Pinterest, they can help you hit the ground running on your next ML project!