Build vs. Buy: What Makes Sense for Data Annotation?

August 15, 2022

All companies, from the earliest startups to the largest Fortune 100 corporations, face a decision as they scale up their artificial intelligence (AI) and machine learning (ML) efforts: How much of it should be built in-house? 

That’s a particularly thorny question when it comes to data annotation. It’s important to get the approach right from the start; studies have shown that an abundance of high-quality annotated data is typically the most significant driver of model performance. But how it’s done, across all steps of the data process — from selecting a data annotation platform, to writing high-quality instructions and taxonomies, and doing the annotation itself — must be based on what works best for each unique organization. 

Here we explore what to keep in mind — the benefits and hidden pitfalls at each step of the process — as you try to solve your own data annotation build vs. buy equation.

Selecting a Data Annotation Platform

Choosing the right data annotation platform from the beginning of the process can minimize complications arising from changing approaches or platforms in the middle of a project.

As you think about a software “home” for your data annotation activities, there are a few key capabilities you should consider: 

  • How easy is it to set up a new project or pipeline for data annotation?
  • How quickly and easily can annotators add high-quality, accurate annotations to your data? 
  • How simple is the process to add annotators to the platform? 
  • Can you analyze the annotation workforce’s performance across dimensions such as throughput, quality, and time spent? 
  • What are the guardrails or in-product capabilities to ensure the data annotation output is as high-quality as possible? 
  • Can you import your model output as “hypothesis annotations” to give your annotators a head-start on their work?
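
To make the workforce-analysis question above more concrete, here is a minimal sketch of how throughput, quality, and time spent might be summarized per annotator from a task log. The record fields (`annotator`, `seconds_spent`, `passed_review`) are hypothetical and not taken from any particular platform; a real export will likely look different, and review pass rate is only one possible proxy for quality.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One completed annotation task (hypothetical schema)."""
    annotator: str
    seconds_spent: float
    passed_review: bool  # True if a reviewer accepted the annotation


def summarize(records):
    """Return per-annotator task count, hours, throughput, and review pass rate."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.annotator].append(record)

    summary = {}
    for annotator, tasks in grouped.items():
        hours = sum(t.seconds_spent for t in tasks) / 3600
        summary[annotator] = {
            "tasks": len(tasks),
            "hours_spent": hours,
            # Tasks per hour; None when no time was logged.
            "tasks_per_hour": len(tasks) / hours if hours > 0 else None,
            # Share of tasks accepted in review, used here as a quality proxy.
            "pass_rate": sum(t.passed_review for t in tasks) / len(tasks),
        }
    return summary


# Illustrative data only.
records = [
    TaskRecord("ana", 1800, True),
    TaskRecord("ana", 1800, False),
    TaskRecord("ben", 1200, True),
]
report = summarize(records)
```

Whether you build or buy, it is worth checking that the platform can produce at least this level of breakdown without manual spreadsheet work.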

Weighing these capabilities can help guide your decision on whether to build a platform in-house or purchase a software solution. Once you’ve established your criteria in each of these dimensions, the build-vs.-buy equation comes down to a few ROI considerations:

  1. Customizability: If you’re looking for a platform tailor-made to suit your specific annotation needs, there will be no platform that is more customized than one that you and your team build yourself. This is doubly true if you have annotation needs that are non-standard and have minimal (if any) support from annotation platforms on the market today. The fact of the matter is, commercial tooling caters to the most common use cases, workflows, and annotation types, which may suit the majority of situations, but may not work for you. Some people customize an open source solution (such as CVAT) in order to suit their needs, and others build their own from scratch. 
  2. Predictability: Are your AI/ML efforts predictable over the next few years, or is it likely that you’ll experiment with different data types or annotation volumes? If you expect the scope of any AI/ML project to evolve rapidly, buying a commercial platform can be a prescient choice, since you’ll be able to spin up new data types or annotation types more quickly, provided that they’re supported. If you have a sense of data types you’ll likely experiment with in the future, it’s smart to check if they are supported in whatever annotation platform you consider.
  3. Time to market: If speed to production is an important consideration, buying an annotation platform will almost always get you there faster than building one in-house. The one exception is if your data and annotation type are simple, and supported by an open-source solution out-of-the-box. Otherwise, building a custom in-house platform can take months to get the first project in production. That’s an opportunity cost that is often neglected, but which may pose one of the greatest implicit expenses (or loss of revenue) for your organization.
  4. Resources for maintenance: Building an in-house solution will require engineers to maintain and update it to accommodate new experiments or pipelines, and keep up with new standards and approaches. That said, even if you choose a commercial platform, you’ll still likely need to budget some engineering time to set up your pipelines – albeit less than if you’re starting from scratch – especially if you are utilizing API capabilities to import data, create projects, and/or export data. 

In your overall consideration, keep an eye on both short-term and long-term costs. For some smaller organizations, building an in-house tool or customizing an open-source solution “feels” cheaper because there’s no explicit cash outlay. However, you should consider whether the time and expense involved in establishing and maintaining your own data annotation tool is worth it. Future budget demands are more likely to go toward developing and training models than toward maintaining your in-house data annotation platform.
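
One rough way to compare short-term and long-term costs is to model each option as a one-time cost plus a recurring monthly cost and look for the month where the two cumulative totals cross. The sketch below does only that. Every figure in it is made up for illustration, and the function names are hypothetical; substitute your own estimates, and consider adding the opportunity cost of a slower launch to the build side.

```python
def cumulative_cost(upfront, monthly, months):
    """Total spend after `months`, given a one-time and a recurring cost."""
    return upfront + monthly * months


def break_even_month(build_upfront, build_monthly,
                     buy_upfront, buy_monthly, horizon=120):
    """First month at which building costs no more than buying.

    Returns None if building never catches up within `horizon` months.
    """
    for month in range(1, horizon + 1):
        build = cumulative_cost(build_upfront, build_monthly, month)
        buy = cumulative_cost(buy_upfront, buy_monthly, month)
        if build <= buy:
            return month
    return None


# Illustrative numbers only; none of these come from real pricing.
BUILD_UPFRONT = 300_000  # engineering time to get the first project into production
BUILD_MONTHLY = 10_000   # ongoing maintenance by in-house engineers
BUY_UPFRONT = 20_000     # pipeline and API integration work
BUY_MONTHLY = 20_000     # licence fees

month = break_even_month(BUILD_UPFRONT, BUILD_MONTHLY, BUY_UPFRONT, BUY_MONTHLY)
# With the numbers above, the two options cost the same at month 28.
```

A break-even point several years out, or none at all, suggests that the "no cash outlay" feeling of building in-house is misleading for your situation.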

For the largest organizations with custom annotation needs, the calculus becomes more nuanced. In these cases, where the team has a need for steady, large-scale annotation, having a team of engineers dedicated to building and maintaining an in-house platform could start to make financial sense. That said, if you anticipate annotation needs evolving and quick time-to-market is important, be sure to still consider a commercial solution – at least initially, with an eye towards a potential eventual transition to in-house.

Developing High-Quality Instructions & Taxonomies

In addition to selecting the right platform, you will also need high-quality instructions and taxonomies if you want to train high-quality models. Whether you use your in-house annotation team or outsource to a third party, explaining exactly what you want annotated (along with examples of what to do vs. not do) is critical to getting high-quality data back. 

This is work that it makes sense to keep in-house as much as possible. As the team developing the models, you and your team will have the best, most intimate knowledge of exactly what you need annotated. By keeping this in-house, you’ll have a tight feedback loop as you scale up. You’d be surprised at how certain instructions are interpreted. By having your ML team stay tightly involved in workshopping the instructions, providing examples, and adjusting the taxonomies, you can be confident that you’re getting the best possible data back.
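
One way to keep that feedback loop tight is to hold the taxonomy as structured data rather than only as a written document, so that gaps in the instructions and unexpected labels in returned data can be caught automatically. The sketch below is one possible shape for this, not a prescribed format: the `LabelDefinition` class, the example labels, and the `label` key on each annotation are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LabelDefinition:
    """One entry in the taxonomy, with examples of what to do and not do."""
    name: str
    description: str
    do_examples: list = field(default_factory=list)
    dont_examples: list = field(default_factory=list)


def labels_missing_examples(taxonomy):
    """Names of labels that lack either a 'do' or a 'don't' example."""
    return [
        label.name
        for label in taxonomy
        if not label.do_examples or not label.dont_examples
    ]


def unknown_labels(taxonomy, annotations):
    """Labels used in returned annotations that the taxonomy does not define."""
    allowed = {label.name for label in taxonomy}
    return sorted({a["label"] for a in annotations} - allowed)


# Illustrative taxonomy and returned data only.
taxonomy = [
    LabelDefinition(
        name="pedestrian",
        description="A person on foot.",
        do_examples=["person crossing the street"],
        dont_examples=["person riding a bicycle"],
    ),
    LabelDefinition(name="cyclist", description="A person riding a bicycle."),
]
returned = [{"label": "pedestrian"}, {"label": "scooter"}]

gaps = labels_missing_examples(taxonomy)
surprises = unknown_labels(taxonomy, returned)
```

Labels flagged by either check are good candidates for the next round of instruction workshopping.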

Choosing a Workforce for Data Annotation

Most teams start annotating data with internal employees who are already familiar with the data and who can start the process part-time. But as the data volume ramps up, be prepared either to recruit new full-time members to the annotation team or to outsource the work entirely to a third party.

Things to consider when making this decision:

  • Cost: Paying the annotation team is not the only consideration here. For a realistic comparison against the cost of outsourcing data annotation, also factor in the time and effort involved in managing an in-house team.
  • Flexibility: Will your data annotation needs ramp up and down quickly? Some third-party providers are prepared to handle such surges with minimal disruptions. An in-house team, by contrast, must be kept appropriately occupied during a scaleback while staying ready to jump back in when volume returns.
  • Subject matter expertise: While some third-party providers offer teams with specific expertise, most do not. If your annotation work is highly specialized, it may be best to recruit and train your own data annotation experts in-house.

Some organizations choose to utilize a blend of in-house as well as external data annotators. One common workflow is using the external team to do the ‘first attempt’ of annotation, while the internal team reviews the attempts and provides an additional layer of quality assurance. Others utilize the external team to annotate steady production pipelines, and the internal team to work on data for short, bursty experiments.
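
The "external first attempt, internal review" workflow can be sketched as a sampling step followed by an acceptance-rate check. This is a minimal illustration under assumed names: the task IDs, the review fraction, and both functions are hypothetical, and a real setup might instead review every task or weight the sample toward harder data.

```python
import random


def select_for_review(task_ids, review_fraction, seed=0):
    """Pick a reproducible random subset of externally annotated tasks
    for the internal team to review."""
    if not 0 < review_fraction <= 1:
        raise ValueError("review_fraction must be in (0, 1]")
    if not task_ids:
        return []
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    sample_size = max(1, round(len(task_ids) * review_fraction))
    return sorted(rng.sample(list(task_ids), sample_size))


def acceptance_rate(review_results):
    """Share of reviewed tasks the internal team accepted without changes.

    `review_results` maps task ID to True (accepted) or False (rejected).
    """
    if not review_results:
        raise ValueError("no reviewed tasks")
    return sum(review_results.values()) / len(review_results)


# Illustrative data only.
task_ids = [f"task-{i}" for i in range(200)]
to_review = select_for_review(task_ids, review_fraction=0.1)

# Pretend the internal team accepted everything except the first sampled task.
review_results = {task_id: True for task_id in to_review}
review_results[to_review[0]] = False
rate = acceptance_rate(review_results)
```

Tracking this rate over time gives the internal team a simple signal for when the instructions, or the external team's training, need another pass.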

Conclusion: Buy Some, Customize Others

Ultimately, answering the build vs. buy question doesn’t have to be an all-or-nothing proposition. If your data annotation needs are particularly custom, your budget for in-house development relatively unrestricted, and any future needs relatively predictable, you may be able to develop a great annotation solution in-house. For other teams, commercial software is likely a good choice: it can be deployed rapidly, and you can choose how and when to use its various parts. Depending on your needs and scope, third-party providers can work with your in-house teams to build or modify features and extensions as your AI/ML projects evolve. 

If you’ve considered your options and decided that you’d like to try an annotation platform built by annotation experts, Scale Studio offers best-in-class annotation infrastructure to accelerate your in-house annotation team. Studio also offers a free tier to familiarize yourself with the platform and start annotating data. If you’ve decided you’d like to offload more parts of the data annotation pipeline, Scale Rapid provides an easy way to send your data to be annotated by Scale and get production-quality labels with no minimums.
