What is Training Data?

In traditional software development, developers explicitly specify the instructions a computer follows to produce outputs from inputs. With machine learning (ML), we instead give the model examples (pairs of inputs and outputs), and the model learns to produce the same outputs from the corresponding inputs. These examples are called Training Data.
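
As a minimal sketch of this difference, the Python snippet below contrasts a hand-written rule with a threshold learned from labeled examples. The task (deciding whether a temperature counts as "hot"), the function names, and the data are all hypothetical and chosen only for illustration; real models are far more complex than a single threshold.

```python
from typing import List, Tuple

# Traditional approach: the developer writes the rule explicitly.
def is_hot_rule(temp_c: float) -> bool:
    return temp_c >= 30.0

# ML-style approach: choose the threshold that best reproduces the examples.
def fit_threshold(examples: List[Tuple[float, bool]]) -> float:
    def errors(threshold: float) -> int:
        # Count examples whose label disagrees with the thresholded prediction.
        return sum((temp >= threshold) != label for temp, label in examples)

    candidates = sorted(temp for temp, _ in examples)
    return min(candidates, key=errors)

# Hypothetical training data: (input temperature in Celsius, output label).
training_data = [
    (12.0, False), (18.5, False), (24.0, False),
    (29.0, True), (33.0, True), (38.5, True),
]

threshold = fit_threshold(training_data)

def is_hot_learned(temp_c: float) -> bool:
    return temp_c >= threshold
```

Nobody wrote the learned threshold down; it comes entirely from the examples, so mislabeled or unrepresentative examples would shift it.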

How do I get good Training Data?

The quality of training data has a large impact on ML model performance. To get the most out of your models, be sure to:

  • Annotate for Specific End Applications

    When developing training data, it's not enough to collect vast amounts of data. The data must also be annotated with an end application in mind (e.g. training data for self-driving cars versus training data for retail).

    This is important because you must understand how mistakes, inconsistencies, and errors in your data interact with the implicit assumptions of your model, and decide how to trade off different kinds of error (false positives versus false negatives, outliers).

    Scale supports a standard group of labels (e.g. pedestrians, cyclists, cars), and labels can also be customized to support any use case.

  • Handle Rare Cases

    There are two types of rare cases: expected and unexpected.

    1. Expected: Say you are developing training data for autonomous vehicles, and an emergency vehicle shows up 1% of the time. This is a rare case, but it should be expected: emergency vehicles are a known type of vehicle that is occasionally seen on the road.
    2. Unexpected: Say again you are developing training data for autonomous vehicles, and your city's streets have recently been flooded by last-mile electric scooters. The appearance of these scooters is sudden and was not expected when you began creating your dataset.

    Models pay attention to scenarios in proportion to how often they show up. Rare cases, however, are disproportionately important and should be intentionally over-represented in training data. This is best done by upsampling the raw data to balance the dataset.

    Our taskers are highly trained and specialized (particularly for autonomous vehicles and robotics) and will escalate rare cases to better balance datasets.

  • Balance Bias against Variance (Random / Systematic Error)

    1. Variance (Random Error) consists of statistical fluctuations (in either direction) in data due to the precision limitations of devices. Variance in training data is not ideal, but it can be evaluated through statistical analysis and overcome with a lot of training data.
    2. Bias (Systematic Error), on the other hand, consists of reproducible inaccuracies that are consistently in the same direction. Bias is much harder to detect and cannot be overcome simply by adding more training data.

    Our platform will correct for systematic errors.

  • Update Training Data

    As the unexpected rare cases above illustrate, real-world systems are constantly evolving. Models can adapt to these changes as long as they are provided with updated data. To keep models performing well, training datasets therefore need to be continuously updated to cover new scenarios.
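
The upsampling described under "Handle Rare Cases" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `upsample_rare`, the frame names, and the 99-to-1 split (mirroring the 1% emergency-vehicle example) are hypothetical, and random duplication is just one simple way to balance a dataset.

```python
import random
from collections import Counter

def upsample_rare(samples, label_of, rng):
    """Duplicate randomly chosen samples of rarer labels until every
    label appears as often as the most common one."""
    by_label = {}
    for sample in samples:
        by_label.setdefault(label_of(sample), []).append(sample)

    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same group.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical raw data: 99 frames with cars, 1 with an emergency vehicle.
raw = [("frame_%03d" % i, "car") for i in range(99)]
raw.append(("frame_099", "emergency_vehicle"))

balanced = upsample_rare(raw, label_of=lambda s: s[1], rng=random.Random(0))
counts = Counter(label for _, label in balanced)
print(counts)
```

Duplicating samples changes how often the model sees a rare case but adds no new information, so collecting and annotating more real examples of the rare case remains preferable where possible.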
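
The difference between random and systematic error described under "Balance Bias against Variance" can be illustrated with a small simulation. The sketch below assumes a hypothetical sensor measuring a known true value with Gaussian noise and an optional constant offset; the names and numbers are illustrative only.

```python
import random
import statistics

TRUE_VALUE = 10.0  # hypothetical ground truth, e.g. a distance in metres

def mean_error(n, noise_std, offset, seed=0):
    """Average n simulated measurements and return the error of that average.

    noise_std models random error (variance); offset models a
    systematic error (bias) that always pushes in the same direction.
    """
    rng = random.Random(seed)
    measurements = [
        TRUE_VALUE + offset + rng.gauss(0.0, noise_std) for _ in range(n)
    ]
    return statistics.fmean(measurements) - TRUE_VALUE

for n in (10, 1_000, 100_000):
    random_only = mean_error(n, noise_std=1.0, offset=0.0)
    with_bias = mean_error(n, noise_std=1.0, offset=0.5)
    print(f"n={n:>6}  random only: {random_only:+.3f}  with bias: {with_bias:+.3f}")
```

As the number of measurements grows, the error from random noise alone shrinks toward zero, while the biased measurements settle near the offset of 0.5 no matter how much data is added.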

Where Does Scale Fit into This?

Developing high-quality training data is a challenging problem. Certain annotation types (e.g. LiDAR annotation) are complex and difficult to do without developing specialized tooling and making significant investments in training people. Scale's suite of annotation tasks supports a wide variety of endpoints and allows ML engineers to focus on the more impactful and differentiated work of developing models rather than worrying about how to get their data annotated.

For more on specific use cases, take a look at our: