Aggregate metrics are a useful baseline, but as enterprises develop a more robust modeling practice, tools such as “shadow deployments,” ensembling, A/B testing, and even scenario tests can help validate models in challenging edge-case scenarios or rare classes. Here we provide the following definitions:

• A/B testing: comparing one model to another in training, evaluation, or production. Model A and Model B might differ in architecture, training dataset size, hyperparameters, or some other factor.

• Shadow deployment: the simultaneous deployment of two different models, where the first delivers results to both the customer and the developers, and the second delivers results only to the developers for evaluation and comparison.

• Ensembling: the deployment of multiple models in an “ensemble,” often combined through conditionals or performance-based weighting.
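
To make the last two definitions concrete, the following is a minimal Python sketch, not a production implementation. It assumes a model is simply a callable that maps a feature vector to a numeric score. The names `shadow_serve`, `weighted_ensemble`, `model_a`, and `model_b` are hypothetical, and the weights are illustrative stand-ins for values that would normally come from measured performance.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a "model" is any callable mapping a feature vector to a score.
Model = Callable[[Sequence[float]], float]


def shadow_serve(primary: Model, shadow: Model, features: Sequence[float],
                 log: List[Tuple[float, float]]) -> float:
    """Return the primary model's result; record both results for developers."""
    primary_out = primary(features)
    try:
        shadow_out = shadow(features)
        log.append((primary_out, shadow_out))  # seen by developers only
    except Exception:
        # A failing shadow model must never affect the customer-facing result.
        pass
    return primary_out


def weighted_ensemble(models: Dict[str, Model], weights: Dict[str, float],
                      features: Sequence[float]) -> float:
    """Combine model outputs as a weighted average (weights are normalized)."""
    total = sum(weights[name] for name in models)
    combined = sum(weights[name] * model(features)
                   for name, model in models.items())
    return combined / total


# Hypothetical stand-in models for illustration.
def model_a(x: Sequence[float]) -> float:
    return sum(x)


def model_b(x: Sequence[float]) -> float:
    return 2 * sum(x)


comparison_log: List[Tuple[float, float]] = []
served = shadow_serve(model_a, model_b, [1.0, 2.0], comparison_log)
blended = weighted_ensemble({"a": model_a, "b": model_b},
                            {"a": 0.75, "b": 0.25}, [1.0, 2.0])
```

In this sketch the customer only ever receives `served`, while `comparison_log` accumulates paired outputs that developers can compare offline. The ensemble here uses weighting only; a conditional ensemble would instead select a model based on properties of the input.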

Small, agile teams at smaller companies typically have less sophisticated validation, testing, and deployment strategies than teams at large enterprises, yet they may find failure modes, problems in their models, or problems in their data earlier. With simpler models solving more uniform problems for customers and clients, failures are easier to spot. As a system grows to serve a large market or to involve a large engineering staff, complexity and technical debt grow with it. At scale, scenario tests become essential, and even then issues may take longer to detect in a more complex system.
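
One way to picture a scenario test is to score the model separately on each named scenario rather than on the pooled data. The Python sketch below assumes each evaluation record carries a scenario tag. The function names, the scenario names, the 0.9 threshold, and the synthetic records are all hypothetical and chosen only to show how an aggregate metric can hide a rare-class failure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: each record is (scenario_name, true_label, predicted_label).
Record = Tuple[str, int, int]


def accuracy_by_scenario(records: Iterable[Record]) -> Dict[str, float]:
    """Compute accuracy separately for each scenario tag."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for scenario, label, prediction in records:
        total[scenario] += 1
        correct[scenario] += int(label == prediction)
    return {name: correct[name] / total[name] for name in total}


def failing_scenarios(per_scenario: Dict[str, float],
                      threshold: float) -> List[str]:
    """List scenarios whose accuracy falls below the required threshold."""
    return sorted(name for name, acc in per_scenario.items() if acc < threshold)


# Synthetic data for illustration: 95 easy cases, 5 rare-class cases.
records: List[Record] = (
    [("common", 1, 1)] * 95      # all predicted correctly
    + [("rare", 1, 0)] * 4       # rare class, mispredicted
    + [("rare", 1, 1)]           # rare class, predicted correctly
)

overall = sum(label == pred for _, label, pred in records) / len(records)
per_scenario = accuracy_by_scenario(records)
failures = failing_scenarios(per_scenario, threshold=0.9)
```

On this synthetic data the aggregate accuracy looks healthy, while the per-scenario breakdown flags the rare scenario as failing.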