Any ML developer who has trained a model multiple times, e.g., with different hyperparameters, and tried to figure out which iteration is better is already familiar with the basics of model comparison. Before deploying a model to production, a developer needs to be confident that it will perform better than the previous model. However, this is easier said than done. Comparing models with the intent of figuring out which one is better is a complex task, and most ML developers take a rather rudimentary approach, which creates a significant risk of shipping the wrong model.
In this blog, we will go through the core challenges associated with model comparison and discuss strategies to mitigate them. Before we do, though, let’s take a quick look at the basic approach adopted by most ML developers:
Let’s say we have recently trained two versions of an object detection model with different learning rates and now want to find out which one is better. Here’s what that process looks like for most ML engineers:
Choose a standard aggregate metric, e.g., mAP (mean average precision), F1 score, precision, or recall
Apply non-max suppression (NMS) to prune overlapping candidate predictions, ideally leaving only the best one for each object
Match each of the resulting predictions with ground truth using intersection over union (IoU)
Set minimum IoU and confidence score thresholds to obtain class confusions
Using these confusions (True Positives, False Positives, and False Negatives), calculate each model’s aggregate metric
The candidate model with the highest aggregate metric score is deemed the best and is the one to deploy
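The matching and scoring steps above can be sketched in a few lines of Python. This is a simplified illustration, not a production evaluator: it handles a single class and a single image, uses greedy highest-confidence-first matching, and the box format and default thresholds are assumptions for the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truth, iou_thresh=0.5, conf_thresh=0.5):
    """predictions: list of (box, confidence); ground_truth: list of boxes.
    Greedy matching: each ground-truth box matches at most one prediction."""
    preds = sorted((p for p in predictions if p[1] >= conf_thresh),
                   key=lambda p: p[1], reverse=True)
    matched = set()
    tp = fp = 0
    for box, conf in preds:
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            score = iou(box, gt)
            if score > best_iou:
                best_iou, best_gt = score, i
        if best_iou >= iou_thresh:
            tp += 1                 # prediction matched a ground-truth box
            matched.add(best_gt)
        else:
            fp += 1                 # spurious or poorly localized prediction
    fn = len(ground_truth) - len(matched)  # missed ground-truth boxes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

From these per-threshold precision and recall values, aggregate metrics like a PR curve or mAP can then be built up by sweeping the thresholds.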
While such a one-dimensional, aggregate comparison is helpful thanks to its simplicity, on its own it is not sufficient to identify the best model. An aggregate approach obscures performance on critical rare cases by assuming all data is equally important. Finding the best model is a complex, multidimensional process and more often than not relies on the developer’s subjective judgment. Over the past few years, we have worked closely with many ML teams on this topic and have identified strategies to systematically find the best model. It all starts with choosing the right metric:
More often than not, ML engineers choose the standard metrics—for example, mAP (referenced above) for object detection. However, metric selection is one of the most important decisions when evaluating and comparing models. The metric must capture the nature of the application and the business expectations of the model. For example, if you are working on a cancer detection problem, you would want to detect as many cancer cases as possible, even if it means that in a few scenarios you classify a healthy person as having cancer. So “Recall” would be a great metric. However, if you are performing email spam detection, then it’s acceptable to miss a few spam emails but not okay to classify a legitimate email as spam. So “Precision” would be the metric of choice. Typically, you’ll need to make a trade-off between precision and recall. In these situations, tools like the PR (precision-recall) curve or F-scores can help better capture model performance. To get a feel for these metrics, try computing a sample metric on a slice of data in Scale Validate.
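A minimal sketch of how metric choice changes the verdict, using the cancer-screening example above. The confusion counts are made up for illustration:

```python
def precision(tp, fp):
    """Of everything flagged positive, what fraction was right?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, what fraction did we catch?"""
    return tp / (tp + fn)

# A screening model that catches 90 of 100 actual cancer cases,
# but also flags 40 healthy patients as positive:
tp, fp, fn = 90, 40, 10
print(recall(tp, fn))     # 0.9  -- strong for a cancer screener
print(precision(tp, fp))  # ~0.69 -- would be poor for a spam filter
```

The same confusion counts look acceptable under one metric and weak under the other, which is exactly why the metric has to be chosen from the application’s requirements rather than by default.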
An operating point is just another name for the input controls that determine confusion calculations. For object detection, the operating point is a combination of minimum IoU and confidence thresholds: if both thresholds are 0.5, the evaluation logic will only consider predictions with confidence above 0.5 that match ground truth at IoU above 0.5. You want to set these thresholds such that your model produces the fewest confusions and the highest performance, e.g., the highest F1 score. However, the extent to which you can lower or raise these thresholds depends on your application. For example, you don’t want a low IoU threshold for object detection, because that would allow large errors in the location and size of the detected objects. Therefore, you need to identify the application constraints first and then find the optimal operating point for each candidate model within those constraints. To find the optimal point, you can traverse the sensible range of each input control and look at the resulting performance metric. Once you have the optimal operating points, you can do an apples-to-apples comparison of the aggregate scores. Try varying the IoU and confidence thresholds of this sample model on Scale Nucleus and see the effects on the confusion matrix and PR curve.
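The threshold search described above can be sketched as a simple grid search. The `evaluate_fn` callback is a stand-in for whatever evaluation routine produces precision and recall at given thresholds, and `min_iou` encodes a hypothetical application constraint:

```python
import itertools

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def best_operating_point(evaluate_fn, iou_range, conf_range, min_iou=0.5):
    """Grid-search thresholds within application constraints.
    evaluate_fn(iou_t, conf_t) -> (precision, recall), assumed provided."""
    best = None
    for iou_t, conf_t in itertools.product(iou_range, conf_range):
        if iou_t < min_iou:
            continue  # application constraint: skip loose localization
        p, r = evaluate_fn(iou_t, conf_t)
        score = f1(p, r)
        if best is None or score > best[0]:
            best = (score, iou_t, conf_t)
    return best  # (best F1, IoU threshold, confidence threshold)
```

Run once per candidate model, this yields each model’s best achievable score within the same constraints, making the subsequent aggregate comparison fair.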
In most instances it is hard to capture model performance with a single metric. For example, for obstacle detection you don’t just want to detect an object; you also need to be very accurate about its location and size. Therefore, on top of an accuracy metric such as recall, you would also want to track average IoU to check the tightness of the predictions being produced. The challenge of doing model comparison with multiple metrics is that no single model may perform best on all of them. To tackle this problem, the simplest method is to do a weighted aggregation of these metrics and obtain a single score. This is exactly what the F-scores do for precision and recall. If you want to assign equal importance to both metrics you can compute the F1 score; otherwise you can use the F0.5 or F2 scores to weigh one metric more than the other. You can read more about F-beta scores here. Try model comparison over multiple metrics on Scale Validate.
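The standard F-beta formula makes the weighting concrete. In this sketch the precision and recall values are made up to show how the choice of beta shifts the verdict for a model with weak precision but perfect recall:

```python
def f_beta(p, r, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.5, 1.0                    # weak precision, perfect recall
print(f_beta(p, r))                # F1  ~0.67: equal weighting
print(f_beta(p, r, beta=0.5))      # F0.5 ~0.56: penalizes the weak precision
print(f_beta(p, r, beta=2))        # F2  ~0.83: rewards the strong recall
```

The same model scores anywhere from roughly 0.56 to 0.83 depending on the weighting, so the beta value itself is an application decision, just like the base metric.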
Real-world data is never equally distributed or equally relevant. Some subsets or slices are simply more important than others for the model in question, and more often than not these high-importance subsets are underrepresented and hence small in size. The way aggregate metrics are calculated, they assume all data is equally important and so obscure performance on rare but important samples. For example, a pedestrian detection model can have an mAP of 0.9, but that 0.9 can be very misleading if the examples it fails on are the ones where pedestrians are on a crosswalk! Similarly, when comparing two models, the aggregate performance might go up, but that doesn’t mean the new model is better. To address this, it is important to first identify these long-tail situations and curate small subsets which act as data unit tests, similar to unit tests for software. Models can then be compared on how well they perform over these unit tests in addition to the aggregate metric. For example, model B is only better than model A if it improves on the aggregate and doesn’t regress on any of the data unit tests. Read more about data unit tests here and try to compare models on specific unit tests in Scale Validate.
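That acceptance rule can be written down directly. This is a minimal sketch assuming per-slice metric scores have already been computed; the slice names and tolerance parameter are illustrative assumptions:

```python
def is_better(candidate, baseline, tolerance=0.0):
    """candidate/baseline: dicts mapping slice name -> metric score,
    with an 'aggregate' key for the overall score. The candidate wins
    only if it improves the aggregate AND regresses on no slice."""
    if candidate["aggregate"] <= baseline["aggregate"]:
        return False
    for slice_name, base_score in baseline.items():
        if slice_name == "aggregate":
            continue
        if candidate[slice_name] < base_score - tolerance:
            return False  # regression on a data unit test
    return True

model_a = {"aggregate": 0.90, "crosswalk_pedestrians": 0.80, "night": 0.70}
model_b = {"aggregate": 0.92, "crosswalk_pedestrians": 0.75, "night": 0.80}
print(is_better(model_b, model_a))  # False: better aggregate, but a
                                    # regression on crosswalk pedestrians
```

Even though model B’s aggregate score is higher, the regression on the crosswalk slice blocks it, which is exactly the failure mode an aggregate-only comparison would ship.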