As ML applications grow in size and complexity, testing is essential to ensure the integrity and reliability of the application at hand. Given the non-deterministic nature of machine learning (ML) models, testing and continuous evaluation are even more important for ML systems.
That’s why today, we’re thrilled to announce Early Access to Scale Validate, our testing solution for ML models.
Using Scale Validate, you can now define data-centric tests for your ML models to avoid performance regressions during development. This step helps ensure that only the best ML models make it into production.
For ML-based products, model performance should correlate directly with positive business outcomes. Getting to a high-performing ML-based system is not trivial: it typically requires many iterations and is rarely ever truly finished. During these iterations, engineers need to identify new edge cases and improve model performance on them. With every new model iteration, ML engineers want to be sure that no performance regressions are introduced on mission-critical subsets of data.
While ML model testing is key to fast and robust model iteration, there are a few common issues that we identified with existing testing and evaluation pipelines for ML models:
Aggregate evaluation metrics miss critical regressions
A common way of validating ML models is to compute aggregate scores and metrics (e.g. mAP, precision, recall, IoU) across fixed test/validation datasets, commonly understood as model evaluation. Evaluation with aggregate metrics is necessary but not sufficient for robust models, because aggregates often miss failure modes that are underrepresented due to data biases. Without explicit checks on such failure modes, users cannot track regressions on cases that were previously fixed but are now failing again.
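To make this concrete, here is a small self-contained sketch (labels, predictions, and the "night" slice are all invented for illustration) of how two models can score identically in aggregate while one has silently regressed on an underrepresented slice:

```python
# Illustrative only: invented labels, predictions, and slice names.

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def slice_accuracy(preds, labels, slices, name):
    # Restrict the metric to the items belonging to one named slice.
    idx = [i for i, s in enumerate(slices) if s == name]
    return accuracy([preds[i] for i in idx], [labels[i] for i in idx])

labels    = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
slices    = ["day"] * 8 + ["night"] * 2      # "night" is underrepresented
old_preds = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # previous model
new_preds = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # candidate model

# Both models score 0.8 in aggregate ...
print(accuracy(old_preds, labels), accuracy(new_preds, labels))
# ... but the candidate silently regressed on the "night" slice.
print(slice_accuracy(old_preds, labels, slices, "night"))  # 1.0
print(slice_accuracy(new_preds, labels, slices, "night"))  # 0.0
```

An explicit per-slice check like `slice_accuracy` surfaces exactly the regression that the aggregate number hides.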
Coverage for ML model testing is not well defined
While we can count covered lines of code in software testing, there is no equivalent number that describes test coverage for ML models. Lines of code still need to be covered, but more importantly, tests need to cover the operational domain of the model’s input data. The high dimensionality of typical inputs (e.g. an image with more than a million pixels across 3 channels) makes input-fuzzing approaches neither realistic nor efficient.
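One pragmatic, admittedly simplified way to put a number on this, sketched below with invented slice names, is the fraction of declared operational-domain slices that have at least one scenario test:

```python
# A toy coverage notion (our own illustration, not a standard definition):
# the share of operational-domain slices covered by at least one test.

operational_domain = {"day", "night", "rain", "fog", "snow"}  # invented
tested_slices = {"day", "night", "rain"}                      # invented

coverage = len(operational_domain & tested_slices) / len(operational_domain)
print(f"scenario coverage: {coverage:.0%}")  # scenario coverage: 60%
```

This trades the intractable pixel-level input space for a small, human-defined set of scenarios that can actually be enumerated and covered.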
Testing pipelines are static and don’t evolve with the models
Building ML models is an iterative process, and so is testing them. Test and validation datasets that are never adapted produce vanity metrics, which in turn inflate confidence in the models. Continuously evolving test definitions and adding the trickiest examples is crucial to keep improving test coverage of the operational domain. This is the only reliable path to confidently deploying robust models to production systems.
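A sketch of that loop, with invented item IDs and scores: after each iteration, pull the examples the current model handles worst (here, lowest score on the true class) into the test set so the next iteration is held to a harder standard:

```python
# Illustrative only: item IDs, scores, and the selection size are invented.

def hardest_failures(items, k=2):
    # items: (item_id, model_score_for_true_class); lower score = harder.
    return [item_id for item_id, _ in sorted(items, key=lambda t: t[1])[:k]]

eval_pool = [("img_01", 0.95), ("img_02", 0.12), ("img_03", 0.88),
             ("img_04", 0.05), ("img_05", 0.60)]

test_set = {"img_00"}                    # existing test set
test_set.update(hardest_failures(eval_pool))
print(sorted(test_set))                  # ['img_00', 'img_02', 'img_04']
```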
Accordingly, we built Scale Validate with the following capabilities:
Setting up unit and integration tests is the go-to practice for ensuring correct behavior of codebases. For ML systems, unit tests verify the correctness of individual model components, which is necessary for reliable model performance. Scenario tests, in contrast, test the actual performance and decision-making capabilities of ML models on mission-critical scenarios in the operational domain. Scale Validate allows users to easily define scenario tests on subsets of data (example), using off-the-shelf or custom evaluation metrics. New models are automatically tested against the defined scenario tests, so well-defined scenario tests allow ML engineers to catch regressions before deploying a model to production. When a test fails, users immediately see in which scenario the required performance is no longer achieved (example).
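As a generic illustration of this idea (our own sketch, not the Scale Validate API; the scenario name, metric, and threshold are invented), a scenario test pairs a named data slice with a metric and a required threshold:

```python
# Generic sketch of a scenario test; all names and values are invented.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScenarioTest:
    name: str
    item_ids: list        # mission-critical subset of the dataset
    metric: Callable      # e.g. recall over the slice
    threshold: float

    def run(self, predictions, labels):
        # Evaluate the metric only on the items belonging to this scenario.
        score = self.metric({i: predictions[i] for i in self.item_ids},
                            {i: labels[i] for i in self.item_ids})
        return score >= self.threshold, score

def recall(preds, labels):
    tp = sum(1 for i in labels if labels[i] == 1 and preds[i] == 1)
    pos = sum(1 for i in labels if labels[i] == 1)
    return tp / pos if pos else 1.0

preds  = {"a": 1, "b": 0, "c": 1, "d": 1}
labels = {"a": 1, "b": 1, "c": 1, "d": 0}

test = ScenarioTest("pedestrians_at_night", ["a", "b", "c"], recall, 0.9)
passed, score = test.run(preds, labels)
print(passed, round(score, 2))  # False 0.67
```

A failing test immediately names the scenario ("pedestrians_at_night") and the score that fell below the bar, which is exactly the signal needed to block a regressing model from shipping.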
Model performance rarely boils down to a single metric. With Scale Validate, it’s trivial to compare the performance of new models against baseline models (example). You can use either our standard performance metric tracking across scenario tests or our side-by-side model comparison. All charts are fully interactive: users can start with high-level metrics and then drill down into item-level analysis.
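Conceptually, a side-by-side comparison boils down to a per-scenario delta against the baseline; the scenario names and scores below are invented:

```python
# Toy side-by-side comparison; scenario names and scores are invented.
baseline  = {"highway": 0.91, "rain": 0.78, "night": 0.85}
candidate = {"highway": 0.93, "rain": 0.80, "night": 0.79}

deltas = {s: round(candidate[s] - baseline[s], 2) for s in baseline}
for scenario, delta in deltas.items():
    flag = "REGRESSION" if delta < 0 else "ok"
    print(f"{scenario:8s} {baseline[scenario]:.2f} -> "
          f"{candidate[scenario]:.2f} ({delta:+.2f}) {flag}")
```

The aggregate improvements on "highway" and "rain" would look fine in a single summary number; only the per-scenario view flags the "night" regression.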
Scale Validate enables users to automatically identify coherent and underperforming slices of data that do not yet have test coverage. Users can be sure they are always testing against a representative test set.
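In spirit, this amounts to grouping items by metadata and surfacing slices that both underperform and lack a test, as in this toy sketch (the metadata values, correctness flags, and threshold are invented):

```python
# Toy slice discovery; metadata, correctness flags, and the 0.5
# threshold are invented for illustration.
from collections import defaultdict

items = [("fog", 1), ("fog", 0), ("fog", 0),   # (metadata, correct?)
         ("day", 1), ("day", 1)]
tested = {"day"}                               # slices that already have tests

by_slice = defaultdict(list)
for meta, correct in items:
    by_slice[meta].append(correct)

untested_weak = [s for s, v in by_slice.items()
                 if s not in tested and sum(v) / len(v) < 0.5]
print(untested_weak)  # ['fog']
```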
We built Validate as a developer-friendly solution for both API-first and UI-focused users. The Validate interfaces and endpoints make it as easy as possible to get started and to integrate Validate into existing pipelines. Today, Scale Validate powers the testing pipelines of leading ML teams in areas including autonomous driving, surveillance, and industrial robotics.
As Pickle Robot’s CTO, Ariana Eisenstein, explains:
Getting started with Validate only takes a few steps!
- Sign up / log into the Scale dashboard
- Explore Validate with public datasets
- Upload your data
- Upload models and annotations
- Compare your models in our dashboard
- Select interesting slices of data and define tests
Without a detailed model evaluation and testing pipeline, you’ll never know which performance gains you’re leaving on the table or which performance regressions might sneak into your models. Instead, ensure that your model quality only increases with every model and dataset iteration. Test-driven modeling will help you train a production-grade model faster! If you’d like more information about Scale Validate, reach out to us for a detailed demo and discussion here.