
Ensuring Robustness: 6 Tips for Testing Machine Learning Models in Safety-Critical Scenarios

Over the past few years, machine learning has continued its steady march into the mainstream, with companies using it to develop everything from recommendation engines to computer-vision components for self-driving cars. While machine learning holds great promise, it also poses new risks, especially when used in safety-critical scenarios such as self-driving cars or medical diagnosis. Building machine learning models for safety-critical scenarios is considerably more challenging than building regular models. In this article, we will explore why this is the case and provide actionable tips on how to test and evaluate models for safety-critical use cases.

The high stakes of safety-critical use cases

The biggest challenge in building machine learning models for safety-critical scenarios lies in the severe consequences of any mistake made by the model. If a self-driving car misinterprets a stop sign as a yield sign, for instance, the result could be a life-threatening accident. In contrast, if a recommendation system suggests an unsatisfactory movie, the worst outcome is minor disappointment and some potential boredom. The high stakes involved in safety-critical scenarios mean we must conduct a much more rigorous testing and evaluation process.

Modern car AI evaluating driving conditions

Evaluating machine learning models: beyond test-set performance

Standard machine learning evaluation best practice has us split our dataset into three distinct subsets (a minimal splitting sketch follows the list below):

  1. the first subset is used for training the model,
  2. the second for fine-tuning and validation,
  3. and the third, referred to as the test set, for assessing the model's expected performance in real-world scenarios.
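
As an illustration only, such a three-way split can be produced with two calls to scikit-learn's train_test_split. The random data, the 70/15/15 proportions, and the variable names below are placeholder assumptions, not part of any specific pipeline.

```python
# Minimal three-way split sketch; the random data and 70/15/15 proportions
# are illustrative stand-ins for a real dataset and project-specific ratios.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 16)             # placeholder features
y = np.random.randint(0, 3, size=1000)   # placeholder labels (3 classes)

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest
)
# X_train / y_train: training
# X_val   / y_val:   fine-tuning and validation
# X_test  / y_test:  held out once, to estimate real-world performance
```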

However, it is important to recognize that good performance on the test set does not necessarily guarantee good performance in the real world. Two important assumptions underlie that expectation, and it is crucial to understand them. First, test-set performance reflects the model's expected performance in the real world only if the data distribution in both settings is the same. Second, the model's real-world performance will resemble its test-set performance only on the specific metrics we have chosen to measure.

These assumptions simply do not hold in the real world. In practice, production data is rarely drawn from the same distribution as the test data. Furthermore, test-set metrics can be too broad, failing to capture important edge cases and data slices that are critical to the specific business case, and they may not align with the actual business requirements at all. Lastly, relying solely on test-set evaluation tells us nothing about the model's security and vulnerabilities from an adversarial machine learning perspective, which is a critical consideration for safety-critical scenarios.

Comprehensive evaluation techniques for safety-critical machine learning models 

This highlights the need for more comprehensive evaluation techniques. Here are some practical tips for conducting more rigorous testing and validation of safety-critical machine learning models. 

1. Set clear performance metrics 

Before testing, it is essential to define clear and relevant performance metrics that align with the model's intended use case. These metrics should be measurable and quantitative, and should reflect the key performance indicators the model is expected to deliver in the real world. For example, if the model is used for recognizing traffic signs, the performance metric could be the percentage of correctly recognized signs, along with measures of false negatives and false positives.
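
As a rough illustration of how such metrics become concrete numbers, the sketch below computes overall accuracy plus false negatives and false positives for one class of interest. The hard-coded labels, predictions, and class names are made up for the example.

```python
# Illustrative only: overall accuracy plus false negatives / false positives
# for the "stop" class; the labels and predictions are made-up stand-ins.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array(["stop", "yield", "stop", "speed_30", "stop", "yield"])
y_pred = np.array(["stop", "yield", "yield", "speed_30", "stop", "stop"])

labels = ["stop", "yield", "speed_30"]
accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)

stop = labels.index("stop")
false_negatives = cm[stop].sum() - cm[stop, stop]      # real stop signs missed
false_positives = cm[:, stop].sum() - cm[stop, stop]   # other signs read as stop

print(f"accuracy={accuracy:.1%}, stop FN={false_negatives}, stop FP={false_positives}")
```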

2. Use multiple evaluation metrics 

Rather than relying on a single metric, it is important to use multiple evaluation metrics that together measure and reflect the model's intended use cases. For automotive and self-driving use cases, additional metrics worth considering include the reaction time of the model, its ability to detect and respond to unexpected obstacles or road conditions, and its performance in complex scenarios such as intersections or highway merging.
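
One way to report several metrics side by side is sketched below: accuracy together with per-frame reaction time (latency). The `model.predict` interface and the percentile choice are assumptions made for the example, not a specific framework's API.

```python
# Sketch: report accuracy alongside reaction time for any object exposing a
# predict() method; the interface and the 95th-percentile choice are assumptions.
import time
import numpy as np

def multi_metric_report(model, frames, labels):
    latencies, correct = [], 0
    for frame, label in zip(frames, labels):
        start = time.perf_counter()
        prediction = model.predict(frame)
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == label)
    return {
        "accuracy": correct / len(labels),
        "p95_latency_s": float(np.percentile(latencies, 95)),
        "max_latency_s": max(latencies),
    }
```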

3. Test under a variety of scenarios and conditions 

Once you have established your performance metrics, it's important to rigorously test your model against them under a variety of scenarios and conditions, including both normal and edge cases. Edge cases are scenarios that fall outside the typical training data and require the model to make predictions based on extrapolation.

Testing the model’s performance in edge cases is critical to ensure that it can handle unexpected situations in the real world. It’s also important to test the model under a variety of conditions that mimic real-world scenarios. This can include varying environmental conditions such as lighting, weather, or poor road conditions, as well as other factors such as different user behaviours or system inputs.  

By testing the model under diverse conditions, you can identify potential issues and ensure that the model is robust enough to handle a range of scenarios. 
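
A simple way to make this coverage visible is to report the chosen metric per data slice rather than as a single average. The sketch below assumes each test sample is tagged with a condition label (for example lighting or weather); the tags and the threshold are illustrative assumptions.

```python
# Sketch of slice-based testing: per-condition accuracy instead of one global
# average, so weak slices (e.g. "night", "heavy_rain") are not averaged away.
from collections import defaultdict

def accuracy_per_condition(records, min_accuracy=0.95):
    """records: iterable of (condition_tag, y_true, y_pred) tuples."""
    totals, hits = defaultdict(int), defaultdict(int)
    for condition, truth, prediction in records:
        totals[condition] += 1
        hits[condition] += int(truth == prediction)
    report = {c: hits[c] / totals[c] for c in totals}
    failing = [c for c, acc in report.items() if acc < min_accuracy]
    return report, failing

report, failing = accuracy_per_condition([
    ("day_clear", "stop", "stop"),
    ("night", "stop", "yield"),
    ("heavy_rain", "yield", "yield"),
])
print(report, "below threshold:", failing)
```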

4. Conduct adversarial testing 

Adversarial testing involves testing the model’s performance under intentionally designed malicious inputs that aim to deceive or exploit the model. This is an essential step to ensure that the model is robust against adversarial attacks. For example, in the context of self-driving cars, an attacker could place stickers or graffiti on a stop sign to make the model misinterpret it as a yield sign. Adversarial testing would involve testing the model’s performance in identifying such manipulated signs and ensuring that the model correctly identifies them as stop signs. 
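
As a minimal sketch of one common digital attack, the snippet below runs a fast gradient sign method (FGSM) check in PyTorch; a full adversarial evaluation would also cover stronger attacks and physical perturbations such as the sticker example above. The `model`, `image`, and `label` inputs are assumptions: a classifier, an image tensor scaled to [0, 1] with shape (1, C, H, W), and its ground-truth class index.

```python
# FGSM robustness check (sketch): perturb the input in the direction that
# increases the loss and verify the prediction does not change.
import torch
import torch.nn.functional as F

def fgsm_robustness_check(model, image, label, epsilon=0.01):
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
    original_pred = model(image).argmax(dim=1)
    adversarial_pred = model(adversarial).argmax(dim=1)
    # Passes only if the prediction survives the perturbation.
    return bool((original_pred == adversarial_pred).all())
```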

5. Test continuously 

Testing and validation should not be a one-time event but an ongoing process throughout the model's lifecycle. As the model is deployed and used in the field, it is important to keep testing and validating its performance, monitoring for potential issues, and identifying areas for improvement.
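
A lightweight form of such monitoring is to compare the metric on a recent window of labelled production data against the value measured at release time. The baseline, tolerance, and function name below are illustrative assumptions.

```python
# Sketch of a recurring production check: flag a regression when accuracy on
# the latest labelled window drops below the release-time baseline.
BASELINE_ACCURACY = 0.97   # assumed value measured on the test set at release
TOLERANCE = 0.02           # assumed maximum acceptable drop

def check_for_regression(window_y_true, window_y_pred):
    correct = sum(t == p for t, p in zip(window_y_true, window_y_pred))
    accuracy = correct / len(window_y_true)
    if accuracy < BASELINE_ACCURACY - TOLERANCE:
        raise RuntimeError(f"Accuracy dropped to {accuracy:.2%}; investigate or retrain.")
    return accuracy
```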

6. Automate 

To facilitate regular and effective testing, it is recommended to automate as much of the testing process as possible. Automating testing helps to eliminate friction and ensures that tests are run consistently and frequently. This, in turn, enables models to be evaluated and validated rigorously throughout their lifecycle, as mentioned previously. As the model is deployed and used in real-world settings, regular testing and validation are critical to ensure that the model is performing as expected and that any issues or areas for improvement are promptly identified and addressed. 
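
One common way to automate such checks is to express them as a test suite that runs in CI on every model change. The pytest sketch below is hypothetical: `load_model`, `load_slice`, and the per-condition thresholds stand in for project-specific helpers and requirements.

```python
# Hypothetical CI test file (e.g. test_sign_classifier.py); load_model and
# load_slice are stand-ins for project-specific model/data loading helpers.
import pytest
from model_repo import load_model, load_slice  # assumed helpers, not a real package

@pytest.fixture(scope="module")
def model():
    return load_model("sign-classifier-candidate")

@pytest.mark.parametrize("condition, min_accuracy", [
    ("day_clear", 0.99),
    ("night", 0.97),
    ("heavy_rain", 0.95),
])
def test_accuracy_per_condition(model, condition, min_accuracy):
    X, y = load_slice(condition)
    accuracy = float((model.predict(X) == y).mean())
    assert accuracy >= min_accuracy
```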

Safety-critical model testing at NavInfo Europe

GuardAI is a comprehensive platform developed by NavInfo to aid in the testing of safety-critical machine learning models developed in-house. By implementing, among many others, the evaluation techniques described above, our engineers can focus on developing and improving models while ensuring that they are rigorously tested for safety and performance. GuardAI is currently available for beta testing, and interested parties can subscribe for free on the GuardAI landing page. With GuardAI, NavInfo is committed to ensuring that our safety-critical machine learning models meet the highest standards of safety and performance, and we are excited to share this solution with the wider community.
