Presentation
Enabling SE for AI with Test and Evaluation Harnesses for Learning Systems
Publication Date: 2022-09-22
Start Date: 2022-09-21
End Date: 2022-09-22
Event: AI4SE & SE4AI Workshop 2022
Location: Stevens Institute of Technology, Howe Building, Hoboken, NJ
Lead Authors:
Dr. Tyler Cody
There is increasing demand for operational uses of machine learning (ML); however, a lack of best practices for test and evaluation (T&E) of learning systems hinders supply. This presentation shares a new framework for best practices, described as T&E harnesses, that corresponds principally to the task of engineering a learning system, in contrast to the status quo task of solving a learning problem. The primary difference is one of scope. This presentation places T&E for ML into the broader scope of systems engineering processes. Importantly, two challenges to existing T&E best practices motivate the use of T&E harnesses for learning systems.
First, regarding acquisition, it is unclear how T&E processes that focus on model accuracy (or related metrics) on held-out data can give assurance that the model will achieve needed outcomes during operation. Needed outcomes are typically identified in a needs analysis phase conducted prior to developing an ML solution. Needs analysis occurs at the beginning of the systems engineering "V" process, whereas the ML solution is a component developed later, at the bottom of the "V", prior to subsystem- and system-level integration. Current best practices hold out data during component-level engineering to perform T&E. From this perspective, current concepts of T&E narrowly scope themselves to component-level testing, implicitly assuming that if an ML solution meets its component-level functional requirements, then it will provide needed outcomes after aggregation with the rest of the system. T&E harnesses tether ML solutions to the systems (e.g., platform, mission) wherein they operate, ensuring that evaluations of performance are contextualized in terms of the broader system and environment (e.g., their state, behavior, structure, etc.).
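To make the contrast concrete, the Python sketch below (our illustration; the presentation does not prescribe an implementation) compares status-quo component-level T&E against a harnessed evaluation that scores a model by a system-level outcome function under explicit operational contexts. All names here, SystemContext, outcome_fn, evaluate_over_contexts, and the scikit-learn-style model.predict interface, are hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative sketch of a T&E harness; names and structure are
# assumptions for exposition, not the authors' implementation.

@dataclass
class SystemContext:
    """Operational context the ML component is tethered to."""
    platform_state: dict   # e.g., sensor health, degradation level
    mission_phase: str     # e.g., "ingress", "search", "egress"

def component_metric(model, data) -> float:
    """Status-quo T&E: accuracy on held-out data, context-free."""
    X, y = data
    return float((model.predict(X) == y).mean())

def harness_metric(model, data, context: SystemContext,
                   outcome_fn: Callable[..., float]) -> float:
    """Harnessed T&E: score the model by the system-level outcome it
    produces under a given operational context, not by raw accuracy."""
    X, y = data
    preds = model.predict(X)
    return outcome_fn(preds, y, context)

def evaluate_over_contexts(model, data, contexts: Sequence[SystemContext],
                           outcome_fn: Callable[..., float]) -> dict:
    """Contextualized evaluation: one score per operational condition,
    rather than a single aggregate over pooled held-out data."""
    return {c.mission_phase: harness_metric(model, data, c, outcome_fn)
            for c in contexts}
```

The design point of the sketch is that the outcome function and the operational context, not the model in isolation, define the unit under test.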
Second, regarding operations, the no free lunch theorems of statistical learning theory suggest that no single model can be optimized for all conditions at once. The concept of T&E as a (dominantly) pre-deployment activity is therefore in conflict with the first principles behind ML solutions. That is, as conditions change, e.g., between operations, as platforms degrade, or with changes in use, if there is a material difference in the data that flows through the model, then the performance of the model is expected to change. This suggests that domain adaptation is the rule, not the exception. Conversely, so-called universal models, e.g., general-purpose vision models, are the exception, not the rule. T&E harnesses provide a construct for coordinating continuous T&E and continuous re-engineering of ML solutions.
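As one hedged illustration of what such coordination might look like, the sketch below pairs a simple drift statistic (the population stability index, a common proxy for a "material difference" in the data) with a retraining hook. The PSI choice, the threshold value, and the retrain_fn interface are illustrative assumptions, not part of the presentation.

```python
import numpy as np

# Illustrative sketch of continuous T&E coordinated by a harness;
# the drift test and retraining hook are assumptions for exposition.

def population_stability_index(ref: np.ndarray, cur: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference feature distribution and current
    operational data; larger values indicate distribution shift."""
    edges = np.histogram_bin_edges(ref, bins=bins)
    p = np.histogram(ref, bins=edges)[0] / len(ref) + 1e-6
    q = np.histogram(cur, bins=edges)[0] / len(cur) + 1e-6
    return float(np.sum((p - q) * np.log(p / q)))

def continuous_te_step(model, ref_batch, live_batch,
                       retrain_fn, threshold: float = 0.2):
    """One iteration of the harness loop: test operational data for
    drift and trigger re-engineering (retraining) when detected."""
    psi = population_stability_index(ref_batch, live_batch)
    if psi > threshold:  # conditions have materially changed
        model = retrain_fn(model, live_batch)  # domain adaptation step
    return model, psi
```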
This presentation draws from recent findings in experimental design for ML, combinatorial interaction testing of ML solutions, and the general systems modeling of ML. The concept of T&E harnesses is closely tied to existing models of systems engineering processes. We conclude that existing best practices for T&E form a subset of what is needed to rigorously test for system-level satisfaction of stakeholder needs.