Automated testing of machine learning models can help to dramatically reduce the amount of time and effort that your team has to dedicate to debugging them. Not applying these techniques properly, however, could potentially do a great deal of harm if the process is completely unmonitored and doesn’t adhere to a list of best practices. Some tests can be applied to models after the training stage is over while others should be applied directly to test the assumptions that the model is operating under.
Traditionally, it’s proven somewhat difficult to test machine learning models because they’re very complex containers that often play host to a number of learned operations that can’t be clearly decoupled from the underlying software.. Conventional software can be broken up into individual units that each accomplish a specific task. The same can’t be said of ML models, which are often solely the product of training and therefore can’t be decomposed into smaller parts.
Testing and evaluating the data sets that you use for training could be the first step in unraveling this problem.
Monitoring Data Sets in a Training Environment
ML testing is very different from testing application software because anyone performing checks on ML models will find that they’re attempting to test something that is probabilistic as opposed to deterministic. An otherwise perfect model could occasionally make mistakes and still be considered the best possible model that someone could develop. Engineers working on a spreadsheet or database program wouldn’t be able to tolerate even the slightest rounding errors, but it’s at least somewhat acceptable to find the occasional flaw in the output of a program that processes data by way of learned responses. The level of variance will differ somewhat depending on the tasks that a particular model is being trained to accomplish, but it may always be there regardless.
As a result, it’s important to at least examine the initial data that’s being used to train ML models. If this data doesn’t accurately represent the kind of information that a real-world environment would thrust onto a model, then said model couldn’t ever hope to perform adequately when such input is finally given. Decent input specifications will help to ensure that the model comes away with a somewhat accurate representation of natural variability in whatever industry it’s performing a study in.
Pure performance measurements can come from essentially any test set, but data scientists will normally want to specify the hyperparameters of their model to provide a clear metric by which to judge performance while taking said measurements. Those who consistently use one model over another solely for its performance on a particular test set may end up fitting test sets to models to find something that performs exactly as they want it to.
Those who are working with smaller data sets will need to find some way to evaluate them in spite of their diminutive size.
Managing a Smaller Set of Data Safely
Those working with particularly large data sets have typically gone with 60-20-20 or 80-10-10 splits. This has helped to strike a decent balance between the competing needs of reducing potential bias while also ensuring that the simulations run fast enough to be repeated several times over.
Those working with a smaller data set might find that it simply isn’t representative enough, but for whatever reason it isn’t possible to increase the amount of information put into the test set. Cross-validation might be the best option for those who find themselves in this sort of situation. This is normally used by those in the applied ML field to compare and select models since it’s relatively easy to understand.
K-fold cross-validation algorithms are often used to estimate the skill of a particular ML model on new data irrespective of the size of the data in question. No matter what method you decide to try, though, your team needs to keep the concepts of testing and validation data separate when training your ML model. When you square these off in a training data vs. test data free-for-all, the results should come out something like this:
- Test sets are essentially examples that are only ever used to judge the performance of a classifier that’s been completely specified.
- Validation sets are deployed when data scientists are tuning the parameters of a classifier. You might start using a validation set to find the total number of hidden units that exist in a predefined neural network.
- Training sets are used exclusively for learning. Many experts will define these as sets that are designed to fit the parameters of the initial classifier.
Segmenting testing, validation and training sets might not seem natural to those who are used to relying on one long inclusive data set in order to ensure that their ML models work in any scenario. Nevertheless, it’s vital to have them separated as much as possible. Test data management should always be part of your QA workflows. On top of this, it’s important to keep an eye on how a model responds as it learns from the data even if it does appear that accuracy increases over time. This is because there are several high-quality insights an operator can derive from the learning process.
Taking a Closer Look at Weights During the Training Process
An ideal model will enjoy lower losses and a higher degree of accuracy over time, which is often more than enough to please the data scientists that develop them. However, you can learn more by taking a close look at what areas are receiving the heaviest weights during training. A buggy piece of code could produce outcomes where different potential choices aren’t given different weights. Alternatively, it could be that they’re not really stacked against one another at all. In these cases, the overall results might end up looking realistic when they’re actually errant. Finding bugs in this way is especially important in a world where ML agents are being used to debug conventional software applications.
Taking a closer look at the weights themselves can help specialists to discover these problems before a model ever makes its way out into the wild. Debugging ML models as though they were application software will never work simply because so many aspects of their neural networks come from exclusively learned behaviors that aren’t possible to decompose into something that could be mapped on a flowchart. However, it should be possible to detect certain types of problems by paying close attention to these weights.
Developing any piece of software takes quite a bit of time, and the fact that ML models have to be trained means that they’ll often take even longer. Give yourself enough lead time and you should find that your testing, validation and training sets split evenly into different neat packages that make the process much simpler.