Evaluating Recommenders
While it's important to operate a stable, scalable recommender system, it's even more important that the system produces good recommendations. Myrrix provides basic tools, as does Apache Mahout, for evaluating the quality of recommendations produced by a recommender.
Note that evaluation support, available in version 0.6 and beyond, remains experimental and subject to change.
Background
Evaluation would be simple if it were already known which recommendations are good for a given set of data. But determining that is exactly what the recommender is supposed to do; if these answers were already known, the recommender would be unnecessary. Instead, as in most machine learning evaluations, a small part of all data is held out, and the remaining data is sent to the recommender to train it. The held-out test data are taken to represent "good" recommendations, and the test is intended to find out how well the recommender system returns these good recommendations. Some tests select each user's most strongly associated items as the "good" test data; others select the most recent data as test data.
This is not a perfect test, since the test data held out from a user may not actually represent the best recommendations. For example, in the classic context of recommending movies, imagine a user who has viewed three Star Wars films and then Shrek, an animated comedy. The Shrek data point might be held out as the test datum. Given only that the user views Star Wars movies, a recommender would likely suggest more films in that series. But the resulting evaluation will essentially ask whether the recommender comes up with Shrek, which is unlikely. In spite of these issues, the evaluation schemes that follow have value, and can be used to compare implementations relative to each other.
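As an illustration of the hold-out step described above, here is a minimal sketch of the "most-associated items" strategy for a single user. The class name, method name, and the use of a map from item ID to strength are all hypothetical choices made for this example; this is not Myrrix's own implementation, and how Myrrix rounds the test-set size is not known from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a per-user hold-out; not part of Myrrix. */
final class HoldOutSketch {

  /**
   * Returns the item IDs to hold out as test data for one user: the most
   * strongly associated items, making up (1 - trainingFraction) of the
   * user's items. Rounding to the nearest count is an assumption.
   */
  static List<String> selectTestItems(Map<String, Double> itemStrengths,
                                      double trainingFraction) {
    List<Map.Entry<String, Double>> entries = new ArrayList<>(itemStrengths.entrySet());
    // Sort so that the strongest associations come first.
    entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
    int numTest = (int) Math.round(entries.size() * (1.0 - trainingFraction));
    List<String> testItems = new ArrayList<>();
    for (Map.Entry<String, Double> entry : entries.subList(0, numTest)) {
      testItems.add(entry.getKey());
    }
    return testItems;
  }
}
```

Everything not returned by this method would be sent to the recommender as training data. Holding out the strongest associations makes the "good" label more plausible, but it is still only a proxy, as the Shrek example shows.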
Area Under Curve (AUC)
Area under curve is related to the receiver operating characteristic (ROC). ROC, in turn, as applied to recommenders, reveals the proportion of good recommendations that are returned as the number of recommendations increases. AUC is a summary statistic which may be interpreted as the probability that the recommender would rank a random good item from the test set higher than a random item. It is between 0 and 1, where 0.5 corresponds to a random ranking. Larger is better, and the value ought to be above 0.9.
To run an AUC evaluation test, add the myrrix-online module to your project by adding myrrix-serving-x.y.jar to your classpath. Prepare a directory containing (possibly compressed) CSV files of recommender data. This can be the same directory a real stand-alone Serving Layer is running from. If this directory is named "recommender-data", then to run an evaluation using 50% of all data, and using 90% of that as training data (leaving 10% as test data), execute Java code like:
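The probabilistic interpretation of AUC above can be sketched directly: compare every good test item's score against every other item's score and count how often the good item wins. The class and method names below are hypothetical, the exhaustive pairwise comparison is chosen for clarity, and counting ties as half a win is a common convention assumed here. How Myrrix's evaluator actually computes or samples the statistic is not shown on this page.

```java
/** Hypothetical sketch of the AUC statistic; not part of Myrrix. */
final class AucSketch {

  /**
   * Returns the fraction of (good, other) score pairs in which the good item
   * scored higher than the other item. Ties count as half a win (assumption).
   */
  static double auc(double[] goodScores, double[] otherScores) {
    if (goodScores.length == 0 || otherScores.length == 0) {
      throw new IllegalArgumentException("Both score arrays must be non-empty");
    }
    double wins = 0.0;
    for (double good : goodScores) {
      for (double other : otherScores) {
        if (good > other) {
          wins += 1.0;
        } else if (good == other) {
          wins += 0.5;
        }
      }
    }
    return wins / ((double) goodScores.length * otherScores.length);
  }

  public static void main(String[] args) {
    double[] good = {0.9, 0.8, 0.3};
    double[] other = {0.5, 0.2, 0.1};
    // The good items win 8 of the 9 comparisons.
    System.out.println(auc(good, other));
  }
}
```

A recommender that always scores good items above the rest yields 1.0 under this sketch, and one that cannot tell them apart yields about 0.5.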
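import java.io.File;
import net.myrrix.online.eval.*;
...
AUCEvaluator evaluator = new AUCEvaluator();
// Use 50% of all data (0.5); train on 90% of that (0.9), leaving 10% as test data
EvaluationResult stats = evaluator.evaluate(new File("recommender-data"), 0.9, 0.5);
// Print the resulting AUC statistic
System.out.println(stats);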
This will eventually print a value like "0.945", depending on your data. This is the AUC statistic. Higher is better.
Precision and Recall
Precision and recall are well-known statistics in information retrieval. Precision measures the proportion of recommendations that were good, and recall measures the proportion of all good recommendations that were recommended. (The test framework will actually request a number of recommendations equal to the number of all good recommendations; this makes precision and recall equal for purposes of this test.)
Precision and recall tests are particularly affected by the problem of not knowing the good recommendations beforehand: an item the user would genuinely like still counts as a miss if it is not in the held-out test data. As a result, precision and recall can be low even for good recommenders. Higher is better. To run this evaluation, follow the code above, using PrecisionRecallEvaluator instead.
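To make the definitions above concrete, here is a minimal sketch of both statistics for one user. The class and method names are hypothetical and this is not the code behind PrecisionRecallEvaluator. It assumes the recommendation list contains no duplicates. When the number of recommendations requested equals the number of good items, as the test framework does, the two denominators are the same and so the two values coincide.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of precision and recall; not part of Myrrix. */
final class PrecisionRecallSketch {

  /** Counts the recommended items that are also in the set of good items. */
  static int countHits(List<String> recommended, Set<String> good) {
    int hits = 0;
    for (String item : recommended) {
      if (good.contains(item)) {
        hits++;
      }
    }
    return hits;
  }

  /** Proportion of the recommendations that were good. */
  static double precision(List<String> recommended, Set<String> good) {
    return recommended.isEmpty() ? 0.0 : (double) countHits(recommended, good) / recommended.size();
  }

  /** Proportion of all good items that were recommended. */
  static double recall(List<String> recommended, Set<String> good) {
    return good.isEmpty() ? 0.0 : (double) countHits(recommended, good) / good.size();
  }

  public static void main(String[] args) {
    Set<String> good = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
    // Four recommendations requested, matching the four good items.
    List<String> recommended = Arrays.asList("A", "X", "C", "Y");
    System.out.println(precision(recommended, good) + " " + recall(recommended, good));
  }
}
```

In the example, two of four recommendations are good and two of four good items are recommended, so both statistics are 0.5.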
Estimated Strength
A third, less common type of evaluation metric is provided for completeness. The recommender can estimate a strength of association between a user and any item. For the held-out test items, a good recommender ought to estimate a value close to (or above) 1.0, since these are taken to be "good" recommendations. So, it is also possible to measure the average difference between 1.0 and the estimated strength value over the test data, which is just the average of "1 - estimate". Note that this quantity can be negative, since an estimate can exceed 1.0. To try this out, use EstimatedStrengthEvaluator. Here, lower is better.
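The average of "1 - estimate" described above amounts to a few lines of arithmetic. The sketch below uses hypothetical class and method names and takes the estimated strengths for the test items as a plain array; it is not the code behind EstimatedStrengthEvaluator.

```java
/** Hypothetical sketch of the estimated strength metric; not part of Myrrix. */
final class EstimatedStrengthSketch {

  /**
   * Returns the mean of (1 - estimate) over the estimated strengths of the
   * held-out test items. Estimates above 1.0 contribute negative terms.
   */
  static double averageDifference(double[] estimates) {
    if (estimates.length == 0) {
      throw new IllegalArgumentException("At least one estimate is required");
    }
    double sum = 0.0;
    for (double estimate : estimates) {
      sum += 1.0 - estimate;
    }
    return sum / estimates.length;
  }

  public static void main(String[] args) {
    // Estimates of 0.5 and 0.75 fall short of 1.0 by 0.5 and 0.25.
    System.out.println(averageDifference(new double[] {0.5, 0.75}));
    // Estimates above 1.0 make the result negative.
    System.out.println(averageDifference(new double[] {1.5, 1.25}));
  }
}
```

The first call averages 0.5 and 0.25 to give 0.375; the second gives -0.375, showing how the quantity goes negative when estimates exceed 1.0.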
