While the recommendations, similar items, and so on from Myrrix should be useful and meaningful on a wide range of data "out of the box", their quality can typically be improved with tuning. This includes tuning the machine learning algorithm itself, tuning the input, and improving results by adding business rules.
Tuning the Machine Learning Algorithm
The quality of the results returned from Myrrix depends in part on the quality of the underlying matrix factorization that is used to model users and items. The better the factorization models actual users and items, the better the results will be.
The particular algorithm used by Myrrix, called ALS, has two parameters to tune: number of features, and λ (lambda). These control the complexity of the model and its resistance to overfitting, respectively. It is not necessary to understand them to tune them. The best values will depend on the nature of the data fed into Myrrix.
Fortunately, the Serving Layer software contains a program, ParameterOptimizer (from version 0.11), that can test several combinations of values for these two parameters and report the best combination.
To use it, export a representative sample of real data as one or more CSV files and collect them in a directory. Because this process runs on one machine, the input may need to be a subsample of the full data set in order to fit. Hundreds of thousands of users and items, and millions of data points, are plenty to test on. The optimization can take a long time to run, and too much data makes it take much longer.
Optimizing the two values above means finding good settings of the Java system properties model.features and model.als.lambda. The test tries a series of values for each within a given reasonable range, in all combinations, and reports the values that maximize a measure of result quality. Internally, part of the input data you supply is set aside: specifically, the strongest user-item interactions in the data. The test then builds a recommender without this data and measures how often the held-out items appear in the results. This is known as precision; in particular, the test optimizes for mean average precision.
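The quality measure described above can be illustrated with a short sketch. It computes average precision for one user's ranked recommendations against that user's held-out items, using one common normalization, and then averages over users. The class and method names are hypothetical, and this is not the code that ParameterOptimizer actually runs.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not the evaluation code that ships with Myrrix.
class MeanAveragePrecisionSketch {

  // Average precision for one user: at each rank where a held-out item
  // appears, take the precision at that rank, then normalize by the number
  // of held-out items that could have appeared in the list.
  static double averagePrecision(List<Long> recommended, Set<Long> heldOut) {
    if (recommended.isEmpty() || heldOut.isEmpty()) {
      return 0.0;
    }
    int hits = 0;
    double precisionSum = 0.0;
    for (int rank = 1; rank <= recommended.size(); rank++) {
      if (heldOut.contains(recommended.get(rank - 1))) {
        hits++;
        precisionSum += (double) hits / rank;
      }
    }
    return precisionSum / Math.min(heldOut.size(), recommended.size());
  }

  // Mean of the per-user average precision over all users with held-out data.
  static double meanAveragePrecision(Map<Long, List<Long>> recommendedByUser,
                                     Map<Long, Set<Long>> heldOutByUser) {
    if (heldOutByUser.isEmpty()) {
      return 0.0;
    }
    double sum = 0.0;
    for (Map.Entry<Long, Set<Long>> e : heldOutByUser.entrySet()) {
      List<Long> recs = recommendedByUser.getOrDefault(e.getKey(), List.of());
      sum += averagePrecision(recs, e.getValue());
    }
    return sum / heldOutByUser.size();
  }

  public static void main(String[] args) {
    // Held-out items 102 and 104 appear at ranks 1 and 3 of the results.
    System.out.println(averagePrecision(List.of(102L, 105L, 104L), Set.of(102L, 104L)));
  }
}
```

A higher value means held-out items appear more often and nearer the top of the results, which is the property the parameter search tries to maximize.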
For example, suppose data has been collected in the directory test/data, and 10% of all the data will be used. The optimizer will try 3 different numbers of features between 20 and 150, and 3 values of λ from 0.0001 to 1:
java -Xmx4g -cp myrrix-serving-x.y.jar net.myrrix.online.eval.ParameterOptimizer \
test/data 3 0.1 \
After running, potentially for hours, it reports the best combination of values for these parameters. Suppose it is 100 features and a λ of 0.1. To use these parameters with the Serving Layer, add them to the command line:
java -Dmodel.features=100 -Dmodel.als.lambda=0.1 ...
Tuning the Input
Each input datum describes a connection between a user and an item, with an optional strength value. This makes the input more expressive, as it is possible to indicate that some interactions are more important than others. This in turn makes it feasible to combine multiple, potentially quite different types of data (purchases in addition to page views, for example). The strength value is additive: it sums over all data with the same user and item, which makes it possible to learn incrementally about the strength of a user-item association.
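The additive behavior of strength values can be sketched as follows. This is only an illustration of the summing rule, not Myrrix's ingestion code. It assumes input lines of the form user,item or user,item,strength, and it assumes that a missing strength counts as 1.0; both the line format details and the class name are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; shows how strengths for the same pair add up.
class StrengthAccumulationSketch {

  // Sums strengths per user-item pair. The map key is "user,item".
  static Map<String, Double> accumulate(List<String> csvLines) {
    Map<String, Double> total = new HashMap<>();
    for (String line : csvLines) {
      String[] fields = line.trim().split(",");
      // Assumption: a line without a third field has strength 1.0.
      double strength = fields.length > 2 ? Double.parseDouble(fields[2]) : 1.0;
      total.merge(fields[0] + "," + fields[1], strength, Double::sum);
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Double> total =
        accumulate(List.of("5,101", "5,101,2.5", "5,102,0.5"));
    System.out.println(total.get("5,101")); // prints 3.5
  }
}
```

Because the values simply add, a later event for the same user and item strengthens the association rather than replacing it.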
However, this introduces new questions: how should different types of input be weighted? There are no absolute answers, and the model result is not extremely sensitive to the exact choice of weights. A few principles should yield a reasonable choice of weights for most use cases. Any choice can be tested with evaluation, as above.
- Relative weight values should reflect relative frequency: the rarer an event, the more weight it deserves. A model might combine both page views and purchases. If there are about 200 times as many page views as purchases overall, then a purchase may reasonably be weighted 200 times higher than a page view. This can establish relative weightings for most input types.
- Relative weight should also reflect business value. For example, purchase of an item worth $100 is probably 10 times more significant to the business than a purchase worth $10. These purchases could be weighted by the value of the purchase. This can be combined with the above: a page view for a $100 item should be weighted 10 times more than a view of a $10 item. The weight might be price times overall view-to-purchase conversion rate.
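The two principles above can be combined in a small sketch. It weights a purchase by the item's price and a page view by price times the overall view-to-purchase conversion rate, as suggested in the text. The class and method names are hypothetical, and the 1-in-200 conversion rate is only the example figure used above.

```java
// Illustrative sketch only; one possible weighting scheme, not a Myrrix API.
class InputWeightSketch {

  // A purchase is weighted by the business value of the item.
  static double purchaseWeight(double price) {
    return price;
  }

  // A page view is weighted by price, scaled down by how rarely a view
  // converts into a purchase overall.
  static double pageViewWeight(double price, double viewToPurchaseRate) {
    return price * viewToPurchaseRate;
  }

  public static void main(String[] args) {
    double rate = 1.0 / 200.0; // example: 200 page views per purchase
    System.out.println("purchase of $100 item: " + purchaseWeight(100.0));
    System.out.println("view of $100 item:     " + pageViewWeight(100.0, rate));
    System.out.println("view of $10 item:      " + pageViewWeight(10.0, rate));
  }
}
```

With these definitions a purchase outweighs a view of the same item by a factor of 200, and a view of a $100 item outweighs a view of a $10 item by a factor of 10.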
Adding business logic with a Rescorer
Sometimes it is necessary to post-process and modify the answers returned from the recommender. For example, it's not helpful to recommend to buyers a product that is currently out of stock. Or, it may be useful to artificially boost the ranking of products that are currently being promoted.
These sorts of business rules can be implemented outside of Myrrix. For example, it is easy to retrieve recommendations and then manually filter them, or re-rank them by changing scores. This usually requires requesting more recommended items than will be needed, since filtering and re-ranking could make an item further down the list a top result. In many cases this will be a simple and perfectly reasonable way to filter and rescore results.
However, this can also be done inside Myrrix, by using the Rescorer interface. Users of Mahout will recognize this interface, which is extended in Myrrix. While this is a more complex way to integrate business logic, it can potentially be faster. Since filtering and rescoring are done on the server side, it's possible to avoid having to request and return excess results that won't be used.
Imagine making recommendations for a user from a universe of 5 items: 101, 102, 103, 104, 105. Imagine that the estimated strength for each of these items is, ranked in order:
102,1.0 105,0.9 104,0.8 103,0.4 101,0.3
We need the top 2 recommendations for this user. Calling /recommend would return the first two items, of course.
Now, imagine that items 104 and 101 are being promoted, and the business has chosen to try to model this simply by increasing these items' scores by 20%. Requesting and rescoring the top 2 items alone would not discover that item 104, with new score 0.96, has become the new second-best recommendation. But, if the rescoring were done on the server side, in Myrrix, then the top 2 results would be 102,1.0 and 104,0.96 as expected.
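The difference between the two approaches can be reproduced with a short sketch using the five scores above. It is a plain simulation of the ranking arithmetic, not a call into Myrrix; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative simulation of client-side versus server-side rescoring.
class RescoreOrderSketch {

  // Returns the IDs of the n highest-scoring items, best first.
  static List<Long> topN(Map<Long, Double> scores, int n) {
    return scores.entrySet().stream()
        .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
        .limit(n)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  // Raises the score of every promoted item by 20%.
  static Map<Long, Double> boost(Map<Long, Double> scores, Set<Long> promoted) {
    Map<Long, Double> out = new HashMap<>();
    scores.forEach((id, s) -> out.put(id, promoted.contains(id) ? s * 1.2 : s));
    return out;
  }

  // Client side: fetch only the top n, then rescore what was fetched.
  static List<Long> clientSide(Map<Long, Double> scores, Set<Long> promoted, int n) {
    Map<Long, Double> fetched = new HashMap<>();
    for (long id : topN(scores, n)) {
      fetched.put(id, scores.get(id));
    }
    return topN(boost(fetched, promoted), n);
  }

  // Server side: rescore every candidate, then take the top n.
  static List<Long> serverSide(Map<Long, Double> scores, Set<Long> promoted, int n) {
    return topN(boost(scores, promoted), n);
  }

  public static void main(String[] args) {
    Map<Long, Double> scores =
        Map.of(102L, 1.0, 105L, 0.9, 104L, 0.8, 103L, 0.4, 101L, 0.3);
    Set<Long> promoted = Set.of(104L, 101L);
    System.out.println(clientSide(scores, promoted, 2)); // prints [102, 105]
    System.out.println(serverSide(scores, promoted, 2)); // prints [102, 104]
  }
}
```

The client-side variant never sees item 104, so it cannot promote it; requesting more than 2 items would fix this at the cost of transferring results that are then discarded.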
To enable rescoring, implement the interface net.myrrix.online.RescorerProvider. This is a factory class which produces IDRescorer objects that can filter and rescore the results of one request.
Rescorer implementations can be returned for different methods. The RescorerProvider can produce different implementations for each of the following:
Not all of these must be implemented; extend AbstractRescorerProvider and then only override methods that need rescoring and filtering logic.
Each factory method receives zero or more String arguments. These are values passed with the request using rescorerParams=…. These may be used to parameterize the implementations that are returned for the specific request.
Implementations returned from RescorerProvider are all IDRescorers, which operate on an item ID (except getMostSimilarItemsRescorer(), which operates on a LongPair: two IDs). They implement isFiltered(), which decides whether an item should be removed from consideration even before computing a score. They implement rescore() as well, in which an original value may be transformed according to some logic. The returned value is then used to rank the item in results.
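A rescorer of this shape might look like the sketch below. To keep the example self-contained, it declares a local stand-in interface with the two methods described above rather than importing the real one; the parameter types are assumptions modeled on Mahout's IDRescorer, and the StockAndPromotionRescorer class is hypothetical. It filters out-of-stock items and boosts promoted ones by 20%, using in-memory sets for fast lookups.

```java
import java.util.Set;

// Illustrative sketch only; compiles without the Myrrix or Mahout JARs.
class PromotionRescorerSketch {

  // Local stand-in for the IDRescorer contract; signatures are assumed.
  interface IDRescorer {
    boolean isFiltered(long id);
    double rescore(long id, double originalScore);
  }

  // Hypothetical implementation of two simple business rules.
  static final class StockAndPromotionRescorer implements IDRescorer {
    private final Set<Long> outOfStock;
    private final Set<Long> promoted;

    StockAndPromotionRescorer(Set<Long> outOfStock, Set<Long> promoted) {
      this.outOfStock = outOfStock;
      this.promoted = promoted;
    }

    // Out-of-stock items are dropped before any score is computed.
    @Override
    public boolean isFiltered(long id) {
      return outOfStock.contains(id);
    }

    // Promoted items get a 20% boost; all others keep their score.
    @Override
    public double rescore(long id, double originalScore) {
      return promoted.contains(id) ? originalScore * 1.2 : originalScore;
    }
  }

  public static void main(String[] args) {
    IDRescorer rescorer =
        new StockAndPromotionRescorer(Set.of(103L), Set.of(104L, 101L));
    System.out.println(rescorer.isFiltered(103L));   // prints true
    System.out.println(rescorer.rescore(102L, 1.0)); // prints 1.0
  }
}
```

A real implementation would implement the library's own interface instead of the stand-in and would be returned from the appropriate factory method of a RescorerProvider.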
Implementations of these methods must be fast because they will be called for every candidate item. It is necessary to preload any relevant information into memory and make it available for very fast lookups and decisions at runtime. Requesting from a web service in real time, for example, will be too slow. Consider using ReloadingReference as a simple way to periodically reload a data structure from its source.
Deploying a RescorerProvider
Once a RescorerProvider is implemented, it must be deployed in the Serving Layer. The class file or JAR containing it must be added to the server class path. The usual java command must be modified to specify an additional JAR file, by adding -cp myrrix-serving-x.y.jar:your.jar to the command line.
Then, the flag --rescorerProviderClass is set to the fully-qualified name of the implementation class. Note that several implementations can be specified, separated by commas, in which case all will be applied in the given order.
java -cp myrrix-serving-x.y.jar:your.jar net.myrrix.web.Runner ... \
Or, in distributed mode, if not found in the classpath, it will be loaded from a JAR file found on the distributed file system at