Real-Time, Or Big Batch? Yes.
We often must choose between speed and scale: handle lots of data, or get an answer quickly, but not both. Today, to keep up with "Big Data", we resort to massively parallelized batch-oriented infrastructure like Apache Hadoop™ to finish computations in hours instead of years. And yet, we still need instant answers to user requests. Recommender engines exemplify this need to have it both ways: recommendations must update to reflect new ratings and actions immediately. Yet, the underlying machine learning models may need thousands of hours of computing time to fully update from new data.
Apache Mahout™ already provides popular infrastructure for building real-time recommender engines for small- and medium-sized data via the Taste subproject. Separately, it provides infrastructure for computing recommendation and clustering models over massive data sets in batch using Hadoop. But both elements are needed at the same time, in one cooperating system -- not separately. Myrrix has both, as two cooperating layers:
- a Serving Layer, which answers requests, records input and provides recommendations and clusters in real-time, and
- a Computation Layer, which does the heavy lifting in the background to update the machine learning model for the Serving Layer, using parallelized machine learning algorithms
Myrrix brings those two elements together into one complete, ready-to-run system. Its Serving Layer is an HTTP server with a REST-style API, and can receive updates and compute updated results in milliseconds. It also communicates with the Computation Layer, a series of jobs that run on a Hadoop cluster, to run the larger-scale machine learning algorithms and fully update the underlying machine learning models periodically.
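As a sketch of what talking to the Serving Layer over its REST-style API might look like, the snippet below builds an ingest request and a recommendation request. The host, port, and endpoint paths here are illustrative assumptions, not a reference for the actual API; consult the Myrrix documentation for the real endpoints.

```python
import urllib.request

# Hypothetical Serving Layer base URL -- an assumption for illustration.
BASE = "http://localhost:8080"

# Record that user "Jane" interacted with item "sku123" (hypothetical path).
ingest = urllib.request.Request(
    f"{BASE}/pref/Jane/sku123", data=b"1.0", method="POST")

# Ask for recommendations for "Jane" (hypothetical path).
recommend = urllib.request.Request(f"{BASE}/recommend/Jane")

# In a live deployment each request would be sent with
# urllib.request.urlopen(...); here we only show how they are formed.
print(ingest.get_method(), ingest.full_url)
print(recommend.get_method(), recommend.full_url)
```

The point of the sketch is the shape of the interaction: writes and reads are plain HTTP calls, which is what lets the Serving Layer answer in milliseconds while the Computation Layer rebuilds models in the background.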
Your Place Or Ours?
Each of the two layers operates independently, and can be used by itself.
The Serving Layer may be run on its own, in stand-alone mode, on one machine. Without a Computation Layer, it will attempt to run machine learning algorithms locally. While this will not scale to very large data sets, or provide fault tolerance, it can be a simple and effective solution for small- and medium-sized applications. The Serving Layer itself is open source and freely available to be run in this way.
The Computation Layer may likewise be run on its own. It is a series of Hadoop jobs, which receive input and write output to a distributed file system like HDFS or Amazon S3. For applications that do not need a real-time serving component, but must process a large amount of data, the Computation Layer may be a solution by itself.
But the two are best when combined, and run in distributed mode. Many instances of the Serving Layer are deployed to handle a large volume of real-time traffic, and all cooperate with a Computation Layer to perform the large-scale machine-learning computations.
Because both layers are built on standard technologies and components, both are easy to deploy as software inside your own data center or cluster, or to use as a hosted platform-as-a-service from Myrrix, running on Amazon Web Services.
Reality does not often fit into the idealized models of clustering and recommender theory. Myrrix takes care of many pesky details, out of the box:
- No Ratings. Recommenders have historically used explicit item ratings; for example, Netflix invites users to rate movies on a scale of 1-5 stars. However, ratings can be "noisy", and are often simply not available. Myrrix employs a generalized model that can ingest any event: clicks, views, and so on -- even ratings.
- Cold Start. Some systems struggle to create recommendations for a new user until a certain amount of data is available, or until a model has been rebuilt. Myrrix can make recommendations for a user immediately after the user's first data point is fed in.
- Temporary Users. Sometimes users need recommendations but are not previously known, registered users. Myrrix provides special support for creating recommendations for users whose history is not available in the model.
- Non-numeric Data. Unlike in Mahout, user and item data need not be numeric. "Jane" is an acceptable user identifier.
- Privacy & Security. Real user and item data need not be sent to the server; the client may send opaque hashes instead.
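The opaque-hash idea in the last point can be sketched as follows. The hash function and salt here are illustrative choices, not something Myrrix prescribes: the client replaces a real identifier with a one-way hash before sending it, and because Myrrix accepts non-numeric IDs, the hash itself serves as the user ID.

```python
import hashlib

def opaque_id(real_id: str, salt: str = "app-secret") -> str:
    """One-way hash of a real identifier; the server only ever sees this."""
    return hashlib.sha256((salt + real_id).encode("utf-8")).hexdigest()

# The client maps "Jane" to a stable, opaque token and uses it as the user ID.
token = opaque_id("Jane")

# The same input always yields the same token, so Jane's history stays linked
# across requests -- but the server cannot recover "Jane" from the token alone.
assert token == opaque_id("Jane")
```

The salt keeps a third party from confirming a guessed identifier by hashing it themselves; it stays on the client side.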
Myrrix employs a recommender engine technique based on large, sparse matrix factorization. From input data, it learns a small number of "features" that best explain users' and items' observed interactions. This same basic idea goes by many names in machine learning, like principal component analysis or latent factor analysis. Myrrix uses a modified version of an Alternating Least Squares algorithm to factor matrices. The essential elements of this approach are explained in, among other resources:
- "Collaborative Filtering for Implicit Feedback Datasets" by Hu, Koren and Volinsky
- "Large-scale Parallel Collaborative Filtering for the Netflix Prize" by Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan
- Alex Smola's Scalable Machine Learning course notes: Lecture 8, Section 2, from slide 34 in particular
The algorithm is attractive because:
- Its input is not (necessarily) explicit rating data. It can be fed information about actions, like user clicks and page views.
- It can be computed in parallel, which makes it amenable to efficient computation on Apache Hadoop.
- It has the strengths of latent factor models, and can discover hidden attributes that connect, for example, pairs of items that have never been observed to go together.
- The algorithm also lends itself to approximate real-time updates. While the Computation Layer must periodically trigger a large computation to completely learn from the new data, the Serving Layer can still perform approximate learning on new data by exploiting certain properties of this algorithm.
- It is nearly immune to the "cold start" problem and can provide quality recommendations for very new users or items.
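The mechanics behind these points can be sketched in a few lines. This toy version follows the weighting scheme from the Hu, Koren, and Volinsky paper cited above -- observed interactions become binary preferences, with confidence growing with interaction strength -- but uses just one latent feature per user and item for brevity (real systems use many, and this is not Myrrix's actual implementation). The fold-in at the end illustrates the approximate real-time update and cold-start points: a brand-new user gets a usable feature from one closed-form update, with no retraining.

```python
# Toy weighted ALS for implicit feedback. r[u][i] is an observed
# interaction strength (clicks, views, ...); 0 means "never observed".
r = [
    [3, 0, 1],
    [0, 2, 0],
    [1, 0, 4],
]
alpha, lam = 40.0, 0.1                       # confidence scale, regularization
n_users, n_items = len(r), len(r[0])

p = [[1.0 if v > 0 else 0.0 for v in row] for row in r]   # binary preference
c = [[1.0 + alpha * v for v in row] for row in r]         # confidence weights

x = [0.1] * n_users    # one latent feature per user
y = [0.1] * n_items    # one latent feature per item

def solve_user(u):
    # Closed-form weighted least-squares update for one user, items held fixed.
    num = sum(c[u][i] * p[u][i] * y[i] for i in range(n_items))
    den = sum(c[u][i] * y[i] * y[i] for i in range(n_items)) + lam
    return num / den

for _ in range(20):                          # alternate until stable
    for u in range(n_users):
        x[u] = solve_user(u)
    for i in range(n_items):
        num = sum(c[u][i] * p[u][i] * x[u] for u in range(n_users))
        den = sum(c[u][i] * x[u] * x[u] for u in range(n_users)) + lam
        y[i] = num / den

# Predicted preference for any (user, item) pair is simply x[u] * y[i].
scores = [[x[u] * y[i] for i in range(n_items)] for u in range(n_users)]

# Approximate real-time update: "fold in" a brand-new user with one click on
# item 1 by running the same closed-form user update once, touching nothing
# else -- the property the Serving Layer exploits between full rebuilds.
p.append([0.0, 1.0, 0.0])
c.append([1.0, 1.0 + alpha, 1.0])
x.append(solve_user(n_users))
```

Each user update is independent of the others (likewise for items), which is what makes the full computation embarrassingly parallel and a natural fit for Hadoop: the Computation Layer can shard users and items across mappers.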