Example: StackOverflow and Computation Layer
This example uses data from StackOverflow, a popular question-and-answer site for software developers. Developers ask and answer questions. To help find relevant questions, anyone may tag questions with short strings that indicate their topic. For example, a question related to recommenders in Apache Mahout might have tags like
StackOverflow publishes its entire database of questions. Here, Myrrix is used not to recommend questions to users (though this is possible). Instead, it is used to recommend tags to questions, to show how recommenders may be deployed when "users" aren't people, and "items" aren't objects like DVDs. For example, this might be useful in suggesting that a question tagged
heap might also be tagged
memory since Android is a Java-based platform and questions about Java's heap concern memory management.
Preparing the data
From the data dump, a simple CSV file consisting of "QuestionID,Tag" was extracted for all questions. This extract contains about 3.5M questions and 30,000 unique tags. This data was uploaded through the Serving Layer. Note that the tags are strings, not numeric IDs; the Serving Layer translates back and forth.
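As a rough illustration of the extract's shape, the sketch below reads "QuestionID,Tag" rows and counts distinct questions and tags. It is not the script used to build the real extract: the function name `summarize_pairs` and the sample question IDs are hypothetical, and only the two-column layout comes from the description above.

```python
import csv
import io


def summarize_pairs(lines):
    """Count distinct question IDs and distinct tags in QuestionID,Tag rows."""
    questions = set()
    tags = set()
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        question_id, tag = row
        questions.add(question_id)
        tags.add(tag)  # tags stay as strings; no numeric IDs are assigned here
    return len(questions), len(tags)


# Hypothetical sample: the question IDs are made up for illustration.
sample = io.StringIO(
    "1001,jquery\n"
    "1001,jquery-events\n"
    "1002,r\n"
    "1002,matrix\n"
    "1002,eigenvalue\n"
)
num_questions, num_tags = summarize_pairs(sample)
print(num_questions, num_tags)
```

Each question appears once per tag, so a question with several tags spans several rows. Run over the full extract, the two counts should come out near the 3.5M questions and 30,000 tags quoted above.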
Running the Computation
The Myrrix Computation Layer version 0.10 was run on this data set, using a 2-node Apache Hadoop 1.1.1 cluster. 50 features were used in the matrix factorization computation. Computation completed after 24 iterations in 5 hours. Because the 3.5M questions play the role of "users" here, this implies that model building can process about 350,000 users per hour per worker. The factorized model consumed 800MB of storage, compressed.
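The throughput figure can be checked with a short calculation. This sketch assumes that each of the 2 cluster nodes counts as one worker and that the "users" are the roughly 3.5M questions. The helper name `per_worker_throughput` is hypothetical.

```python
def per_worker_throughput(num_rows, hours, workers):
    """Rows processed per hour per worker, assuming work is split evenly."""
    return num_rows / (hours * workers)


questions = 3_500_000  # "about 3.5M questions" from the data extract
# Assumption: 2 Hadoop nodes are treated as 2 workers.
rate = per_worker_throughput(questions, hours=5, workers=2)
print(f"{rate:,.0f} questions per hour per worker")  # prints "350,000 questions per hour per worker"
```

The figure is an average over the whole 24-iteration run, so it says nothing about how long any single iteration took.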
jquery jquery-events jquery-live event-propagation
The top recommendations are:
This question concerns computing eigenvalues efficiently from a sparse matrix. It's tagged as:
r matrix linear-algebra sparse-matrix eigenvalue
Recommended tags are:
math is a good suggestion, but broad.
matlab is a tool, like R, for manipulating matrices, and is frequently used to compute eigenvalues and eigenvectors, so it is related even though it is not the topic of this question.
image-processing can often involve eigenvector analysis.
statistics is not irrelevant, but is only loosely related to this question.
eigenvector might have been expected here, but was scored 0.02639312, well below top recommendations. The model does predict that
curve-fitting are the most similar to
eigenvalue, but was not judged as relevant to this question given its other tags.
Cocos2D is a development environment for apps for the iPhone and iPad. It is tagged only as:
iphone uikit uiimage coregraphics uiimagepickercontroller
Recommended tags are:
ios is obviously relevant given that it is the iPhone OS. This concerns iPhone development, and
cocoa-touch are the language and framework for this. The other two are additional common UI classes, like those named in the question tags.