Example: StackOverflow and Computation Layer

This example uses data from StackOverflow, a popular question-and-answer site for software developers. Developers ask and answer questions. To help find relevant questions, anyone may tag questions with short strings that indicate their topic. For example, a question related to recommenders in Apache Mahout might have tags like mahout, java, and hadoop.

StackOverflow publishes its entire database of questions. Here, Myrrix is used not to recommend questions to users (though this is possible). Instead, it is used to recommend tags to questions, to show how recommenders may be deployed when "users" aren't people, and "items" aren't objects like DVDs. For example, this might be useful in suggesting that a question tagged android and heap might also be tagged java and memory since Android is a Java-based platform and questions about Java's heap concern memory management.

Preparing the data

From the data dump, a simple CSV file consisting of "QuestionID,Tag" was extracted for all questions. This extract contains about 3.5M questions and 30,000 unique tags. This data was uploaded through the Serving Layer. Note that the tags are strings, not numeric IDs; the Serving Layer translates back and forth.
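The extract format described above can be sketched as follows. This is a minimal illustration, not the script actually used: the function name write_question_tag_csv and the sample question IDs are hypothetical, and the real data dump would first have to be parsed into (question ID, tags) pairs.

```python
import csv
import io


def write_question_tag_csv(questions, out):
    """Write one "QuestionID,Tag" row per (question, tag) pair.

    `questions` is an iterable of (question_id, list_of_tags) pairs.
    Returns the number of rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    rows = 0
    for question_id, tags in questions:
        for tag in tags:
            writer.writerow([question_id, tag])
            rows += 1
    return rows


# Hypothetical sample data; the IDs are made up for illustration.
sample = [
    (1001, ["mahout", "java", "hadoop"]),
    (1002, ["android", "heap"]),
]

buffer = io.StringIO()
row_count = write_question_tag_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Each question contributes one row per tag, so a question with three tags appears on three lines. The tags stay as plain strings, matching the note that the Serving Layer handles the translation to numeric IDs.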

Running the Computation

The Myrrix Computation Layer version 0.10 was run on this data set, using a 2-node Apache Hadoop 1.1.1 cluster. 50 features were used in the matrix factorization computation. Computation completed after 24 iterations in 5 hours. Since questions play the role of "users" here, this implies that model building can process about 350,000 users per hour per worker (about 3.5M questions, over 5 hours, on 2 nodes). The factorized model consumed 800MB of storage, compressed.
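The throughput figure can be checked with a few lines of arithmetic. The sketch below assumes that each of the 2 Hadoop nodes counts as one worker and that questions are the "users"; both are readings of the text above rather than stated facts.

```python
# Figures taken from the text above.
questions = 3_500_000  # "about 3.5M questions"
hours = 5              # total computation time
workers = 2            # assumption: one worker per Hadoop node

per_hour_per_worker = questions / (hours * workers)
print(f"{per_hour_per_worker:,.0f} questions per hour per worker")
```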

Representative Results

Use jQuery to hide DIV when click outside it, but allow propagation of events

This is a question about jQuery, a popular JavaScript library. It is tagged as:

jquery jquery-events jquery-live event-propagation

The top recommendations are:

  • javascript (0.7070474)
  • jquery-ui (0.57925606)
  • ajax (0.5371972)
  • jquery-plugins (0.48303664)
  • javascript-selectors (0.47164282)

javascript is perhaps overly broad, but relevant. The jQuery-related tags are obviously useful suggestions. AJAX is somewhat relevant, being a technology related to JavaScript and used with jQuery.
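For intuition about where such scores come from, here is a minimal sketch of how a factorized model can rank candidate tags: each question and each tag has a feature vector, and a candidate's score is the dot product of the two. This is a generic illustration of matrix-factorization scoring, not Myrrix's actual implementation. The function recommend_tags is hypothetical, and the 3-feature vectors are made up (the run described here used 50 features).

```python
def recommend_tags(question_vector, tag_vectors, existing_tags, how_many=5):
    """Rank tags not already on the question by dot-product score."""
    scores = []
    for tag, vector in tag_vectors.items():
        if tag in existing_tags:
            continue  # never recommend a tag the question already has
        score = sum(q * t for q, t in zip(question_vector, vector))
        scores.append((tag, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:how_many]


# Made-up feature vectors, for illustration only.
tag_vectors = {
    "jquery": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.2, 0.1],
    "ajax": [0.5, 0.3, 0.0],
    "matlab": [0.0, 0.1, 0.9],
}
question_vector = [0.9, 0.2, 0.0]

top = recommend_tags(question_vector, tag_vectors, {"jquery"}, how_many=2)
for tag, score in top:
    print(tag, round(score, 2))
```

With these made-up vectors, javascript and ajax rank above matlab because their vectors point in roughly the same direction as the question's.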

How expensive is it to compute the eigenvalues of a matrix?

This question concerns computing eigenvalues efficiently from a sparse matrix. It's tagged as:

r matrix linear-algebra sparse-matrix eigenvalue

Recommended tags are:

  • matlab (0.837563)
  • math (0.63141507)
  • plot (0.5555035)
  • image-processing (0.5419019)
  • statistics (0.5233228)

math is a good suggestion, but broad. matlab is a tool, like R, for manipulating matrices and is frequently used to compute eigenvalues and eigenvectors, so it is also related even if it is not the topic of this question. image-processing can often involve eigenvector analysis. statistics is not irrelevant, but not very related to this question. eigenvector might have been expected here, but was scored 0.02639312, well below the top recommendations. The model does predict that eigenvector and curve-fitting are the tags most similar to eigenvalue, but eigenvector was not judged as relevant to this question given its other tags.
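The difference between "most similar to eigenvalue" and "recommended for this question" can be illustrated with a small sketch. It assumes cosine similarity between tag feature vectors as the similarity measure, which is an assumption and not something stated about Myrrix. The function names and the 2-feature vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(tag, tag_vectors, how_many=2):
    """Rank the other tags by similarity to the given tag."""
    target = tag_vectors[tag]
    others = [
        (other, cosine_similarity(target, vector))
        for other, vector in tag_vectors.items()
        if other != tag
    ]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:how_many]


# Made-up feature vectors, for illustration only.
tag_vectors = {
    "eigenvalue": [0.1, 0.9],
    "eigenvector": [0.2, 0.8],
    "matlab": [0.9, 0.3],
}

nearest = most_similar("eigenvalue", tag_vectors, how_many=1)
print(nearest[0][0])

# A question whose vector is shaped by all of its tags can still
# score a similar tag low if the combined vector points elsewhere.
question_vector = [0.9, 0.1]
for tag in ("eigenvector", "matlab"):
    score = sum(q * t for q, t in zip(question_vector, tag_vectors[tag]))
    print(tag, round(score, 2))
```

In this toy setup eigenvector is the nearest neighbor of eigenvalue, yet the question's own vector scores matlab higher, which mirrors the behavior described above.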

UIImagePickerController, UIImage, Memory and More?

Cocos2D is a development environment for apps for the iPhone and iPad. The question is tagged only as:

iphone uikit uiimage coregraphics uiimagepickercontroller

Recommended tags are:

  • ios (0.8096546)
  • uiview (0.78206486)
  • objective-c (0.7554695)
  • cocoa-touch (0.7511055)
  • uiimageview (0.6989627)

ios is obviously relevant given that it is the iPhone OS. The question concerns iPhone development, and objective-c and cocoa-touch are the language and framework used for it. The other two are additional common UI classes, like those named in the question tags.

Clustering

Continue to an example that illustrates how to cluster tags into logical groups with the Computation Layer.