Example: Wikipedia Links

Wikipedia, the popular online encyclopedia, provides another opportunity to demonstrate a recommender at scale, applied to unconventional input. Wikipedia articles link to other articles. These links can be viewed as associations or "votes" among articles. Considering articles to be both "users" and "items", we can recommend articles to articles. This could uncover articles that are related to a given article, and maybe ought to be linked, but are not.

Preparing the Data

The February 2013 dump for the entire English-language Wikipedia site was obtained from Wikimedia. Page links are given as a SQL script in enwiki-20130204-pagelinks.sql.gz and page titles in enwiki-20130204-page.sql.gz. From these sources, a simple CSV file containing source page ID and target page ID can be constructed for each link. For purposes of this example, only pages with 3 or more links in or out were kept. The result contains about 352M links from 8.5M articles to 5.3M other unique articles.
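The filtering step can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes the links have already been extracted from the SQL dumps as (source ID, target ID) pairs, that "3 or more links in or out" means a combined in-plus-out count, and that a link is kept only when both of its pages pass the threshold in a single pass. The names `filter_links` and `write_csv` are hypothetical.

```python
import csv
import io
from collections import Counter

MIN_LINKS = 3  # threshold stated in the text


def filter_links(links, min_links=MIN_LINKS):
    """Keep links whose source and target each have >= min_links links in or out.

    Assumption: degree is counted as in-links plus out-links, in one pass.
    """
    links = list(links)
    degree = Counter()
    for source, target in links:
        degree[source] += 1  # outgoing link for the source page
        degree[target] += 1  # incoming link for the target page
    return [
        (source, target)
        for source, target in links
        if degree[source] >= min_links and degree[target] >= min_links
    ]


def write_csv(links, out):
    """Write one 'source_id,target_id' row per link."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(links)


# Toy data: page 1 has 4 links, page 2 has 3, all other pages have fewer.
sample = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (5, 1)]
kept = filter_links(sample)

buffer = io.StringIO()
write_csv(kept, buffer)
```

At the scale of the real dump, the in-memory list would likely need to be replaced by two streaming passes over the file, one to count and one to filter.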

Running the Computation

The Myrrix Computation Layer 1.0 RC1 was run on Amazon Elastic MapReduce with 8 m1.xlarge workers. The model used 100 features, and model building completed in 19 hours after 18 iterations. (Subsequent incremental model updates would take only a couple of iterations; this is a complete build from scratch.) Using spot instances, the total cost of the computation was $31.85. This implies that an initial model build costs about $2.31 per million users or items, or about $0.09 per million links, with subsequent updates costing about 10% of that.
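The per-million figures can be reproduced with a few lines of arithmetic. The sketch below assumes that "users or items" means the 8.5M source articles plus the 5.3M target articles reported earlier; that reading is an interpretation, though it does match the stated $2.31.

```python
# Figures taken from the text above.
total_cost_usd = 31.85
source_articles_millions = 8.5   # articles acting as "users"
target_articles_millions = 5.3   # articles acting as "items"
links_millions = 352

# Assumption: "users or items" is the sum of source and target articles.
entities_millions = source_articles_millions + target_articles_millions

cost_per_million_entities = total_cost_usd / entities_millions
cost_per_million_links = total_cost_usd / links_millions

print(f"${cost_per_million_entities:.2f} per million users or items")
print(f"${cost_per_million_links:.2f} per million links")
```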

Representative Results

One interesting observation about these results is that the most relevant articles are almost surely already linked. The recommendations are articles not already linked, and tend to relate to the topic of linked articles, not to the article itself.
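To make the exclusion of already-linked articles concrete, here is a minimal sketch of how such a list could be produced from a factored model. It is not Myrrix's implementation: it assumes each article has a "user" feature vector and an "item" feature vector, scores candidates by dot product, and skips existing links. The function `recommend` and the toy vectors are hypothetical.

```python
def recommend(article, user_features, item_features, linked, how_many=5):
    """Rank articles not already linked from `article` by dot-product score."""
    query = user_features[article]
    already_linked = linked.get(article, set())
    scored = []
    for candidate, vector in item_features.items():
        if candidate in already_linked:
            continue  # existing links are never recommended
        score = sum(q * v for q, v in zip(query, vector))
        scored.append((candidate, score))
    # Highest score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:how_many]


# Toy model with 2 features per article (the real model used 100).
user_features = {"A": [1.0, 0.0]}
item_features = {
    "A": [0.9, 0.1],
    "B": [0.8, 0.0],
    "C": [0.1, 0.9],
    "D": [0.95, 0.0],
}
linked = {"A": {"D"}}  # A already links to D

top = recommend("A", user_features, item_features, linked, how_many=2)
top_names = [name for name, _ in top]
```

In this toy data, D would score highest but is skipped because it is already linked, and A itself is recommended because an article does not link to itself.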

Ulysses_S._Grant
  • Ulysses_S._Grant (0.94508547)
  • United_States_House_of_Representatives (0.9362344)
  • New_York's_Meatpacking_District (0.9140202)
  • Democratic_Party_(United_States) (0.90036565)
  • Speaker_of_the_United_States_House_of_Representatives (0.8991539)

Ulysses S. Grant was the 18th US president. Not surprisingly, the article itself appears as a top recommendation; it's not linked to itself! The House of Representatives and an article on its leadership role appear among the top recommended links. These seem broadly relevant, but Grant was never a representative. Grant was a Republican; the Democratic Party appears here too, as relevant to a US president but not directly applicable to Grant. The Republican Party page is of course already linked. All are closely related to the subject of Grant, though none directly describes Grant's career. The link to New York's Meatpacking District is less clear, though Grant did live in New York.

History_of_Indonesia
  • Bahá'í_Faith_by_continent (0.6337122)
  • South_Asia (0.5999026)
  • Bhutan (0.59951043)
  • Laos (0.5987551)
  • Maldives (0.5853035)

The Maldives seem relevant, being another Islamic island nation. South Asia is relevant given its proximity to Indonesia. Laos is another nation in Southeast Asia; it is not already linked, perhaps because it has little direct relevance to Indonesia's history, but it is nevertheless a nation mentioned alongside countries that are linked. The same applies to Bhutan. The link to the Bahá'í religion is interesting because, while it has only a small following, it is notably allowed in Indonesia even though the country's dominant religion is Islam.

Eurovision_Song_Contest
  • Austria (0.7137562)
  • Greece (0.6876324)
  • Hungary (0.6846089)
  • Ukraine (0.67851347)
  • Soviet_Union (0.65409476)

The Eurovision Song Contest is a pop music competition featuring acts from many European countries, and a source of political controversy and national pride. Not surprisingly, the recommendations are European countries, but ones less related to the contest's history; indeed, most of the existing links on the page are to the European countries with the most wins and participation. The recommendation list includes countries that never participated or never won (Hungary, Soviet Union), or that have won only once.

Justinian_I
  • Roman_Republic (0.9625839)
  • Roman_Emperor (0.86358315)
  • Alexander_the_Great (0.84353495)
  • Justinian_I (0.8303596)
  • Julius_Caesar (0.8261316)

Justinian was a 6th-century emperor of the Eastern Roman Empire, or Byzantine Empire, at a time when the Western Roman Empire had declined significantly. "Roman Emperor" is not linked directly, perhaps because the term typically refers to Western Roman emperors. It is not surprising to see him linked to another (Western Roman) emperor and to the Roman Republic. Both are famous and obviously related, but they predated Justinian and were not directly relevant to his life. Alexander the Great is another famous figure of antiquity, although he also predated Justinian by centuries and so has little direct relation to his reign. Again, the page itself appears as a top link.