HITS algorithm
Hyperlink-Induced Topic Search (HITS) (also known as Hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It was a precursor to PageRank. The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that it held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs.The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.
History
In Journals
In the past, many methods were used to rank the importance of scientific journals. One such method was Garfield's impact factor. However, many journals such as Science and Nature are filled with numerous citations, making these magazines have very high impact factors. Thus, if we compare two more obscure journals which have received roughly the same number of citations but one of these journals has received many citations from Science and Nature, we want this journal to be ranked higher. In other words, it is better to receive citations from an important journal than from an unimportant one.
On the Web
This phenomenon also occurs in the Internet. Counting the number of links to a page can give us a general estimate of its prominence on the Web, but a page with very few incoming links may also be prominent, if two of these links come from the home pages of Yahoo! or Google or MSN. Thus, because these sites are of very high importance but are also Search Engines, there can be very irrelevant results.
Algorithm
In the HITS algorithm, the first step is to retrieve the set of results to the search query. The computation is performed only on this result set, not across all Web pages.
Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked pages.
The algorithm performs a series of iterations, each consisting of two basic steps:
Authority Update: Update each node's Authority score to be equal to the sum of the Hub Scores of each node that points to it. That is, a node is given a high authority score by being linked to by pages that are recognized as Hubs for information.
Hub Update: Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.
The Hub score and Authority score for a node is calculated with the following algorithm:
Start with each node having a hub score and authority score of 1.
Run the Authority Update Rule
Run the Hub Update Rule
Normalize the values by dividing each Hub score by the sum of the squares of all Hub scores, and dividing each Authority score by the sum of the squares of all Authority scores.
Repeat from the second step as necessary.
HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web. However it does have some major differences:
It is executed at query time, not at indexing time, with the associated hit on performance that accompanies query-time processing. Thus, the hub and authority scores assigned to a page are query-specific.
It is not commonly used by search engines. (Though a similar algorithm was said to be used by Teoma, which was acquired by Ask.com.)
It computes two scores per document, hub and authority, as opposed to a single score.
It is processed on a small subset of ‘relevant’ documents, not all documents as was the case with PageRank.
For more details - http://en.wikipedia.org/wiki/HITS_algorithm
0 komentar:
Post a Comment