Web basics

The Web and Other Networks

Documents on the web are linked by hyperlinks.
Academic papers are linked by citations and co-authorship relations.
Legal documents are linked by citations.
Online users are linked by interactions or formal social network ties.

Nodes and Edges

Nodes represent entities (e.g. documents, people).
Edges represent relationships between nodes (e.g. hyperlinks, co-authorship relations).
Edges can be directed (i.e. point in a specific direction).

Degree of a Node

• Degree: the number of edges connected to a node.
• In-degree: the number of edges pointing to a node.
• Out-degree: the number of edges leaving a node.
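
The three definitions above can be sketched in a few lines of Python. The edge list and the function name `degrees` are made up for illustration; each pair (u, v) is read as a directed edge from u to v.

```python
from collections import defaultdict

# Hypothetical toy web graph: each pair (u, v) is a directed edge u -> v.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

def degrees(edge_list):
    """Return (in_degree, out_degree, degree) dicts for a directed edge list."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edge_list:
        out_deg[src] += 1   # edge leaves src
        in_deg[dst] += 1    # edge points to dst
    nodes = set(in_deg) | set(out_deg)
    # Total degree counts every edge touching the node, in either direction.
    total = {n: in_deg[n] + out_deg[n] for n in nodes}
    return dict(in_deg), dict(out_deg), total

in_deg, out_deg, total = degrees(edges)
print(in_deg["C"], out_deg["A"], total["C"])
```

In this toy graph, page C is pointed to by A, B, and D, so it has the highest in-degree.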

Link Analysis

• Link analysis uses information about the structure of the web graph to aid search.
• It is one of the major innovations in web search.
• It was one of the primary reasons for Google’s initial success.

Bibliometrics: Citation Analysis

• Many documents include bibliographies with citations to other, previously published documents.
• Using citations as edges, a collection of documents can be viewed as a graph.
• The structure of this graph can provide interesting information about the similarity of documents and the structure of information, even when document content is ignored.

Bibliographic Coupling

• A measure of document similarity introduced by Kessler in 1963.
• The bibliographic coupling of two documents A and B is the number of documents cited by both A and B.
• Equivalently, it is the size of the intersection of their bibliographies.
• It may be worth normalizing by the size of the bibliographies.
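
A minimal sketch of bibliographic coupling as a set intersection. The bibliographies below are invented, and the normalization shown (dividing by the size of the union of the two bibliographies) is only one possible choice; the notes leave the normalization open.

```python
# Hypothetical bibliographies: each document maps to the set of documents it cites.
cites = {
    "A": {"P1", "P2", "P3"},
    "B": {"P2", "P3", "P4", "P5"},
}

def bibliographic_coupling(cites, a, b):
    """Number of documents cited by both a and b."""
    return len(cites[a] & cites[b])

def normalized_coupling(cites, a, b):
    """Coupling divided by the size of the combined bibliography (an assumed choice)."""
    union = cites[a] | cites[b]
    return len(cites[a] & cites[b]) / len(union) if union else 0.0

print(bibliographic_coupling(cites, "A", "B"))
print(normalized_coupling(cites, "A", "B"))
```

Here A and B share two references (P2 and P3) out of five distinct references overall.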

Co-Citation

• An alternative citation-based measure of similarity, introduced by Small in 1973.
• The co-citation of two documents A and B is the number of documents that cite both A and B.
• It may be worth normalizing by the total number of documents citing either A or B.
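
Co-citation looks at the graph from the other direction: instead of comparing what A and B cite, it compares who cites them. The sketch below uses an invented citation table and helper names (`cited_by`, `co_citation`), and normalizes by the number of documents citing either A or B, as suggested above.

```python
# Hypothetical citation data: each citing document maps to the set it cites.
cites = {
    "X": {"A", "B"},
    "Y": {"A", "B", "C"},
    "Z": {"A"},
    "W": {"C"},
}

def cited_by(cites, target):
    """Set of documents whose bibliography contains target."""
    return {doc for doc, refs in cites.items() if target in refs}

def co_citation(cites, a, b):
    """Number of documents that cite both a and b."""
    return len(cited_by(cites, a) & cited_by(cites, b))

def normalized_co_citation(cites, a, b):
    """Co-citation divided by the number of documents citing either a or b."""
    either = cited_by(cites, a) | cited_by(cites, b)
    return co_citation(cites, a, b) / len(either) if either else 0.0

print(co_citation(cites, "A", "B"))
print(normalized_co_citation(cites, "A", "B"))
```

In this example X and Y cite both A and B, while Z cites only A, so two of the three citing documents co-cite the pair.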

Citations vs. Links

• Web links differ from citations in several ways:
– Many links are navigational.
– Many pages with high in-degree are portals, not content providers.
– Not all links are endorsements.
– Company websites don’t point to their competitors.
– Citation of relevant literature is enforced by peer review; nothing comparable exists for web links.

Authorities

Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (the number of links pointing to a page) is one simple measure of authority.
• However, in-degree treats all links as equal.
• Should links from pages that are themselves authoritative count more?

Hubs

Hubs are index pages that provide lots of useful links to relevant content pages (authorities).

HITS

Determines hubs and authorities for a particular topic through analysis of a relevant subgraph.
• Algorithm developed by Kleinberg in 1998.
• Based on mutually recursive assumptions:
– Hubs point to lots of authorities.
– Authorities are pointed to by lots of hubs.
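
In the usual formulation of HITS, these two assumptions become a pair of update rules. The symbols are notation chosen here, not taken from these notes: a(p) and h(p) are the authority and hub scores of page p, and q → p means that page q links to page p.

```latex
a(p) = \sum_{q \,:\, q \to p} h(q), \qquad
h(p) = \sum_{q \,:\, p \to q} a(q)
```

The two rules are applied alternately, with the scores rescaled after each round, until they stop changing.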

HITS Algorithm

Computes hub and authority scores for documents on a particular topic.
• The topic is specified by a query.
• Relevant pages from the query are used to construct a base subgraph S.
• The link structure of the pages in S is analyzed to find authority and hub pages.
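
The last step can be sketched as an iterative computation. This is a minimal illustration, not a full implementation. It assumes the base subgraph S has already been built and is given as a dictionary from each page to the set of pages it links to. It also assumes the usual update rule, in which a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The page names, the function name `hits`, and the fixed iteration count are all invented for the example.

```python
import math

# Hypothetical base subgraph S: page -> set of pages it links to.
S = {
    "hub1": {"auth1", "auth2"},
    "hub2": {"auth1", "auth2", "auth3"},
    "auth1": set(),
    "auth2": set(),
    "auth3": {"auth1"},
}

def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a dict-of-sets link graph."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the authority scores of the pages each page links to.
        hub = {n: sum(auth[d] for d in graph.get(n, ())) for n in nodes}
        # Rescale both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

hub, auth = hits(S)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy subgraph, auth1 receives links from every page that links anywhere, so it ends up with the top authority score, and hub2, which links to all three content pages, ends up with the top hub score.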