Basic structure

Text is a sequence of words, often simplified to a bag of words that ignores word order.
Words are also called terms.

A sequence of words is aggregated into a document.
A set of documents is aggregated into a collection, or corpus.
A document can be described by a set of representative keywords called index terms.

  • These index terms take the role of attributes in Information Retrieval.
  • They can be binary-valued attributes that record the absence or presence of a term in a document.
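As a minimal sketch of binary-valued index terms, the snippet below maps each document to a 0/1 vector over the vocabulary. The toy corpus, the whitespace tokenizer, and the function name binary_attributes are assumptions made for illustration, not part of the original notes.

```python
# Hypothetical toy corpus; document names and texts are illustrative only.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "documents contain index terms",
}

def binary_attributes(docs):
    """Return the sorted vocabulary and one 0/1 vector per document.

    A 1 means the index term is present in the document, a 0 means absent.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokenized = {name: set(text.lower().split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*tokenized.values()))
    vectors = {
        name: [1 if term in terms else 0 for term in vocabulary]
        for name, terms in tokenized.items()
    }
    return vocabulary, vectors

vocabulary, vectors = binary_attributes(docs)
for name, vector in vectors.items():
    print(name, vector)
```

Each position in a vector corresponds to one entry of vocabulary, so the term "documents" is marked 1 in both vectors while "index" is marked 1 only for d2.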

However, index terms vary in how relevant they are for describing a document's contents.

  • This effect is captured by assigning a numerical weight to each index term of a document (for example, term frequency or TF-IDF).
  • These weights take the role of attribute values.
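The weighting idea can be sketched with one common TF-IDF variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Many other variants exist (smoothed, normalized, and so on); the function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = raw term count * ln(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical two-document corpus.
sample_docs = ["apple banana apple", "banana cherry"]
sample_weights = tf_idf(sample_docs)
print(sample_weights)
```

In this sample, "banana" occurs in every document and so receives weight 0, while "apple" occurs twice in only one document and receives the largest weight. This is the sense in which the weights express how well a term discriminates between documents.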


Basic Process

(1) Select index terms.
(2) Build an index (high-dimensional term and document frequency matrices).
(3) Match the query to the index to retrieve the best answers, typically using one of the following:

  • Boolean model
  • Vector space model, or
  • Probabilistic model (categories are modeled by probability distributions; the model finds the likelihood that a document belongs to a certain category, similar to Bayesian classification)
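The three steps can be sketched end to end for the first two models. This is a simplified illustration under stated assumptions: every whitespace-separated token is treated as an index term, the index is a plain inverted index rather than a full frequency matrix, Boolean matching is restricted to AND queries, and the vector space model uses raw term counts with cosine similarity. The corpus and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus.
docs = {
    "d1": "boolean retrieval matches query terms exactly",
    "d2": "vector space retrieval ranks documents by similarity",
    "d3": "probabilistic models estimate relevance",
}

def tokenize(text):
    # Step 1 (assumed): every lowercase token is an index term.
    return text.lower().split()

def build_index(docs):
    # Step 2: inverted index mapping each term to the documents containing it.
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index

def boolean_and(index, query):
    # Step 3, Boolean model: documents that contain ALL query terms.
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(docs, query):
    # Step 3, vector space model: order documents by similarity to the query.
    q = Counter(tokenize(query))
    scores = {name: cosine(q, Counter(tokenize(text))) for name, text in docs.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))

index = build_index(docs)
print(boolean_and(index, "retrieval ranks"))
print(rank(docs, "vector space retrieval"))
```

The Boolean model returns an unordered set of exact matches, whereas the vector space model returns every document in order of similarity, so partial matches such as d1 still appear in the ranking.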