Text Data Analysis and Information Retrieval - building an index - Machine Learning

An index needs to link the (preprocessed) words in the document collection to the documents in which they occur. This is typically done with a term-document matrix or a similar inverted index. A signature file is an alternative approach.
An inverted index is well-suited to parallel computation using methods like MapReduce over distributed file systems.
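To illustrate why index construction parallelises well, here is a minimal single-process sketch of the map, shuffle and reduce steps. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the two toy documents are assumptions made for illustration; a real MapReduce framework would run the mappers and reducers on separate machines and handle the grouping itself.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy collection, used only for this sketch.
docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}

def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per token of one document."""
    return [(term, doc_id) for term in text.lower().split()]

def shuffle(pairs):
    """Group emitted doc ids by term, as a framework would between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(term, doc_ids):
    """Reducer: collapse the doc ids for one term into a sorted postings list."""
    return term, sorted(set(doc_ids))

# Each document is mapped independently, so this step could run in parallel.
mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())

# Each term is reduced independently, so this step could run in parallel too.
mr_index = dict(reduce_phase(t, ids) for t, ids in shuffle(mapped).items())

print(mr_index["sat"])  # ['D1', 'D2']
print(mr_index["cat"])  # ['D1']
```

The point of the sketch is that no mapper needs to see more than one document and no reducer needs to see more than one term, which is what makes the work easy to distribute.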
Example
Let us build a term-document matrix for documents D1 and D2 (term x document, with an aggregate document count also shown here)

... and eventually finish the term-document matrix with
Note that each column of the term-document matrix gives the frequency of each word in one document; that column is the document's term vector, or feature vector.
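As a rough sketch of how such a matrix can be assembled, the following builds a term × document count matrix and reads off one column as a term vector. The two documents are hypothetical stand-ins, not the D1 and D2 of the example above, and the aggregate column is assumed here to mean the number of documents containing each term.

```python
from collections import Counter

# Hypothetical toy documents (assumed for illustration only).
docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}

def tokenize(text):
    """Very simple preprocessing: lowercase and split on whitespace."""
    return text.lower().split()

# Per-document term counts.
counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

# A sorted vocabulary and document list give a stable row and column order.
vocab = sorted(set().union(*counts.values()))
doc_ids = sorted(docs)

# Rows are terms, columns are documents (term x document).
matrix = [[counts[d][t] for d in doc_ids] for t in vocab]

# Aggregate column (assumption): number of documents containing each term.
doc_freq = [sum(1 for c in row if c > 0) for row in matrix]

def term_vector(doc_id):
    """Return the matrix column for one document, i.e. its term vector."""
    j = doc_ids.index(doc_id)
    return [row[j] for row in matrix]

for term, row, df in zip(vocab, matrix, doc_freq):
    print(f"{term:>4} {row} df={df}")

print(term_vector("D1"))  # [1, 0, 0, 1, 1, 1, 2]
```

The vector for the first toy document follows the vocabulary order `cat, dog, log, mat, on, sat, the`, so the trailing 2 is the count of "the".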
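In practice, for large document collections (such as the Web), this may instead be represented as a much more compact inverted index, which stores for each term only the documents in which it occurs rather than a mostly zero row of the matrix.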
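A minimal sketch of one possible inverted index layout, assuming two hypothetical toy documents rather than the D1 and D2 above: each term maps to a postings dictionary of `{doc_id: term frequency}`, so zero entries are never stored. The helper name `build_inverted_index` is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical toy documents (assumed for illustration only).
docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}

def build_inverted_index(docs):
    """Map each term to its postings: {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

inverted = build_inverted_index(docs)

for term in sorted(inverted):
    print(term, inverted[term])

# Looking up a term returns only the documents that contain it.
print(inverted["cat"])  # {'D1': 1}
print(inverted["the"])  # {'D1': 2, 'D2': 2}
```

With only two documents the saving is negligible, but in a large collection most terms occur in a tiny fraction of documents, so the postings are far shorter than a full matrix row.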