An index links the (preprocessed) words in the document collection to the documents in which they occur. It is typically a term-document matrix or, more compactly, an inverted index. A signature file is an alternative approach.
    An inverted index is well suited to parallel computation using methods like MapReduce over distributed file systems.
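As a rough illustration of why the construction parallelises, the sketch below simulates the map, shuffle and reduce phases in a single Python process. It is not a real MapReduce job: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the two sample documents are hypothetical, chosen only for this example, and tokenisation is assumed to be a simple lowercase whitespace split.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample documents, invented for illustration only.
docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}

def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per token in a single document."""
    return [(term, doc_id) for term in text.lower().split()]

def shuffle(pairs):
    """Group emitted doc ids by term, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(term, doc_ids):
    """Reducer: collapse the doc ids for one term into a sorted, de-duplicated postings list."""
    return term, sorted(set(doc_ids))

# Run the three phases sequentially; in a real cluster the mapper calls
# (one per document) and reducer calls (one per term) could run in parallel.
mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
index = dict(reduce_phase(term, ids) for term, ids in shuffle(mapped).items())
```

Each mapper call depends only on its own document and each reducer call only on its own term, which is what allows the work to be spread across machines.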
    Example
    Let us build a term-document matrix for documents D1 and D2 (term × document, with an aggregate document count also shown):

    [Figures: term-document matrix for D1 and D2 under construction]
    ... and eventually finish the term-document matrix with
    [Figure: completed term-document matrix]
    Note that each column of the term-document matrix gives the frequency of each term in one document. That column is the term vector, or feature vector, for that document.
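A minimal sketch of this construction follows. The documents are hypothetical stand-ins (the contents of D1 and D2 in the figures are not reproduced here), tokenisation is assumed to be a lowercase whitespace split, and the "aggregate document count" is interpreted as the number of documents containing each term, which is an assumption. The helper name `term_document_matrix` is likewise illustrative.

```python
from collections import Counter

# Hypothetical documents, invented for illustration only;
# they are not the D1 and D2 shown in the original figures.
docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}

def term_document_matrix(docs):
    """Return (terms, doc_ids, matrix); matrix[i][j] is the count of terms[i] in doc_ids[j]."""
    doc_ids = sorted(docs)
    counts = {d: Counter(docs[d].lower().split()) for d in doc_ids}
    terms = sorted(set().union(*counts.values()))
    # A Counter returns 0 for missing keys, so absent terms become zero entries.
    matrix = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix

terms, doc_ids, matrix = term_document_matrix(docs)

# Column j of the matrix is the term vector (feature vector) of doc_ids[j].
term_vectors = {d: [row[j] for row in matrix] for j, d in enumerate(doc_ids)}

# Assumed meaning of the aggregate document count:
# how many documents contain each term at least once.
doc_count = {t: sum(1 for c in row if c > 0) for t, row in zip(terms, matrix)}

for term, row in zip(terms, matrix):
    print(f"{term:<5} {row} docs={doc_count[term]}")
```

Rows are terms and columns are documents, so reading down a column gives that document's term vector.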
    In practice, for large document collections (such as the Web), this may instead be represented as a much more compact inverted index:
    [Figure: inverted index for D1 and D2]
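The inverted index can be sketched as a mapping from each term to a postings list holding only the documents in which the term occurs. The code below is an illustrative sketch under assumptions: the two documents are hypothetical (not the D1 and D2 of the figure), tokens are produced by a lowercase whitespace split, and the names `build_inverted_index` and `lookup` are made up for this example.

```python
from collections import defaultdict

# Hypothetical documents, invented for illustration only.
docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the log",
}

def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in sorted(index.items())}

def lookup(index, term):
    """Return the ids of the documents that contain the term (empty list if none)."""
    return [doc_id for doc_id, _ in index.get(term.lower(), [])]

index = build_inverted_index(docs)
```

Unlike the full matrix, this structure stores no zero entries, which is where the space saving comes from when most terms occur in only a small fraction of the documents. For example, `lookup(index, "sat")` returns both document ids, while a term that appears nowhere returns an empty list.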