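Suppose we build an inverted index over the following two pieces of text:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
First, tokenize the text and build an initial inverted index: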
| word | doc1 | doc2 |
| ---- | ---- | ---- |
| I | * | * |
| really | * | |
| liked | * | * |
| my | * | * |
| small | * | |
| dogs | * | * |
| and | * | |
| think | * | |
| mom | * | * |
| also | * | |
| them | * | |
| He | | * |
| never | | * |
| any | | * |
| so | | * |
| hope | | * |
| that | | * |
| will | | * |
| not | | * |
| expect | | * |
| me | | * |
| to | | * |
| him | | * |
This is the simplest possible way to build an inverted index: each word maps to the set of documents that contain it.
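The construction can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the names `docs`, `tokenize`, and `build_index` are hypothetical, and the tokenizer is assumed to simply split on anything that is not a letter.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Assumption: a token is any run of ASCII letters; no normalization yet.
    return re.findall(r"[A-Za-z]+", text)

def build_index(docs):
    # Map each word to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

index = build_index(docs)
print(sorted(index["liked"]))  # ['doc1', 'doc2']
print(sorted(index["small"]))  # ['doc1']
print(sorted(index["him"]))    # ['doc2']
```

Each key of `index` corresponds to one row of the table, and the set of document ids corresponds to the starred columns.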
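Searching this index for `mother like little dog` cannot return any results.
The search text is split into the following terms, and none of them appears in the index in exactly that form: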
mother
like
little
dog
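A quick check makes the miss concrete. The sketch below rebuilds the raw index (exact word forms only) and looks up each query term; the variable names are illustrative and the letter-run tokenizer is an assumption.

```python
import re

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Raw index: exact surface forms only, no normalization.
index = {}
for doc_id, text in docs.items():
    for token in re.findall(r"[A-Za-z]+", text):
        index.setdefault(token, set()).add(doc_id)

query = "mother like little dog"
for term in query.split():
    # Every lookup comes back empty: the index holds mom, liked, small, dogs.
    print(term, sorted(index.get(term, set())))
```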
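Is this the search result we want? Definitely not. From a reader's point of view, mother and mom are synonyms. like and liked are the same verb, one in the present tense and one in the past tense. little and small are synonyms as well. dog and dogs are the same noun, one singular and one plural.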
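normalization: converting tense, singular and plural forms, synonyms, and letter case.
When the inverted index is built, a normalization step is applied to each token produced by tokenization. This raises the chance that a later search finds the related documents.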
mother --> mom
liked --> like
small --> little
dogs --> dog
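As a rough sketch, normalization can be modeled as a lookup applied to every token. The `NORMALIZATION` table and the `normalize` function below are hypothetical and hard-coded for this example only; a real analyzer would typically use stemming and synonym rules rather than a hand-written dictionary. The synonym pair is mapped as mother to mom here, since the index table and the query normalization later in this text both keep mom as the stored term.

```python
# Toy lookup table: an assumption for illustration, covering only this example.
NORMALIZATION = {
    "liked": "like",    # tense
    "dogs": "dog",      # plural to singular
    "small": "little",  # synonym
    "mother": "mom",    # synonym
}

def normalize(token):
    # Tokens without a rule pass through unchanged.
    return NORMALIZATION.get(token, token)

print([normalize(t) for t in ["mother", "liked", "small", "dogs", "really"]])
# ['mom', 'like', 'little', 'dog', 'really']
```

What matters is that index time and query time apply the same mapping, so that both sides end up with identical terms.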
Rebuild the inverted index with normalization applied. Searching again for `mother like little dog` now finds the documents:
| word | doc1 | doc2 | after normalization |
| ---- | ---- | ---- | ---- |
| I | * | * | |
| really | * | | |
| like | * | * | liked --> like |
| my | * | * | |
| little | * | | small --> little |
| dog | * | * | dogs --> dog |
| and | * | | |
| think | * | | |
| mom | * | * | |
| also | * | | |
| them | * | | |
| He | | * | |
| never | | * | |
| any | | * | |
| so | | * | |
| hope | | * | |
| that | | * | |
| will | | * | |
| not | | * | |
| expect | | * | |
| me | | * | |
| to | | * | |
| him | | * | |
When searching for `mother like little dog`, the search text is tokenized too, and the same normalization is applied to it:
mother --> mom
like --> like
little --> little
dog --> dog
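Putting the pieces together, the sketch below applies one `analyze` function (tokenize, then normalize) to both the documents and the query. All names are hypothetical, the lookup table is a toy covering only this example, and the matching rule is an assumption: a document is a hit if it contains at least one normalized query term, reported with the number of terms it matched.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Toy normalization rules (assumption, hard-coded for this example).
NORMALIZATION = {"liked": "like", "dogs": "dog", "small": "little", "mother": "mom"}

def analyze(text):
    # Tokenize on letter runs, then normalize each token.
    return [NORMALIZATION.get(t, t) for t in re.findall(r"[A-Za-z]+", text)]

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Count how many distinct query terms each document contains.
    hits = defaultdict(int)
    for term in set(analyze(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return dict(hits)

index = build_index(DOCS)
print(analyze("mother like little dog"))  # ['mom', 'like', 'little', 'dog']
print(sorted(search(index, "mother like little dog").items()))
# [('doc1', 4), ('doc2', 3)]
```

doc1 matches all four normalized terms, while doc2 matches mom, like, and dog but not little.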