5.0.1 Part-of-speech tagging
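library(jiebaR)  # provides worker(), tagging(), segment(), and the other functions used below
一段文本 = "我爱北京天安门"
标记器 = worker("tag")  # worker configured for part-of-speech tagging
结果 = tagging(一段文本, 标记器)
print(结果)
#>      r      v     ns       ns
#>   "我"   "爱" "北京" "天安门"
names(tagging(一段文本, 标记器))  # the tags are stored as the names of the result
#> [1] "r"  "v"  "ns" "ns"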
Tagging text that has already been segmented:
分词器 = worker()  # default worker: plain segmentation
分词结果 = segment(一段文本, 分词器)
分词结果
#> [1] "我"     "爱"     "北京"   "天安门"
vector_tag(分词结果, 标记器)  # tag an already-segmented character vector
#>      r      v     ns       ns
#>   "我"   "爱" "北京" "天安门"
5.0.2 Keyword extraction
The topn argument controls how many keywords are extracted.
提取器 = worker("keywords", topn = 1)
keywords("我爱北京天安门", 提取器)
#>   8.9954
#> "天安门"
Extracting keywords from text that has already been segmented:
分词器 = worker()
分词结果 = segment(一段文本, 分词器)
分词结果
#> [1] "我"     "爱"     "北京"   "天安门"
vector_keywords(分词结果, 提取器)
#>   8.9954
#> "天安门"
5.0.3 Simhash and Hamming distance
摘要器 = worker("simhash", topn=2)
simhash("江州市长江大桥参加了长江大桥的通车仪式", 摘要器)
#> $simhash
#> [1] "12882166450308878002"
#>
#> $keyword
#>    22.3853    8.69667
#> "长江大桥"     "江州"
distance("hello world!", "江州市长江大桥参加了长江大桥的通车仪式", 摘要器)
#> $distance
#> [1] 23
#>
#> $lhs
#> 11.7392 11.7392
#> "hello" "world"
#>
#> $rhs
#>    22.3853    8.69667
#> "长江大桥"     "江州"
vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"), 摘要器)
#> $simhash
#> [1] "12098690169796312660"
#>
#> $keyword
#> 6.45994 6.18823
#>  "天气"  "不错"
vector_distance(c("今天","天气","真的","十分","不错","的","感觉"), c("今天","天气","真的","十分","不错","的","感觉"), 摘要器)
#> $distance
#> [1] 0
#>
#> $lhs
#> 6.45994 6.18823
#>  "天气"  "不错"
#>
#> $rhs
#> 6.45994 6.18823
#>  "天气"  "不错"
5.0.4 tobin converts a Simhash value to its binary representation.
tobin("12098690169796312660")
#> [1] "1010011111100111001011101001101110011010001110000011111001010100"
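As a rough illustration of what a Hamming distance between two fingerprints means, the base R sketch below counts the differing bit positions in two binary strings of the kind tobin() returns. The helper name hamming_bits is hypothetical and not part of jiebaR, and the sketch makes no claim about how jiebaR computes the distance internally.

```r
# Hypothetical helper (not a jiebaR function): count the positions at which
# two equal-length binary strings differ.
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  bits_a <- strsplit(a, "")[[1]]
  bits_b <- strsplit(b, "")[[1]]
  sum(bits_a != bits_b)
}

# The 64-bit string shown in the tobin() example above
fingerprint <- "1010011111100111001011101001101110011010001110000011111001010100"
# Flip the last bit (0 -> 1) so the two strings differ in exactly one position
flipped <- paste0(substr(fingerprint, 1, 63), "1")

hamming_bits(fingerprint, fingerprint)
#> [1] 0
hamming_bits(fingerprint, flipped)
#> [1] 1
```

Identical inputs give a distance of 0, which matches the vector_distance() example where the same token vector is passed on both sides.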
5.0.5 Word frequency counts with freq()
freq(c("测试", "测试", "文本"))
#>   char freq
#> 1 文本    1
#> 2 测试    2
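For comparison, a similar tally can be sketched in base R with table(). This is only an illustration: the column names are chosen to mirror the output above, and the row order may differ from what freq() returns.

```r
words <- c("测试", "测试", "文本")

# table() counts how often each distinct value occurs
counts <- table(words)

# Reshape into a data frame with the same column names as the freq() output
result <- data.frame(
  char = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
result
```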
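5.0.6 Generating an IDF file with get_idf()
get_idf() computes IDF values from the segmented terms of multiple documents. The input is a list of character vectors, where each vector represents one document. A custom stop word list can be supplied.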
临时输出目录 = tempfile()  # path of a temporary file that receives the IDF table
a_big_list = list(c("测试","一下"), c("测试"))
get_idf(a_big_list, stop = jiebaR::STOPPATH, path = 临时输出目录)
readLines(临时输出目录)
#> [1] "一下 0.693147180559945" "测试 0"
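The two values in this output are consistent with the common definition idf = log(N / df) using the natural logarithm, where N is the number of documents and df is the number of documents containing the term: 一下 appears in 1 of 2 documents, giving log(2/1) ≈ 0.693, and 测试 appears in both, giving log(2/2) = 0. The base R sketch below reproduces those numbers under that assumption. It ignores stop word filtering and is inferred from the printed output, so it should not be read as jiebaR's actual implementation.

```r
docs <- list(c("测试", "一下"), c("测试"))
n_docs <- length(docs)

# Document frequency: in how many documents does each term occur?
terms <- unique(unlist(docs))
doc_freq <- vapply(terms, function(term) {
  sum(vapply(docs, function(doc) term %in% doc, logical(1)))
}, numeric(1))

# Assumed formula: idf = log(N / df), natural logarithm
idf <- log(n_docs / doc_freq)
idf  # 测试: 0, 一下: log(2), about 0.693
```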
