High-Frequency Word Extraction - Natural Language Processing (NLP)


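High-frequency words are words that occur often in a document and are not meaningless filler; to some extent they indicate what the document focuses on.
High-frequency word extraction is essentially the TF (term frequency) strategy in natural language processing. The main sources of noise are:
1> Punctuation: punctuation marks generally carry no value and should be removed.
2> Stop words: common words such as "的", "是", and "了" carry no meaning and should also be removed.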
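
The two kinds of noise listed above can be filtered from a token list that has already been segmented. The sketch below is only an illustration, not the article's code: the helper names `is_punctuation` and `clean_tokens`, the tiny stop-word set, and the sample tokens are all assumptions made for the example. It treats a token as punctuation when every character belongs to a Unicode punctuation or symbol category.

```python
import unicodedata

# Tiny illustrative stop-word set; a real run would load a full stop-word file.
STOP_WORDS = {"的", "是", "了"}


def is_punctuation(token):
    """Return True if every character is Unicode punctuation (P*) or a symbol (S*)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in token
    )


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop whitespace-only tokens, punctuation, and stop words; keep order."""
    return [
        t for t in tokens
        if t.strip() and not is_punctuation(t) and t not in stop_words
    ]


# Hand-segmented sample tokens (hypothetical, not taken from news.txt).
sample = ["今天", "的", "天气", "是", "晴天", "，", "适合", "出门", "。"]
print(clean_tokens(sample))  # ['今天', '天气', '晴天', '适合', '出门']
```

Checking punctuation by Unicode category avoids maintaining a hand-written list of Chinese and ASCII punctuation marks, at the cost of also dropping symbol-only tokens such as "+" or "￥".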

Article to analyze: news.txt

Common stop-word file: baidu_stopwords.txt

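```python
# High-frequency word extraction
import jieba


# Read the data
def get_content(path):
    # Read the text to analyze and join its lines into one string
    # (str() on the list would add brackets, quotes and commas as tokens)
    with open(path, "r", encoding="gbk") as f:
        content = "".join(line.strip() for line in f)
    print(content)
    # Read the common stop-word file; a set makes membership tests fast
    with open("baidu_stopwords.txt", "r", encoding="utf-8") as ff:
        stopWords = {row.strip() for row in ff}
    # Segment the text, dropping stop words and whitespace-only tokens
    words = [w for w in jieba.cut(content) if w.strip() and w not in stopWords]
    print(words)
    return words


# High-frequency word statistics
def get_TF(words, topK=10):
    dics = {}
    for w in words:
        dics[w] = dics.get(w, 0) + 1
    # Sort the word-count dictionary by count, highest first
    rs = sorted(dics.items(), key=lambda x: x[1], reverse=True)
    # Honor topK instead of a hard-coded 10
    print(rs[:topK])
    return rs[:topK]
```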
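
The counting step can also be written with the standard library's `collections.Counter`, whose `most_common` method returns the entries with the highest counts first. This is an alternative sketch rather than the article's code; the function name `top_k_words` and the sample token list are assumptions for illustration.

```python
from collections import Counter


def top_k_words(words, k=10):
    """Return the k most frequent words as (word, count) pairs."""
    return Counter(words).most_common(k)


# Hypothetical tokens that have already been segmented and filtered.
tokens = ["经济", "增长", "经济", "市场", "增长", "经济"]
print(top_k_words(tokens, k=2))  # [('经济', 3), ('增长', 2)]
```

The result has the same shape as the sorted list of `(word, count)` pairs built by hand above, so either version can feed the same downstream code.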

Run and results:

`words = get_content("news.txt")`

`get_TF(words)`

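The user dictionary that jieba expects generally has the following format.
Each line has three parts, separated by spaces:
word, word frequency (optional), part of speech (optional)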
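
To make this layout concrete, the sketch below parses lines in that shape with plain Python. It is an illustration only: `parse_userdict_line` and the sample entries are assumptions, and jieba's own loader may treat edge cases differently.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag); missing parts are None.

    Assumes space-separated fields in the order: word, frequency, part of speech.
    """
    parts = line.strip().split()
    if not parts:
        return None  # blank line
    word, freq, tag = parts[0], None, None
    for field in parts[1:]:
        if field.isdigit() and freq is None and tag is None:
            freq = int(field)  # numeric field before any tag is the frequency
        else:
            tag = field  # anything else is taken as the part-of-speech tag
    return word, freq, tag


# Hypothetical entries showing the allowed shapes.
sample_lines = ["云计算 5 n", "自然语言处理 10", "高频词"]
for entry in sample_lines:
    print(parse_userdict_line(entry))
```

With jieba installed, a file written in this format can be loaded with `jieba.load_userdict(file_name)` before segmenting, so that the listed words are kept as single tokens.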