Getting Started with NLTK

First, install the library with pip install nltk, then install the corpus packages. The following code opens the downloader:

  import nltk
  nltk.download()

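Without a proxy the download made no progress at all, and even with a proxy it was painfully slow, so I downloaded the data packages by hand and installed them locally: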

https://blog.csdn.net/Csharp289637169/article/details/54344260

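Where should the downloaded files go? Try running the code below. If the corpus cannot be found, the error message lists the directories NLTK searches for data; put the files in any one of them.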

  from nltk.corpus import brown
  print(brown.categories())
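
For the manual placement step, a minimal sketch using only the standard library might look like the following. It assumes the downloaded package is a zip archive such as brown.zip with a top-level brown/ folder inside, and that the target directory is one of the search paths printed in the error message. The function name install_local_package and the example paths are hypothetical. Corpora such as brown and stopwords belong under the corpora subdirectory, while tokenizer models such as punkt belong under tokenizers.

```python
import zipfile
from pathlib import Path


def install_local_package(zip_path, nltk_data_dir, category="corpora"):
    """Unpack a manually downloaded NLTK data archive (illustrative helper).

    zip_path      -- archive downloaded by hand, e.g. brown.zip (assumed name)
    nltk_data_dir -- one of the search paths listed in NLTK's error message
    category      -- subdirectory of nltk_data, e.g. "corpora" or "tokenizers"
    """
    target = Path(nltk_data_dir) / category
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Assuming the archive holds a top-level brown/ folder, this
        # produces <nltk_data_dir>/corpora/brown/...
        archive.extractall(target)
    return target


# Hypothetical usage (paths are examples only):
# install_local_package("brown.zip", Path.home() / "nltk_data")
```

After unpacking, rerunning the brown.categories() check above should no longer raise a lookup error, provided the chosen directory really is on the search path.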

With that, the setup is complete, and the first simple program can be run.

Next, try counting word frequencies:

  import string

  import nltk
  import nltk.stem
  from nltk.corpus import stopwords

  # Fallback text, used if the file cannot be read.
  sentence = "hello world"
  path = "D:\\New_desktop\\Alice_s_Adventures_in_Wonderland_057\\Alice's Adventures in Wonderland - Lewis Carroll.txt"
  try:
      # The with statement closes the file even if reading fails.
      with open(path, 'r', encoding='utf-8') as f:
          sentence = f.read()
  except FileNotFoundError:
      print('Could not open the specified file!')
  except LookupError:
      print('An unknown encoding was specified!')
  except UnicodeDecodeError:
      print('Decoding error while reading the file!')

  # Lowercase the text and strip ASCII punctuation.
  lower = sentence.lower()
  remove = str.maketrans('', '', string.punctuation)
  without_punctuation = lower.translate(remove)
  print(without_punctuation)

  # Tokenize, then drop English stop words.
  # Build the stop-word set once instead of reloading the list for every token.
  tokens = nltk.word_tokenize(without_punctuation)
  stop_words = set(stopwords.words('english'))
  without_stopwords = [w for w in tokens if w not in stop_words]

  # Stem each remaining token; the argument selects the language.
  s = nltk.stem.SnowballStemmer('english')
  cleaned_text = [s.stem(ws) for ws in without_stopwords]
  print(cleaned_text)

  # Count the stems, plot the 50 most frequent, and print every count.
  freq = nltk.FreqDist(cleaned_text)
  freq.plot(50, cumulative=False)
  for key, val in freq.items():
      print(str(key) + ':' + str(val))
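
To see what the counting step does without any downloaded data, here is a small standard-library sketch. nltk.FreqDist is a subclass of collections.Counter, so a plain Counter illustrates the same idea. The sample sentence and the tiny stop-word set are made up for illustration, tokens are split on whitespace instead of with nltk.word_tokenize, and stemming is skipped.

```python
import string
from collections import Counter

# Made-up sample text and a tiny illustrative stop-word set
# (not NLTK's English stop-word list).
text = "The cat sat on the mat. The cat slept!"
stop_words = {"the", "on"}

# Same cleaning steps as the script: lowercase, then strip punctuation.
lower = text.lower()
without_punctuation = lower.translate(str.maketrans("", "", string.punctuation))

# Whitespace splitting stands in for nltk.word_tokenize here.
tokens = without_punctuation.split()
without_stopwords = [w for w in tokens if w not in stop_words]

# Count the remaining tokens and print them, most frequent first.
freq = Counter(without_stopwords)
for key, val in freq.most_common():
    print(key + ":" + str(val))
```

The same most_common method is available on a FreqDist, since it inherits from Counter, which is handy when only the top few entries are of interest.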

Running the script plots the 50 most frequent stems and prints the count for every stem.

Finally, here are the data sources used for testing:
123.txt, 12.txt, Jane Eyre - Charlotte Bronte.txt