Installation

    pip install -U spacy
    python -m spacy download en_core_web_sm

Features

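A notable feature of spaCy is that tokenization keeps track of positions: every token records its character offset in the original string, and whitespace from the original string is preserved as well (a single space separating two words is not emitted as a token of its own, but any extra whitespace is). This is quite different from NLTK: for each token, the idx attribute gives its position in the original string directly.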

    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Two spaces between the words: the extra space becomes its own token.
    doc = nlp('Hello  World!')
    for token in doc:
        print('"' + token.text + '"', token.idx)

Output:

    "Hello" 0
    " " 6
    "World" 7
    "!" 12
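
The offset behaviour described above can be checked without a trained model. The following is a minimal sketch, assuming spaCy is installed and using a blank English pipeline (tokenizer only) instead of en_core_web_sm; the variable names are illustrative.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

text = "Hello  World!"  # two spaces between the words
doc = nlp(text)

# token.idx is the character offset of the token in the original string,
# so slicing the source text at that offset gives back the token text.
for token in doc:
    start = token.idx
    end = start + len(token.text)
    assert text[start:end] == token.text
    print(repr(token.text), start, end)

# text_with_ws is the token text plus any trailing whitespace it owns,
# so concatenating it over all tokens rebuilds the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the separating space is stored as trailing whitespace on the preceding token rather than discarded, the original string can always be reconstructed from the tokens.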

Sentence detection

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("These are apples. These are oranges.")
    for sent in doc.sents:
        print(sent)

Output:

    These are apples.
    These are oranges.
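
Sentences are spans of the document, so they carry the same character offsets as tokens. As a rough sketch, assuming spaCy v3 or later, the rule-based sentencizer component can stand in for the trained model when only punctuation-based splitting is needed. Note that this is a different technique from the parser-based segmentation that en_core_web_sm uses.

```python
import spacy

# Assumption: spaCy v3+, where built-in components are added by name.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based: splits on sentence-final punctuation

text = "These are apples. These are oranges."
doc = nlp(text)

# Each sentence is a Span with character offsets into the original text.
sentences = [(sent.start_char, sent.end_char, sent.text) for sent in doc.sents]
for start, end, sent_text in sentences:
    print(start, end, sent_text)
```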

Part-of-speech (POS) tagging

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Next week I'll be in Madrid.")
    # tag_ is the fine-grained part-of-speech tag of each token.
    print([(token.text, token.tag_) for token in doc])

Output:

    [('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]
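
The values of token.tag_ are fine-grained tags (token.pos_ holds the coarse-grained tag). When a tag is unfamiliar, spacy.explain can look it up in spaCy's built-in glossary. A small sketch, with the tag list copied by hand from the output shown in this section:

```python
import spacy

# Tags copied from the tagging output in this section.
tags = ["JJ", "NN", "PRP", "MD", "VB", "IN", "NNP", "."]

# spacy.explain looks a label up in spaCy's built-in glossary and
# returns None for labels it does not know.
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(tag, "->", description)
```

The exact wording of the descriptions depends on the installed spaCy version.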

Named entity recognition

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Next week I'll be in Madrid.")
    for ent in doc.ents:
        print(ent.text, ent.label_)

Output:

    Next week DATE
    Madrid GPE
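
Entities are also spans, so each one exposes start_char and end_char offsets into the original string. The sketch below illustrates this without a trained model, assuming spaCy v3 or later: it swaps the statistical NER of en_core_web_sm for a rule-based entity_ruler with a single hypothetical pattern, so it only finds the entity it is told about.

```python
import spacy

# Assumption: spaCy v3+. A blank pipeline plus a rule-based entity_ruler
# stands in for the statistical NER component of en_core_web_sm.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Madrid"}])  # hypothetical pattern

text = "Next week I'll be in Madrid."
doc = nlp(text)

# Each entity is a Span: start_char/end_char index into the original text.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
    assert text[ent.start_char:ent.end_char] == ent.text
```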

In a Jupyter notebook, the result can also be visualized:

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Next week I'll be in Madrid.")
    # Render the recognized entities inline in the notebook.
    displacy.render(doc, style='ent', jupyter=True)
    for ent in doc.ents:
        print(ent.text, ent.label_)


Noun chunks (phrases)

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
    # Print each noun chunk, its label, and the token that heads it.
    for chunk in doc.noun_chunks:
        print(chunk.text, chunk.label_, chunk.root.text)
    # Wall Street Journal NP Journal
    # an interesting piece NP piece
    # crypto currencies NP currencies

Dependency parsing

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
    # Format: token/tag <--dependency label-- head/tag
    for token in doc:
        print("{0}/{1} <--{2}-- {3}/{4}".format(
            token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
    # Wall/NNP <--compound-- Street/NNP
    # Street/NNP <--compound-- Journal/NNP
    # Journal/NNP <--nsubj-- published/VBD
    # just/RB <--advmod-- published/VBD
    # published/VBD <--ROOT-- published/VBD
    # an/DT <--det-- piece/NN
    # interesting/JJ <--amod-- piece/NN
    # piece/NN <--dobj-- published/VBD
    # on/IN <--prep-- piece/NN
    # crypto/JJ <--compound-- currencies/NNS
    # currencies/NNS <--pobj-- on/IN

Likewise, the dependency parse can be visualized in a notebook:

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
    # 'distance' sets the spacing between words in the rendered parse.
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

spaCy - Figure 2