Syntactic Analysis – determining the syntactic structure of text by analyzing the underlying grammar of the language
• Syntax = how words combine to form phrases and sentences
• Gives a deeper understanding of word groups and their grammatical relationships
• Sentences are not simply bags of words; the same words in a different order mean something different:
Mary bought John a coffee
vs
John bought Mary a coffee
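The contrast above can be checked mechanically. The following is a minimal sketch, assuming a naive whitespace tokenizer (the helper name `bag_of_words` is made up for illustration): the two sentences have identical bags of words, yet differ as sequences.

```python
from collections import Counter

def bag_of_words(sentence):
    """Return word counts for a sentence, discarding word order."""
    return Counter(sentence.lower().split())

s1 = "Mary bought John a coffee"
s2 = "John bought Mary a coffee"

# Same multiset of words ...
print(bag_of_words(s1) == bag_of_words(s2))  # True
# ... but a different word order, hence a different meaning.
print(s1.split() == s2.split())              # False
```

Any model that only sees the bag cannot tell who bought the coffee for whom; recovering that requires syntactic structure.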
Parsing
• Formally tries to resolve structural ambiguity in text, e.g., "Mary saw a cat with binoculars" (did Mary use the binoculars, or did the cat have them?)
• Typically, in the broad context of the NLP Pipeline: Tokenize → POS Tag → Parse → …
• Applications:
− Machine Translation
− Question Answering
− Text Summarisation
− Grammar Checking
− Information Extraction
• Constituency Parsing
• Phrases are represented as nodes in a tree
• Dependency Parsing
• Represents dependencies between words
• Dependency parsing is typically faster and works for many (possibly all) languages
• Constituency parsing tends to favor languages with somewhat fixed word order patterns and clear constituency structures, e.g., English
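To make "dependencies between words" concrete, here is a small sketch of one way to store a dependency analysis: one head index and one relation label per word. The analysis is hand-written for illustration, and the relation names (`nsubj`, `iobj`, `obj`, `det`, `root`) are assumed to follow Universal Dependencies conventions; none of this comes from a parser.

```python
# One entry per word: index of its head (-1 for the root) and the relation label.
words = ["Mary", "bought", "John", "a", "coffee"]
heads = [1, -1, 1, 4, 1]
labels = ["nsubj", "root", "iobj", "det", "obj"]

def dependents(head_index):
    """Return the words whose head is the word at head_index."""
    return [words[i] for i, h in enumerate(heads) if h == head_index]

def arcs():
    """Return (head word, relation, dependent word) triples, skipping the root."""
    return [(words[h], labels[i], words[i]) for i, h in enumerate(heads) if h >= 0]

print(dependents(1))  # ['Mary', 'John', 'coffee']
print(dependents(4))  # ['a']
```

Note that there are no phrase nodes here: every arc links two words directly, which is the main structural difference from a constituency tree.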
How to represent sentence structure?
Constituency tree
(phrase structure tree)
Constituency Parsing
Adds more structure to POS-tagged sentences
Splits sentences into sub-phrases, or constituents
Tree form:
Types of phrases = non-terminals
Words in the sentence = terminals
Also known as Phrase Structure Trees
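The tree form described above can be sketched with nested tuples, where each tuple is `(label, child, ...)` and a plain string is a word. The tree below is a hand-built illustrative analysis, and the Penn Treebank-style tags (`NNP`, `VBD`, `DT`, `NN`) are an assumption rather than parser output.

```python
# (label, child, child, ...); a leaf is a plain string (a terminal).
tree = ("S",
        ("NP", ("NNP", "Mary")),
        ("VP", ("VBD", "bought"),
               ("NP", ("NNP", "John")),
               ("NP", ("DT", "a"), ("NN", "coffee"))))

def terminals(node):
    """Collect the words (leaves) of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(terminals(child))
    return words

def non_terminals(node):
    """Collect the labels of all internal nodes in pre-order."""
    if isinstance(node, str):
        return []
    found = [node[0]]
    for child in node[1:]:
        found.extend(non_terminals(child))
    return found

print(terminals(tree))  # ['Mary', 'bought', 'John', 'a', 'coffee']
```

Reading the leaves left to right gives back the sentence; the internal labels are the phrase types and POS tags layered on top of it.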
Constituent Parse Tree
Constituent: a word or a group of words that behaves as a single unit
Why do these words group together?
• They appear in similar syntactic environments
• They can be preposed or postposed as a unit
A context-free grammar consists of:
• a set of context-free rules, each of which expresses the ways that symbols of the language can be grouped and ordered together
NP = noun phrase, VP = verb phrase, PP = prepositional phrase.
• a lexicon of words and symbols, and a set of rules which express facts about the lexicon.
These are the building blocks of a Constituency Parser
Context-Free Grammars (CFGs)
CFGs are more general than Regular Grammars: every regular language is context-free, but not every context-free language is regular
Formal Definition of CFG
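For reference, the usual textbook formulation is sketched below; the symbol names are a common convention and are assumed here, not taken from the slides.

```latex
G = (N, \Sigma, R, S), \quad
\begin{cases}
N & \text{a finite set of non-terminal symbols} \\
\Sigma & \text{a finite set of terminal symbols, with } \Sigma \cap N = \emptyset \\
R & \text{a finite set of rules } A \rightarrow \beta,\ A \in N,\ \beta \in (\Sigma \cup N)^{*} \\
S \in N & \text{the designated start symbol}
\end{cases}
```

"Context-free" refers to the left-hand side of every rule being a single non-terminal, so a rule can be applied regardless of what surrounds that symbol.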
CFG Derivations
The sequence of rule expansions is called a derivation of the string of words
A Toy Example
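As one possible toy illustration of a derivation, the sketch below rewrites the leftmost non-terminal at each step using a tiny hand-written grammar. The grammar, the rule choices, and the function name `leftmost_derivation` are assumptions made for this example.

```python
# Toy grammar: non-terminal -> list of possible right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Mary"], ["John"], ["Det", "N"]],
    "VP": [["V", "NP", "NP"]],
    "Det": [["a"]],
    "N": [["coffee"]],
    "V": [["bought"]],
}

def leftmost_derivation(start, choices):
    """Expand the leftmost non-terminal once per entry in choices.

    Each entry is the index of the right-hand side to use for that symbol.
    Returns every intermediate sentential form, starting with [start].
    """
    form = [start]
    steps = [list(form)]
    for rhs_index in choices:
        # Position of the leftmost symbol that can still be expanded.
        pos = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        form = form[:pos] + GRAMMAR[form[pos]][rhs_index] + form[pos + 1:]
        steps.append(list(form))
    return steps

for step in leftmost_derivation("S", [0, 0, 0, 0, 1, 2, 0, 0]):
    print(" ".join(step))
# Last line printed: Mary bought John a coffee
```

Each printed line is one sentential form; the full list of lines is the derivation of the sentence from S.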
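Probabilistic Context-Free Grammars (PCFGs)
Each grammar rule A → β carries a parameter: the probability of expanding A as β; the probabilities of all rules with the same left-hand side sum to 1 [3]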
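A minimal sketch of how these rule parameters are used: under a PCFG, the probability of a parse tree is the product of the probabilities of the rules used in it. The rule probabilities below are invented for illustration, and trees are nested tuples of the form `(label, child, ...)`.

```python
import math
from collections import defaultdict

# (lhs, rhs) -> probability; the numbers are made up for this example.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Mary",)): 0.4,
    ("NP", ("John",)): 0.4,
    ("NP", ("Det", "N")): 0.2,
    ("VP", ("V", "NP", "NP")): 1.0,
    ("Det", ("a",)): 1.0,
    ("N", ("coffee",)): 1.0,
    ("V", ("bought",)): 1.0,
}

def is_proper(rule_prob):
    """Check that the probabilities for each left-hand side sum to 1."""
    totals = defaultdict(float)
    for (lhs, _), p in rule_prob.items():
        totals[lhs] += p
    return all(math.isclose(t, 1.0) for t in totals.values())

def tree_probability(tree):
    """Multiply the probabilities of all rules used in the tree."""
    if isinstance(tree, str):          # a word contributes no rule
        return 1.0
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_probability(child)
    return p

TREE = ("S", ("NP", "Mary"),
             ("VP", ("V", "bought"), ("NP", "John"),
                    ("NP", ("Det", "a"), ("N", "coffee"))))

print(is_proper(RULE_PROB))   # True
print(tree_probability(TREE))
```

For this tree the product is 0.4 × 0.4 × 0.2 (all other rules have probability 1), which is about 0.032. When a sentence has several parses, a PCFG parser prefers the tree with the highest such product.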
Learning PCFGs from Treebanks
• Penn Treebank and English Web Treebank
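One common way to learn rule probabilities from a treebank is relative-frequency (maximum likelihood) estimation: count how often each rule occurs and divide by the count of its left-hand side. The sketch below assumes this estimator and uses a two-tree toy "treebank" written by hand; loading the actual Penn Treebank or English Web Treebank is not shown.

```python
from collections import Counter

def rules(tree):
    """Yield every (lhs, rhs) rule used in a nested-tuple tree."""
    if isinstance(tree, str):
        return
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        yield from rules(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: count(A -> beta) / count(A)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Hand-written toy treebank (an assumption for illustration).
TREEBANK = [
    ("S", ("NP", "Mary"), ("VP", ("V", "slept"))),
    ("S", ("NP", ("Det", "a"), ("N", "cat")), ("VP", ("V", "slept"))),
]

pcfg = estimate_pcfg(TREEBANK)
print(pcfg[("NP", ("Mary",))])       # 0.5
print(pcfg[("NP", ("Det", "N"))])    # 0.5
print(pcfg[("S", ("NP", "VP"))])     # 1.0
```

NP is expanded twice in the toy treebank, once each way, so both NP rules get probability 0.5. Real systems typically add smoothing so that rules unseen in the treebank do not get zero probability.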
Grammar Equivalence
• Two grammars are equivalent if they generate the same language (set of strings)
• Chomsky Normal Form (CNF)
• Allows only two types of rules: the right-hand side of each rule has either two non-terminals (A → B C) or one terminal (A → a),
• except S → ε (where S is the start symbol and ε is the empty string)
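The CNF conditions above can be checked rule by rule. This is a minimal sketch under stated assumptions: rules are `(lhs, rhs)` pairs with `rhs` a tuple, the helper name `is_cnf` is hypothetical, and the additional textbook requirement that S must not appear on any right-hand side when S → ε is present is deliberately not checked.

```python
def is_cnf(rules, non_terminals, start="S"):
    """Return True if every rule is A -> B C, A -> a, or start -> epsilon."""
    for lhs, rhs in rules:
        if len(rhs) == 2 and all(sym in non_terminals for sym in rhs):
            continue                      # A -> B C
        if len(rhs) == 1 and rhs[0] not in non_terminals:
            continue                      # A -> a
        if lhs == start and len(rhs) == 0:
            continue                      # S -> epsilon
        return False
    return True

NON_TERMINALS = {"S", "NP", "VP", "V", "X"}

# VP -> V NP NP has three symbols on the right, so this is not in CNF.
original = [("S", ("NP", "VP")), ("VP", ("V", "NP", "NP")),
            ("NP", ("Mary",)), ("V", ("bought",))]

# An equivalent binarized version, using a new non-terminal X.
binarized = [("S", ("NP", "VP")), ("VP", ("V", "X")), ("X", ("NP", "NP")),
             ("NP", ("Mary",)), ("V", ("bought",))]

print(is_cnf(original, NON_TERMINALS))   # False
print(is_cnf(binarized, NON_TERMINALS))  # True
```

The two grammars generate the same strings, which is the sense of equivalence used above: the ternary rule is split into two binary rules through the helper symbol X.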
Top-Down Parsing
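A top-down parser starts from the start symbol S and expands rules until the predicted words match the input. The sketch below is a simple backtracking recognizer (it answers yes or no, without building a tree) over a hand-written toy grammar; it assumes the grammar has no left-recursive rules, since those would make this naive strategy loop forever.

```python
# Toy grammar (an assumption for illustration); no left recursion.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Mary"], ["John"]],
    "VP": [["V", "NP", "NP"], ["V", "NP"]],
    "Det": [["a"]],
    "N": [["coffee"]],
    "V": [["bought"]],
}

def expand(symbols, words):
    """Return True if the symbol sequence can derive exactly the words."""
    if not symbols:
        return not words                  # success only if input is used up
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Non-terminal: try each right-hand side in turn (backtracking).
        return any(expand(rhs + rest, words) for rhs in GRAMMAR[first])
    # Terminal: it must match the next input word.
    return bool(words) and words[0] == first and expand(rest, words[1:])

def recognize(sentence):
    """Top-down recognition starting from the start symbol S."""
    return expand(["S"], sentence.split())

print(recognize("Mary bought John a coffee"))  # True
print(recognize("Mary bought a coffee"))       # True
print(recognize("bought Mary John"))           # False
```

The search is driven by the grammar, not by the input, so it may explore expansions that the words could never support before backtracking.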
Bottom-Up Parsing
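A bottom-up parser starts from the words and combines smaller constituents into larger ones. One well-known bottom-up method is the CKY algorithm, which requires a grammar in Chomsky Normal Form; it is used here as an illustrative choice, not necessarily the method the lecture has in mind. The sketch counts parse trees instead of building them, over a hand-written toy grammar.

```python
from collections import defaultdict

# Toy CNF grammar (an assumption): binary rules keyed by right-hand side.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}
LEXICAL = {
    "Mary": ["NP"], "binoculars": ["NP"], "saw": ["V"],
    "a": ["Det"], "cat": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    """CKY chart: chart[i, j][A] = number of A-trees over words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for label in LEXICAL.get(word, []):
            chart[i, i + 1][label] += 1
    for width in range(2, n + 1):             # span length, small to large
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):         # split point
                for left, lcount in list(chart[i, k].items()):
                    for right, rcount in list(chart[k, j].items()):
                        for parent in BINARY.get((left, right), []):
                            chart[i, j][parent] += lcount * rcount
    return chart[0, n][start]

print(count_parses("Mary saw a cat with binoculars".split()))  # 2
print(count_parses("Mary saw a cat".split()))                  # 1
```

The count of 2 reflects the structural ambiguity of this sentence: "with binoculars" can attach to the verb phrase (Mary used them) or to the noun phrase (the cat had them).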
Available Constituency Parsers
Stanford Parser http://nlp.stanford.edu/software/srparser.shtml
Berkeley Neural Parser https://spacy.io/universe/project/self-attentive-parser
UCSD Rethinking Self-Attention https://github.com/KhalilMrini/LAL-Parser