编译原理之美 - 【源码分析】编译原理之美手动实现脚本解释器源码分析 - 《软件工程知识体系》

概述
项目目录分析
词法分析器
REPL界面

概述

课程：编译原理之美
源码路径：https://gitee.com/richard-gong/PlayWithCompiler/tree/master/lab/craft

项目目录分析

词法分析

Token.java : 此文件定义了 Token 接口。
TokenReader.java : 此文件定义了 TokenReader 接口
TokenType.java : 此文件定义了可识别的 Token 类别
SimpleLexer.java : 此文件定义了一个简单的手写的词法分析器，其中包括一个 TokenReader 实现、状态机状态定义和一个根据字符串解析 Token 的函数。

两个接口的定义可以有效隔离类与类之间的耦合，同时也只将必要的方法进行暴露。

语法解析

语法解析目前处理的是上下文无关的文法

ASTNode.java : 此文件定义了 ASTNode 接口。 [ AST : 抽象语法树 ]
ASTNodeType.java : 此文件定义了 ASTNode 的类别。
SimpleCalculator.java : 此文件定义了一个可以做简单四则运算并给出结果.
SimpleParser.java : 此文件定义了一个简单的语法解析器, 支持表达式, 变量声明和初始化语句, 赋值语句.

语义识别
SimpleScript.java : 此文件定义了一个简单的脚本解释器, 其支持的语法在 SimpleParser.java 中.

词法分析器

Token.java
Token 需要包括类型和 原始值 两个属性。 原始值 存储为 String 类型。 ```java /**
- 一个简单的Token。
- 只有类型和文本值两个属性。 */ public interface Token{
  
  /**
  - Token的类型
  - @return */ public TokenType getType();
  /**
  - Token的文本值
  - @return */ public String getText();

}

<a name="Lvuvv"></a>
### TokenReader.java
利用 **TokenReader **对 **Token **的去取逻辑进行封装，方便后面增加必要的控制函数和逻辑。此类中的返回值都利用了 **Token **接口，实现对 Token 相关类的隔离。
```java
/**
 * 一个Token流。由Lexer生成。Parser可以从中获取Token。
 */
public interface TokenReader{
    /**
     * 返回Token流中下一个Token，并从流中取出。 如果流已经为空，返回null;
     */
    public Token read();
    /**
     * 返回Token流中下一个Token，但不从流中取出。 如果流已经为空，返回null;
     */
    public Token peek();
    /**
     * Token流回退一步。恢复原来的Token。
     */
    public void unread();
    /**
     * 获取Token流当前的读取位置。
     * @return
     */
    public int getPosition();
    /**
     * 设置Token流当前的读取位置
     * @param position
     */
    public void setPosition(int position);
}

TokenType.java

可以解析的 Token 类型定义为“四则运算”，“大小判断”，“表达式符号”，“赋值”，“条件”，“变量定义”，“字面量”。

/**
 * Token的类型
 */
public enum TokenType{
    Plus,   // +
    Minus,  // -
    Star,   // *
    Slash,  // /
    GE,     // >=
    GT,     // >
    EQ,     // ==
    LE,     // <=
    LT,     // <
    SemiColon, // ;
    LeftParen, // (
    RightParen,// )
    Assignment,// =
    If,
    Else,
    Int,
    Identifier,     //标识符
    IntLiteral,     //整型字面量
    StringLiteral   //字符串字面量
}

SimpleLexer.java

此文件是词法分析器的核心组件，由于源文件内容比较多下面先总分析，再分别分析各个主要功能的实现逻辑。

/**
 * 一个简单的手写的词法分析器。
 * 能够为后面的简单计算器、简单脚本语言产生Token。
 */
public class SimpleLexer {
    //测试解析器的能力
    public static void main(String args[]) {
        ...
    }
    //下面几个变量是在解析过程中用到的临时变量,如果要优化的话，可以塞到方法里隐藏起来
    private StringBuffer tokenText = null;   //临时保存token的文本
    private List<Token> tokens = null;       //保存解析出来的Token
    private SimpleToken token = null;        //当前正在解析的Token
    //是否是字母
    private boolean isAlpha(int ch) {
        ...
    }
    //是否是数字
    private boolean isDigit(int ch) {
        ...
    }
    //是否是空白字符
    private boolean isBlank(int ch) {
        ...
    }
    /**
     * 有限状态机进入初始状态。
     * 这个初始状态其实并不做停留，它马上进入其他状态。
     * 开始解析的时候，进入初始状态；某个Token解析完毕，也进入初始状态，在这里把Token记下来，然后建立一个新的Token。
     */
    private DfaState initToken(char ch) {
         ...
    }
    /**
     * 解析字符串，形成Token。
     * 这是一个有限状态自动机，在不同的状态中迁移。
     */
    public SimpleTokenReader tokenize(String code) {
         ...
    }
    /**
     * Token的一个简单实现。只有类型和文本值两个属性。
     */
    private final class SimpleToken implements Token {
        ...
    }
    /**
     * 打印所有的Token
     */
    public static void dump(SimpleTokenReader tokenReader){
        ...
    }
    /**
     * 有限状态机的各种状态。
     */
    private enum DfaState {
       ...
    }
    /**
     * 一个简单的Token流。是把一个Token列表进行了封装。
     */
    private class SimpleTokenReader implements TokenReader {
       ...
    }
}

SimpleToken

这是一个对 Token 接口的实现。由于此类定义在 SimpleLexer 内部，因此在 SimpleLexer 中可以直接调用 SimpleToken 中的私有变量进行赋值。根据设计分析，在词法解析器之外是不需要也不应该修改 **type** 和 **text** 参数。

    /**
     * Token的一个简单实现。只有类型和文本值两个属性。
     */
    private final class SimpleToken implements Token {
        //Token类型
        private TokenType type = null;
        //文本值
        private String text = null;
        @Override
        public TokenType getType() {
            return type;
        }
        @Override
        public String getText() {
            return text;
        }
    }

SimpleTokenReader

这是一个对 TokenReader 接口的实现。利用 List 存储 Token，并利用 List 特性覆盖接口函数。

private class SimpleTokenReader implements TokenReader {
    List<Token> tokens = null; 
    int pos = 0; // 记录当前的读取位置
    public SimpleTokenReader(List<Token> tokens) {
        this.tokens = tokens;
    }
    @Override
    public Token read() {
        // 这里并没有真的将 Token 移出数组，只是让标志为后移
        if (pos < tokens.size()) {
            return tokens.get(pos++);
        }
        return null;
    }
    @Override
    public Token peek() {
        if (pos < tokens.size()) {
            return tokens.get(pos);
        }
        return null;
    }
    @Override
    public void unread() {
        if (pos > 0) {
            pos--;
        }
    }
    @Override
    public int getPosition() {
        return pos;
    }
    @Override
    // 设置读取位置
    public void setPosition(int position) {
        if (position >=0 && position < tokens.size()){
            pos = position;
        }
    }
}

DfaState

根据状态图定义的状态值

/**
 * 有限状态机的各种状态。
 */
private enum DfaState {
    Initial,
    If, Id_if1, Id_if2, Else, Id_else1, Id_else2, Id_else3, Id_else4, Int, Id_int1, Id_int2, Id_int3, Id, GT, GE,
    Assignment,
    Plus, Minus, Star, Slash,
    SemiColon,
    LeftParen,
    RightParen,
    IntLiteral
    }

tokenize

此为词法解析器对外暴露的唯一方法，够建一个解析器后可以用此方法去解析所有的数据。

其接受一个字符串进行 Token 解析。
此函数实现了状态机的切换和 Token 数组的构建。
其中临时变量的初始化还可以提取函数。
它也存在最后一个 Token 的存储问题。

每个地方都使用 initToken 进行处理，我个人认为并不是很妥当。

public SimpleTokenReader tokenize(String code) {
  tokens = new ArrayList<Token>();    // Token 临时数组复位
  CharArrayReader reader = new CharArrayReader(code.toCharArray());
  tokenText = new StringBuffer(); // 临时值复位
  token = new SimpleToken(); // 临时 Token 复位
  int ich = 0;
  char ch = 0;
  DfaState state = DfaState.Initial;  // 状态装换为初始态
  try {
      while ((ich = reader.read()) != -1) {
          ch = (char) ich;
          switch (state) {
              case Initial:
                  state = initToken(ch);          //重新确定后续状态
                  break;
              case Id:
                  if (isAlpha(ch) || isDigit(ch)) {
                      tokenText.append(ch);       //保持标识符状态
                  } else {
                      state = initToken(ch);      //退出标识符状态，并保存Token
                  }
                  break;
              case GT:
                  if (ch == '=') {
                      token.type = TokenType.GE;  //转换成GE
                      state = DfaState.GE;
                      tokenText.append(ch);
                  } else {
                      state = initToken(ch);      //退出GT状态，并保存Token
                  }
                  break;
              case GE:
              case Assignment:
              case Plus:
              case Minus:
              case Star:
              case Slash:
              case SemiColon:
              case LeftParen:
              case RightParen:
                  state = initToken(ch);          //退出当前状态，并保存Token
                  break;
              case IntLiteral:
                  if (isDigit(ch)) {
                      tokenText.append(ch);       //继续保持在数字字面量状态
                  } else {
                      state = initToken(ch);      //退出当前状态，并保存Token
                  }
                  break;
              case Id_int1:
                  if (ch == 'n') {
                      state = DfaState.Id_int2;
                      tokenText.append(ch);
                  }
                  else if (isDigit(ch) || isAlpha(ch)){
                      state = DfaState.Id;    //切换回Id状态
                      tokenText.append(ch);
                  }
                  else {
                      state = initToken(ch);
                  }
                  break;
              case Id_int2:
                  if (ch == 't') {
                      state = DfaState.Id_int3;
                      tokenText.append(ch);
                  }
                  else if (isDigit(ch) || isAlpha(ch)){
                      state = DfaState.Id;    //切换回id状态
                      tokenText.append(ch);
                  }
                  else {
                      state = initToken(ch);
                  }
                  break;
              case Id_int3:
                  if (isBlank(ch)) {
                      token.type = TokenType.Int;
                      state = initToken(ch);
                  }
                  else{
                      state = DfaState.Id;    //切换回Id状态
                      tokenText.append(ch);
                  }
                  break;
              default:
          }
      }
      // 把最后一个token送进去
      if (tokenText.length() > 0) {
          initToken(ch);
      }
  } catch (IOException e) {
      e.printStackTrace();
  }
  return new SimpleTokenReader(tokens);
}

initToken

此函数处理状态的初始化以及 Token 数据的保存。我个人认为此函数中功能并不单一，应该将保存的功能单独列出来。

/**
* 有限状态机进入初始状态。
* 这个初始状态其实并不做停留，它马上进入其他状态。
* 开始解析的时候，进入初始状态；某个Token解析完毕，也进入初始状态，在这里把Token记下来，然后建立一个新的Token。
* @param ch
* @return
*/
private DfaState initToken(char ch) {
  if (tokenText.length() > 0) {
      token.text = tokenText.toString();
      tokens.add(token);
      tokenText = new StringBuffer();
      token = new SimpleToken();
  }
  DfaState newState = DfaState.Initial;
  if (isAlpha(ch)) {              //第一个字符是字母
      if (ch == 'i') {
          newState = DfaState.Id_int1;
      } else {
          newState = DfaState.Id; //进入Id状态
      }
      token.type = TokenType.Identifier;
      tokenText.append(ch);
  } else if (isDigit(ch)) {       //第一个字符是数字
      newState = DfaState.IntLiteral;
      token.type = TokenType.IntLiteral;
      tokenText.append(ch);
  } else if (ch == '>') {         //第一个字符是>
      newState = DfaState.GT;
      token.type = TokenType.GT;
      tokenText.append(ch);
  } else if (ch == '+') {
      newState = DfaState.Plus;
      token.type = TokenType.Plus;
      tokenText.append(ch);
  } else if (ch == '-') {
      newState = DfaState.Minus;
      token.type = TokenType.Minus;
      tokenText.append(ch);
  } else if (ch == '*') {
      newState = DfaState.Star;
      token.type = TokenType.Star;
      tokenText.append(ch);
  } else if (ch == '/') {
      newState = DfaState.Slash;
      token.type = TokenType.Slash;
      tokenText.append(ch);
  } else if (ch == ';') {
      newState = DfaState.SemiColon;
      token.type = TokenType.SemiColon;
      tokenText.append(ch);
  } else if (ch == '(') {
      newState = DfaState.LeftParen;
      token.type = TokenType.LeftParen;
      tokenText.append(ch);
  } else if (ch == ')') {
      newState = DfaState.RightParen;
      token.type = TokenType.RightParen;
      tokenText.append(ch);
  } else if (ch == '=') {
      newState = DfaState.Assignment;
      token.type = TokenType.Assignment;
      tokenText.append(ch);
  } else {
      newState = DfaState.Initial; // skip all unknown patterns
  }
  return newState;
}

辅助函数

```java //是否是字母 private boolean isAlpha(int ch) { return ch >= ‘a’ && ch <= ‘z’ || ch >= ‘A’ && ch <= ‘Z’; }

//是否是数字 private boolean isDigit(int ch) { return ch >= ‘0’ && ch <= ‘9’; }

//是否是空白字符 private boolean isBlank(int ch) { return ch == ‘ ‘ || ch == ‘\t’ || ch == ‘\n’; }

/**

 * 打印所有的Token
 * @param tokenReader
 */

public static void dump(SimpleTokenReader tokenReader){ System.out.println(“text\ttype”); Token token = null; while ((token= tokenReader.read())!=null){ System.out.println(token.getText()+”\t\t”+token.getType()); } }

<a name="DtP0A"></a>
## 语法解析器
<a name="ASTNode.java"></a>
### ASTNode.java
**ASTNode **接口提供子节点，父节点以及自身类型和文本值的方法。
```java
/**
 * AST的节点。
 * 属性包括AST的类型、文本值、下级子节点和父节点
 */
public interface ASTNode{
    //父节点
    public ASTNode getParent();

    //子节点
    public List<ASTNode> getChildren();

    //AST类型
    public ASTNodeType getType();

    //文本值
    public String getText();
}

ASTNodeType.java

此处理的类型定义将一些标识符进行了简化

/**
 * AST节点的类型。
 */
public enum ASTNodeType{
    Programm,           //程序入口，根节点

    IntDeclaration,     //整型变量声明
    ExpressionStmt,     //表达式语句，即表达式后面跟个分号
    AssignmentStmt,     //赋值语句

    Primary,            //基础表达式
    Multiplicative,     //乘法表达式
    Additive,           //加法表达式

    Identifier,         //标识符
    IntLiteral          //整型字面量
}

SimpleCalculator.java

语法规则：

 * additive -> multiplicative | multiplicative + additive
 * multiplicative -> primary | primary * multiplicative
 * primary -> IntLiteral | ( additive )

这里实现了一个根据语言分析进行四则运算的语法，语法本身存在结合性问题即会先算后面的结果再算前面的内容

/**
 * 实现一个计算器，但计算的结合性是有问题的。因为它使用了下面的语法规则：
 *
 * additive -> multiplicative | multiplicative + additive
 * multiplicative -> primary | primary * multiplicative    //感谢@Void_seT，原来写成+号了，写错了。
 *
 * 递归项在右边，会自然的对应右结合。我们真正需要的是左结合。
 */
public class SimpleCalculator {

    public static void main(String[] args) {
        ...
    }

    /**
     * 执行脚本，并打印输出AST和求值过程。
     * @param script
     */
    public void evaluate(String script){
        ...
    }

    /**
     * 解析脚本，并返回根节点
     * @param code
     * @return
     * @throws Exception
     */
    public ASTNode parse(String code) throws Exception {
        ...
    }

    /**
     * 对某个AST节点求值，并打印求值过程。
     * @param node
     * @param indent  打印输出时的缩进量，用tab控制
     * @return
     */
    private int evaluate(ASTNode node, String indent) {
       ...
    }

    /**
     * 语法解析：根节点
     * @return
     * @throws Exception
     */
    private SimpleASTNode prog(TokenReader tokens) throws Exception {
        ...
    }

    /**
     * 整型变量声明语句，如：
     * int a;
     * int b = 2*3;
     *
     * @return
     * @throws Exception
     */
    private SimpleASTNode intDeclare(TokenReader tokens) throws Exception {
        ...
    }

    /**
     * 语法解析：加法表达式
     * @return
     * @throws Exception
     */
    private SimpleASTNode additive(TokenReader tokens) throws Exception {
        ...
    }

    /**
     * 语法解析：乘法表达式
     * @return
     * @throws Exception
     */
    private SimpleASTNode multiplicative(TokenReader tokens) throws Exception {
        ...
    }

    /**
     * 语法解析：基础表达式
     * @return
     * @throws Exception
     */
    private SimpleASTNode primary(TokenReader tokens) throws Exception {
        ...
    }

    /**
     * 一个简单的AST节点的实现。
     * 属性包括：类型、文本值、父节点、子节点。
     */
    private class SimpleASTNode implements ASTNode {
          ...
    }

    /**
     * 打印输出AST的树状结构
     * @param node
     * @param indent 缩进字符，由tab组成，每一级多一个tab
     */
    private void dumpAST(ASTNode node, String indent) {
        ...
    }
}

SimpleASTNode

对 ASTNode 接口的实现，利用 List 存储必要的数据。

private class SimpleASTNode implements ASTNode {
        SimpleASTNode parent = null;
        List<ASTNode> children = new ArrayList<ASTNode>();
        List<ASTNode> readonlyChildren = Collections.unmodifiableList(children);
        ASTNodeType nodeType = null;
        String text = null;


        public SimpleASTNode(ASTNodeType nodeType, String text) {
            this.nodeType = nodeType;
            this.text = text;
        }

        @Override
        public ASTNode getParent() {
            return parent;
        }

        @Override
        public List<ASTNode> getChildren() {
            return readonlyChildren;
        }

        @Override
        public ASTNodeType getType() {
            return nodeType;
        }

        @Override
        public String getText() {
            return text;
        }

        public void addChild(SimpleASTNode child) {
            children.add(child);
            child.parent = this;
        }

    }

ASTNode 接口实现，用简单的数据结构存储

/**
 * 一个简单的AST节点的实现。
 * 属性包括：类型、文本值、父节点、子节点。
 */
private class SimpleASTNode implements ASTNode {
    SimpleASTNode parent = null;
    List<ASTNode> children = new ArrayList<ASTNode>();
    List<ASTNode> readonlyChildren = Collections.unmodifiableList(children);
    ASTNodeType nodeType = null;
    String text = null;

    public SimpleASTNode(ASTNodeType nodeType, String text) {
        this.nodeType = nodeType;
        this.text = text;
    }

    @Override
    public ASTNode getParent() {
        return parent;
    }

    @Override
    public List<ASTNode> getChildren() {
        return readonlyChildren;
    }

    @Override
    public ASTNodeType getType() {
        return nodeType;
    }

    @Override
    public String getText() {
        return text;
    }

    public void addChild(SimpleASTNode child) {
        children.add(child);
        child.parent = this;
    }

}

parse

对外公开的第一个函数，用于将输入的字符串语句解析为语法树节点。也是类的主要功能函数。

调用 SimpleLexer 解析出 Token 数组

利用 Token 数组生成语法树【调用 prog】

/**
* 解析脚本，并返回根节点
*/
public ASTNode parse(String code) throws Exception {
  SimpleLexer lexer = new SimpleLexer();
  TokenReader tokens = lexer.tokenize(code);

  ASTNode rootNode = prog(tokens);

  return rootNode;
}

prog

此函数创建语法树根节点，并根据自定义的语法规则进行

/**
* 语法解析：根节点
* @return
* @throws Exception
*/
private SimpleASTNode prog(TokenReader tokens) throws Exception {
  SimpleASTNode node = new SimpleASTNode(ASTNodeType.Programm, "Calculator");

  SimpleASTNode child = additive(tokens);

  if (child != null) {
      node.addChild(child);
  }
  return node;
}

语法解析

下面三个函数是解析器的核心算法部分，用于对定义的语法规则进行解析。

additive ：加法表达式解析 `` ```java /**

语法解析：加法表达式
@return

@throws Exception */ private SimpleASTNode additive(TokenReader tokens) throws Exception { SimpleASTNode child1 = multiplicative(tokens); SimpleASTNode node = child1;

Token token = tokens.peek(); if (child1 != null && token != null) {

if (token.getType() == TokenType.Plus || token.getType() == TokenType.Minus) {
    token = tokens.read();
    SimpleASTNode child2 = additive(tokens);
    if (child2 != null) {
        node = new SimpleASTNode(ASTNodeType.Additive, token.getText());
        node.addChild(child1);
        node.addChild(child2);
    } else {
        throw new Exception("invalid additive expression, expecting the right part.");
    }
}

} return node; }

/**

语法解析：乘法表达式
@return

@throws Exception */ private SimpleASTNode multiplicative(TokenReader tokens) throws Exception { SimpleASTNode child1 = primary(tokens); SimpleASTNode node = child1;

Token token = tokens.peek(); if (child1 != null && token != null) {

 if (token.getType() == TokenType.Star || token.getType() == TokenType.Slash) {
     token = tokens.read();
     SimpleASTNode child2 = multiplicative(tokens);
     if (child2 != null) {
         node = new SimpleASTNode(ASTNodeType.Multiplicative, token.getText());
         node.addChild(child1);
         node.addChild(child2);
     } else {
         throw new Exception("invalid multiplicative expression, expecting the right part.");
     }
 }

} return node; }

/**

语法解析：基础表达式
@return

@throws Exception */ private SimpleASTNode primary(TokenReader tokens) throws Exception { SimpleASTNode node = null; Token token = tokens.peek(); if (token != null) {

 if (token.getType() == TokenType.IntLiteral) {
     token = tokens.read();
     node = new SimpleASTNode(ASTNodeType.IntLiteral, token.getText());
 } else if (token.getType() == TokenType.Identifier) {
     token = tokens.read();
     node = new SimpleASTNode(ASTNodeType.Identifier, token.getText());
 } else if (token.getType() == TokenType.LeftParen) {
     tokens.read();
     node = additive(tokens);
     if (node != null) {
         token = tokens.peek();
         if (token != null && token.getType() == TokenType.RightParen) {
             tokens.read();
         } else {
             throw new Exception("expecting right parenthesis");
         }
     } else {
         throw new Exception("expecting an additive expression inside parenthesis");
     }
 }

} return node; //这个方法也做了AST的简化，就是不用构造一个primary节点，直接返回子节点。因为它只有一个子节点。 } ```

【源码分析】编译原理之美手动实现脚本解释器源码分析

概述

项目目录分析

词法分析

语法解析

语义识别

词法分析器

Token.java

TokenType.java

SimpleLexer.java

SimpleToken

SimpleTokenReader

DfaState

tokenize

initToken

辅助函数

ASTNodeType.java

SimpleCalculator.java

SimpleASTNode

parse

prog

语法解析

REPL界面