Bootcamp homework - W02-H1 Regular expressions and an Encoding function - "Geek University Front-End Advanced Bootcamp Study Notes"

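Write a regular expression that matches all Number literals
Write a UTF-8 Encoding function
Write a regular expression that matches all string literals, both single-quoted and double-quoted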

1. Write a regular expression that matches all Number literals

S1 Clarify the goal:

First, look up the definition of number on P166 of ECMA-262.pdf:

Numeric Literals

Syntax

NumericLiteral ::
    DecimalLiteral
    BinaryIntegerLiteral
    OctalIntegerLiteral
    HexIntegerLiteral

DecimalLiteral ::
    DecimalIntegerLiteral . DecimalDigits_opt ExponentPart_opt
    . DecimalDigits ExponentPart_opt
    DecimalIntegerLiteral ExponentPart_opt

DecimalIntegerLiteral ::
    0
    NonZeroDigit DecimalDigits_opt

DecimalDigits ::
    DecimalDigit
    DecimalDigits DecimalDigit

DecimalDigit :: one of 0 1 2 3 4 5 6 7 8 9

NonZeroDigit :: one of 1 2 3 4 5 6 7 8 9

ExponentPart ::
    ExponentIndicator SignedInteger

ExponentIndicator :: one of e E

SignedInteger ::
    DecimalDigits
    + DecimalDigits
    - DecimalDigits

BinaryIntegerLiteral ::
    0b BinaryDigits
    0B BinaryDigits

BinaryDigits ::
    BinaryDigit
    BinaryDigits BinaryDigit

BinaryDigit :: one of 0 1

OctalIntegerLiteral ::
    0o OctalDigits
    0O OctalDigits

OctalDigits ::
    OctalDigit
    OctalDigits OctalDigit

OctalDigit :: one of 0 1 2 3 4 5 6 7

HexIntegerLiteral ::
    0x HexDigits
    0X HexDigits

HexDigits ::
    HexDigit
    HexDigits HexDigit

HexDigit :: one of 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The SourceCharacter immediately following a NumericLiteral must not be an IdentifierStart or DecimalDigit. NOTE For example: 3in is an error and not the two input elements 3 and in. A conforming implementation, when processing strict mode code, must not extend, as described in B.1.1, the syntax of NumericLiteral to include LegacyOctalIntegerLiteral, nor extend the syntax of DecimalIntegerLiteral to include NonOctalDecimalIntegerLiteral.

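S2. Translate it into plain language

A Number literal takes one of four forms: decimal, binary, octal, or hexadecimal.

A decimal number can be written in three ways:
- [decimal integer] . <decimal digits> (optional) <exponent part> (optional)
- . [decimal digits] <exponent part> (optional)
- [decimal integer] <exponent part> (optional)

where:
Decimal digits (DecimalDigits) means one or more of the digits 0 to 9; leading zeros are allowed.
A decimal integer (DecimalIntegerLiteral) is either 0, or a non-zero digit followed by zero or more decimal digits.
The exponent part consists of an exponent indicator followed by a signed integer.
The valid exponent indicators are e and E.
A signed integer takes one of three forms: decimal digits, + decimal digits, or - decimal digits.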

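A binary number must start with 0b or 0B, followed by one or more binary digits, each being 0 or 1.
An octal number must start with 0o or 0O, followed by one or more octal digits, each being one of 0 1 2 3 4 5 6 7.
A hexadecimal number must start with 0x or 0X, followed by one or more hexadecimal digits, each being one of 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F.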
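To make the four forms concrete, here is a small sketch that lets the JavaScript engine itself parse one literal of each shape and compares the result with the expected value. The names `samples` and `allMatch` are only illustrative.

```javascript
// Each pair is [literal as parsed by the engine, expected numeric value].
const samples = [
  [0b101, 5],      // binary: 0b prefix
  [0o17, 15],      // octal: 0o prefix
  [0xFF, 255],     // hexadecimal: 0x prefix
  [10., 10],       // decimal integer, dot, no fraction digits
  [.5e1, 5],       // leading dot, digits, exponent part
  [1e-2, 0.01],    // decimal integer with a signed exponent
  [0.25E+2, 25],   // all three parts present, upper-case indicator
];

const allMatch = samples.every(([actual, expected]) => actual === expected);
console.log(allMatch);
```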

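S3. Turn it into "gibberish"

Start with the simple cases and tackle them one at a time:

Binary: ^(0b|0B)(0|1)+$
Octal: ^(0o|0O)[0-7]+$
Hexadecimal: ^(0x|0X)[0-9a-fA-F]+$
Decimal (the literal dot is written as \. because a bare . matches any character):
1. [decimal integer] . <decimal digits> (optional) <exponent part> (optional):

^(0|([1-9]\d*))\.(\d*)?((e|E)(\+|\-)?\d+)?$

2. . [decimal digits] <exponent part> (optional): ^\.\d+((e|E)(\+|\-)?\d+)?$
3. [decimal integer] <exponent part> (optional): ^(0|([1-9]\d*))((e|E)(\+|\-)?\d+)?$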

Then join them with alternation (|):

(^(0b|0B)(0|1)+$)|(^(0o|0O)[0-7]+$)|(^(0x|0X)[0-9a-fA-F]+$)|(^(0|([1-9]\d*))\.(\d*)?((e|E)(\+|\-)?\d+)?$)|(^\.\d+((e|E)(\+|\-)?\d+)?$)|(^(0|([1-9]\d*))((e|E)(\+|\-)?\d+)?$)
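A quick way to gain confidence in a pattern like this is to run it against inputs that should and should not match. The sketch below rebuilds an equivalent pattern from its six alternatives, with the literal dot escaped as `\.`, and checks a few samples. The names `numericLiteral`, `valid`, and `invalid` are illustrative, and the sample lists are far from exhaustive.

```javascript
// One alternative per form described above; "\\." is a literal dot.
const numericLiteral = new RegExp(
  "^(?:" +
    [
      "0[bB][01]+",                                 // binary
      "0[oO][0-7]+",                                // octal
      "0[xX][0-9a-fA-F]+",                          // hexadecimal
      "(?:0|[1-9]\\d*)\\.\\d*(?:[eE][+-]?\\d+)?",   // 1.  1.5  1.5e3
      "\\.\\d+(?:[eE][+-]?\\d+)?",                  // .5  .5e3
      "(?:0|[1-9]\\d*)(?:[eE][+-]?\\d+)?",          // 0  42  1e10
    ].join("|") +
  ")$"
);

const valid = ["0", "42", "3.14", "10.", ".5", "1e10", "2.5E-3", "0b1010", "0O17", "0xCAFE"];
const invalid = ["", "012", "1e", "0b2", "0x", ".", "3in", "1a5", "+1"];

const allValidMatch = valid.every((s) => numericLiteral.test(s));
const anyInvalidMatch = invalid.some((s) => numericLiteral.test(s));
console.log(allValidMatch, anyInvalidMatch);
```

The input "1a5" is worth keeping in the list: it is accepted by any variant of the pattern in which the dot is left unescaped.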

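2. Write a UTF-8 Encoding function

S1. Pin down the requirement:

Write a function that turns a readable string into its UTF-8 encoded bytes, which look like gibberish when displayed raw.

S2. Summarize the UTF-8 encoding rules

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and it is also a prefix code. It encodes every valid code point in the Unicode character set using one to four bytes, and it is part of the Unicode standard. Small code points tend to be used most often, so storing every character at a fixed width wastes a lot of memory. UTF-8 was also designed to be backward compatible with ASCII: the first 128 Unicode characters, which correspond one-to-one with ASCII, are encoded as a single byte with the same binary value as in ASCII, so software written for ASCII keeps working with little or no modification. As a result, UTF-8 has gradually become the preferred encoding for email, web pages, and other places where text is stored or transmitted.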

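UTF-8 byte sequences are designed with the following properties:

The most significant bit of a single-byte character is always 0.
In a multi-byte sequence, the leading bits of the first byte determine the length of the sequence: a byte starting with 110 begins a 2-byte sequence, 1110 begins a 3-byte sequence, and so on.
Every remaining byte of a multi-byte sequence has 10 as its two most significant bits.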
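These three properties mean a single byte can be classified just by masking its high bits. The following is a minimal sketch of that idea, limited to sequences of up to four bytes; the function name `utf8ByteKind` and its return convention are assumptions made for illustration.

```javascript
// Returns the sequence length announced by a lead byte (1 to 4),
// 0 for a continuation byte (10xxxxxx), or -1 for any other bit pattern.
function utf8ByteKind(byte) {
  if ((byte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((byte & 0b11000000) === 0b10000000) return 0; // 10xxxxxx
  if ((byte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((byte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((byte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  return -1;
}

const kinds = [0x41, 0xC3, 0xA9, 0xE4, 0xF0, 0xFF].map(utf8ByteKind);
console.log(kinds); // [ 1, 2, 0, 3, 4, -1 ]
```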

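Conversion table between Unicode and UTF-8 (each **x** marks a bit taken from the code point). The 5-byte and 6-byte rows, and 4-byte values above U+10FFFF, come from the original UTF-8 design; RFC 3629 restricts UTF-8 to U+0000 through U+10FFFF, so at most four bytes are used today.

Code point bits	First code point	Last code point	Bytes in sequence	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6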
7	U+0000	U+007F	1	0xxxxxxx
11	U+0080	U+07FF	2	110xxxxx	10xxxxxx
16	U+0800	U+FFFF	3	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000	U+1FFFFF	4	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
26	U+200000	U+3FFFFFF	5	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
31	U+4000000	U+7FFFFFFF	6	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx

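S3. Design approach

Walk the string and use the built-in codePointAt() to collect the code point of every character into an array.
Walk the code point array and check which range each code point falls into, which determines how many bytes its sequence needs.
For each range, write the bit operations that produce the encoded bytes.
Convert the bytes to a hexadecimal string for output.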
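The steps above can be sketched as follows. This is one possible reading of the design, not a reference implementation: the names `utf8Encode` and `toHex` are made up for the example, only the one-to-four-byte ranges are handled, and unpaired surrogates are not treated specially.

```javascript
// Encode a string as an array of UTF-8 byte values (0 to 255).
function utf8Encode(text) {
  const bytes = [];
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) {
      bytes.push(cp);                                   // 0xxxxxxx
    } else if (cp <= 0x7ff) {
      bytes.push(0xc0 | (cp >> 6),                      // 110xxxxx
                 0x80 | (cp & 0x3f));                   // 10xxxxxx
    } else if (cp <= 0xffff) {
      bytes.push(0xe0 | (cp >> 12),                     // 1110xxxx
                 0x80 | ((cp >> 6) & 0x3f),
                 0x80 | (cp & 0x3f));
    } else {
      bytes.push(0xf0 | (cp >> 18),                     // 11110xxx
                 0x80 | ((cp >> 12) & 0x3f),
                 0x80 | ((cp >> 6) & 0x3f),
                 0x80 | (cp & 0x3f));
    }
  }
  return bytes;
}

// Format byte values as upper-case, space-separated hex pairs.
function toHex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(toHex(utf8Encode("A中"))); // 41 E4 B8 AD
```

Each branch keeps the top bits of the code point in the lead byte and fills every following byte with the next six bits, which is exactly the x layout of the conversion table.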
S4. Try writing a version myself

S5. Compare with an existing project on GitHub

https://github.com/mathiasbynens/utf8.js/blob/master/utf8.js
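Independently of any third-party library, the built-in `TextEncoder` (available in modern browsers and in Node.js) always produces UTF-8 and can supply reference bytes to compare a hand-written encoder against. A small sketch, with `reference` and `hex` as illustrative names:

```javascript
// TextEncoder.encode() returns a Uint8Array of UTF-8 bytes.
const reference = new TextEncoder().encode("A中😀");

// Array.from with a mapping function yields a plain array of hex strings.
const hex = Array.from(reference, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // 41 e4 b8 ad f0 9f 98 80
```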

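3. Write a regular expression that matches all string literals, both single-quoted and double-quoted

S1 Clarify the goal:

First, look up the definition of String Literals on P169 of ECMA-262.pdf: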

Syntax

StringLiteral ::
    " DoubleStringCharacters_opt "
    ' SingleStringCharacters_opt '

DoubleStringCharacters ::
    DoubleStringCharacter DoubleStringCharacters_opt

SingleStringCharacters ::
    SingleStringCharacter SingleStringCharacters_opt

DoubleStringCharacter ::
    SourceCharacter but not one of " or \ or LineTerminator
    \ EscapeSequence
    LineContinuation

SingleStringCharacter ::
    SourceCharacter but not one of ' or \ or LineTerminator
    \ EscapeSequence
    LineContinuation

LineContinuation ::
    \ LineTerminatorSequence

LineTerminatorSequence ::
    <LF>
    <CR> [lookahead ≠ <LF>]
    <LS>
    <PS>
    <CR> <LF>

EscapeSequence ::
    CharacterEscapeSequence
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    UnicodeEscapeSequence

A conforming implementation, when processing strict mode code, must not extend the syntax of EscapeSequence to include LegacyOctalEscapeSequence as described in B.1.2.

CharacterEscapeSequence ::
    SingleEscapeCharacter
    NonEscapeCharacter

SingleEscapeCharacter :: one of ' " \ b f n r t v

NonEscapeCharacter ::
    SourceCharacter but not one of EscapeCharacter or LineTerminator

EscapeCharacter ::
    SingleEscapeCharacter
    DecimalDigit
    x
    u

HexEscapeSequence ::
    x HexDigit HexDigit

UnicodeEscapeSequence ::
    u Hex4Digits
    u{ CodePoint }

Hex4Digits ::
    HexDigit HexDigit HexDigit HexDigit

CodePoint ::
    HexDigits but only if MV of HexDigits ≤ 0x10FFFF

The definition of the nonterminal HexDigit is given in 11.8.3. SourceCharacter is defined in 10.1.

SourceCharacter :: any Unicode code point

If the phrase “[lookahead ∉ set]” appears in the right-hand side of a production, it indicates that the production may not be used if the immediately following input token sequence is a member of the given set. The set can be written as a comma separated list of one or two element terminal sequences enclosed in curly brackets. For convenience, the set can also be written as a nonterminal, in which case it represents the set of all terminals to which that nonterminal could expand. If the set consists of a single terminal the phrase “[lookahead ≠ terminal]” may be used.

For example, given the definitions:

DecimalDigit :: one of 0 1 2 3 4 5 6 7 8 9

DecimalDigits ::
    DecimalDigit
    DecimalDigits DecimalDigit

the definition:

LookaheadExample ::
    n [lookahead ∉ { 1 , 3 , 5 , 7 , 9 }] DecimalDigits
    DecimalDigit [lookahead ∉ DecimalDigit]

matches either the letter n followed by one or more decimal digits the first of which is even, or a decimal digit not followed by another decimal digit. Similarly, if the phrase “[lookahead ∈ set]” appears in the right-hand side of a production, it indicates that the production may only be used if the immediately following input token sequence is a member of the given set. If the set consists of a single terminal the phrase “[lookahead = terminal]” may be used.

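S2 Translate it into plain language

A string literal takes one of two forms: enclosed in single quotes, or enclosed in double quotes. The content between the quotes may be empty.

Inside double quotes, the following are allowed:
- Any source character (that is, any Unicode code point) except the double quote ", the escape character \, and the line terminators (<LF>, <CR>, <LS>, <PS>)
- An escape sequence (EscapeSequence) introduced by \
- A line continuation (LineContinuation)
Single quotes work the same way, except that the first rule excludes the single quote ' instead:
- Any source character (that is, any Unicode code point) except the single quote ', the escape character \, and the line terminators (<LF>, <CR>, <LS>, <PS>)
- An escape sequence (EscapeSequence) introduced by \
- A line continuation (LineContinuation)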

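where:

A line continuation (LineContinuation) is a \ followed by one of five line terminator sequences: <LF>, a <CR> that is not followed by <LF>, <LS>, <PS>, or <CR><LF>.
An escape sequence (EscapeSequence) is one of:
- A 0 that is not followed by a decimal digit, i.e. \0
- A character escape sequence (CharacterEscapeSequence):
  1. Either a single escape character, one of ' " \ b f n r t v
  2. Or a non-escape character, meaning any character other than the single escape characters, the decimal digits, x, u, and the line terminators.
- A hexadecimal escape sequence (HexEscapeSequence): x followed by exactly two hexadecimal digits, e.g. \x41
- A Unicode escape sequence (UnicodeEscapeSequence), which has two spellings:
  1. \u followed by four hexadecimal digits, e.g. \u00A0
  2. \u followed by {code point}, e.g. \u{79}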
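The rules above map fairly directly onto a regular expression. The sketch below is one possible translation, assuming the `u` flag so that a character class consumes whole code points. It does not check that a `\u{...}` code point stays at or below 0x10FFFF, and the names `LT`, `afterBackslash`, `body`, and `stringLiteral` are illustrative only.

```javascript
// Line terminators: <LF>, <CR>, <LS> (U+2028), <PS> (U+2029).
const LT = String.raw`\n\r\u2028\u2029`;

// Everything that may follow a backslash inside a string literal.
const afterBackslash = [
  String.raw`u[0-9a-fA-F]{4}`,    // Unicode escape with four hex digits
  String.raw`u\{[0-9a-fA-F]+\}`,  // Unicode escape with braces
  String.raw`x[0-9a-fA-F]{2}`,    // hex escape, exactly two digits
  String.raw`0(?![0-9])`,         // \0 not followed by a decimal digit
  String.raw`\r\n|[${LT}]`,       // line continuation
  String.raw`[^0-9xu${LT}]`,      // single escape or non-escape character
].join("|");

// Body of a literal delimited by the given quote character.
const body = (quote) =>
  String.raw`(?:[^${quote}\\${LT}]|\\(?:${afterBackslash}))*`;

const stringLiteral = new RegExp(`^(?:"${body('"')}"|'${body("'")}')$`, "u");

// Inputs are source text, so backslashes are doubled in these test strings.
const valid = ['""', "''", '"abc"', "'it\\'s'", '"say \'hi\'"',
               '"\\u00A0"', '"\\u{79}"', '"\\x41"', '"\\0"', '"a\\\nb"'];
const invalid = ['"abc\'', "'it's'", '"a\nb"', '"\\x4"', '"\\u12"', '"\\01"', '"abc'];

const allValidMatch = valid.every((s) => stringLiteral.test(s));
const anyInvalidMatch = invalid.some((s) => stringLiteral.test(s));
console.log(allValidMatch, anyInvalidMatch);
```

Building the pattern from named pieces keeps the double-quoted and single-quoted branches identical except for the excluded quote character, which mirrors how the grammar defines them.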
