:::info 本文更新于 2018.12.19 :::
Smallwrite-McClellan bytecode结构
以下是序列化后的smwr database, 未经序列化的database有些偏移不是如此:
+------------------+
| hs_database_t | 32
+------------------+ --------------
| SmallWriteEngine | 64
+------------------+ --------------
| NFA | 64
+------------------+ --------------
| McClellan | 304
+------------------+ --------------
| succ_table | aux_offset-NFA-McClellan smwr_nfa.raw (与NFA8或NFA16有关)
+------------------+ --------------
| aux | 16 * state_count
+------------------+ --------------
| report lists | 与aux中accept等具体匹配情况有关
+------------------+ --------------
模式/a/生成的McClellan8字节码分析
hs_database_t
00000000 db db db db 00 00 06 04 00 02 00 00 00 80 01 00
00000010 00 00 00 00 da 67 fc 9d 00 00 00 00 00 00 00 00
hs_database_t结构体:
/*
* a header to enclose the actual bytecode - useful for keeping info about the
* compiled data.
*/
struct hs_database {
u32 magic;
u32 version;
u32 length;
u64a platform;
u32 crc32;
u32 reserved0;
u32 reserved1;
u32 bytecode; // offset relative to db start
u32 padding[16];
char bytes[];
};
- magic: 0xdbdbdbdb
- version: 0x04060000 4.6.0.0
- length: 0x00000200 = 512 除hs_database首部以外的总长度
- platform: 0x00000000 00018000
- crc32: 9dfc67da
SmallWriteEngine
00000020 23 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SmallWriteEngine结构体, 对齐到64字节:
// Runtime structure header for SmallWrite.
struct ALIGN_CL_DIRECTIVE SmallWriteEngine {
u32 largestBuffer; /**< largest buffer that can be considered small write */
u32 start_offset; /**< where to start scanning in the buffer. */
u32 size; /**< size of the small write engine in bytes (including the nfa) */
};
- largestBuffer: 0x00000023 = 35, 支持的最大匹配长度(原数据长度)
- start_offset: 0, 匹配起始点, 数据长度小于等于此值将直接退出
- size: 0x00000200 = 512 (544 - 32(hs_database) = 512)
:::warning 数据长度超过largestBuffer则不会被匹配, 直接退出 :::
NFA
00000060 00 00 00 00 c0 01 00 00 06 00 00 00 00 00 00 00
00000070 00 00 00 00 03 00 00 00 01 00 00 00 01 00 00 00
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
NFA结构体, 对齐到64字节:
/** \brief header for the NFA implementation. */
struct ALIGN_CL_DIRECTIVE NFA {
u32 flags;
/** \brief The size in bytes of the NFA engine. The engine is
* serialized to the extent that copying length bytes back into a
* 16-byte aligned memory location yields a structure that has the same
* behaviour as the original engine. */
u32 length;
/** \brief Active implementation used by this NFAEngineType */
u8 type;
u8 rAccelType;
u8 rAccelOffset;
u8 maxBiAnchoredWidth; /**< if non zero, max width of the block */
union {
u8 c;
u16 dc;
u8 array[2];
} rAccelData;
u32 queueIndex; /**< index of the associated queue in scratch */
/** \brief The number of valid positions/states for this NFA. Debug only */
u32 nPositions;
/** \brief Size of the state required in scratch space.
*
* This state has less strict size requirements (as it doesn't go in stream
* state) and does not persist between stream writes.
*/
u32 scratchStateSize;
/** \brief Size of the state required in stream state.
*
* This encompasses all state stored by the engine that must persist between
* stream writes. */
u32 streamStateSize;
u32 maxWidth; /**< longest possible match in this NFA, 0 if unbounded */
u32 minWidth; /**< minimum bytes required to match this NFA */
u32 maxOffset; /**< non zero: maximum offset this pattern can match at */
/* Note: implementation (e.g. a LimEx) directly follows struct in memory */
} ;
- flags: 0x00000000
- length: 0x000001c0 = 448
- type: 0x06 = 6, MCCLELLAN_NFA_8
- rAccelType: 0x00
- rAccelOffset: 0x00
- maxBiAnchoredWIdth: 0x00
- rAccelData: 0x0000
- queueIndex: 0x00000000
- nPositions: 0x00000003
- scratchStateSize: 0x00000001
- streamStateSize: 0x00000001
- maxWidth: 0x00000000 inf
- minWidth: 0x00000000 0
- maxOffset: 0x00000000 inf
type字段代表的是NFA类型, 见 src/nfa/nfa_internal.h
, NFAEngineType枚举. 其中:
- MCCLELLAN_NFA_8 = 6
- MCCLELLAN_NFA_16 = 7
McClellan
000000a0 03 00 00 00 c0 01 00 00 01 00 01 00 80 01 00 00
000000b0 00 00 00 00 00 00 00 00 02 00 02 00 00 00 01 01
000000c0 00 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
000000d0 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
000000e0 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
000000f0 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000100 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000110 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000120 01 01 00 01 01 01 01 01 01 01 01 01 01 01 01 01
00000130 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000140 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000150 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000160 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000170 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000180 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
00000190 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
000001a0 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
000001b0 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
000001c0 01 00 00 00 00 00 00 00 80 01 00 00 00 00 00 00
mcclellan结构体:
struct mcclellan {
u16 state_count; /**< total number of states */
u32 length; /**< length of dfa in bytes */
u16 start_anchored; /**< anchored start state */
u16 start_floating; /**< floating start state */
u32 aux_offset; /**< offset of the aux structures relative to the start of
* the nfa structure */
u32 sherman_offset; /**< offset of array of sherman state offsets the
* state_info structures relative to the start of the
* nfa structure */
u32 sherman_end; /**< offset of the end of the state_info structures
* relative to the start of the nfa structure */
u16 accel_limit_8; /**< 8 bit, lowest accelerable state */
u16 accept_limit_8; /**< 8 bit, lowest accept state */
u16 sherman_limit; /**< lowest sherman state */
u8 alphaShift;
u8 flags;
u8 has_accel; /**< 1 iff there are any accel plans */
u8 remap[256]; /**< remaps characters to a smaller alphabet */
ReportID arb_report; /**< one of the accepts that this dfa may raise */
u32 accel_offset; /**< offset of the accel structures from start of NFA */
u32 haig_offset; /**< reserved for use by Haig, relative to start of NFA */
};
- state_count: 0x0003 3个state
- length: 0x000001c0 = 448
- start_anchored: 0x0001 = 1
- start_floating: 0x0001 = 1
- aux_offset: 0x00000180 = 384
- sherman_offset: 0x00000000
- sherman_end: 0x00000000
- accel_limit_8: 0x0002 = 2
- accept_limit_8: 0x0002 = 2
- sherman_limit: 0x0000
- alphaShift: 0x01
- flags: 0x1
- has_accel: 0
- remap[256]: 256字节 其中remap[97] = 0, 对应的是字符’a’, 其余全是1
- arb_report: 0x00000000
- accel_offset: 0x00000180 = 384
- haig_offset: 0x00000000
NFA和McClellan结构体一共是64+304 = 368字节, aux_offset是384, 意味着McClellan和aux之间还有384-368 = 16字节的succ_table区.
succ_table
000001d0 00 00 02 01 02 01 00 00 00 00 00 00 00 00 00 00
状态转换表, 内部称为succ_table. 其实succ_table还不是直接能跳转的转换表, 需要有一个转换操作
trans_table[c] = succ_table[(s<<shift) + remap[c]]
其中c是条件, s是当前状态, shift是mcclellan->alphaShift, remap是 mcclellan->remap.
转换表共有257维, 不是256! 其中trans_table[256] = aux->top & STATE_MASK,
STATE_MASK = 0x3fff
状态1:
trans[0] = succ[1<<1 + remap[0]] = succ[2 + 1] = 1
trans['a'] = succ[1<<1 + remap['a']] = succ[2+0] = 2
trans[256] = aux[1]->top & 0x3fff = 1
状态2:
trans[0] = succ[2<<1 + remap[0]] = succ[4+1] = 1
trans['a'] = succ[2<<1 + remap['a']] = succ[4+0] = 2
trans[256] = aux[2]->top & 0x3fff = 2
对于McClellan8, 状态数小于2^8, 因此succ_table每一项占用的空间是8bit=1byte.
// src/nfa/mcclellan.c, doNormal8
const u8 *succ_table = (const u8 *)((const char *)m
+ sizeof(struct mcclellan));
而对于McClellan16, succ_table每一项则占用16bit=2bytes空间.
// src/nfa/mcclellan.c, doNormal16
const u16 *succ_table = (const u16 *)((const char *)m
+ sizeof(struct mcclellan));
:::warning 目前smwr分支最多只支持65535个状态 :::
aux
000001e0 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
000001f0 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
00000200 b0 01 00 00 00 00 00 00 02 00 00 00 00 00 00 00
48字节aux区, aux数即mstate_aux结构体, 它的数量与state数量相同
struct mstate_aux {
u32 accept;
u32 accept_eod;
u16 top;
u32 accel_offset;
/**< relative to start of struct mcclellan; 0 if no accel */
};
aux0:
accept: 0x00000000 = 0
accept_eod: 0x00000000 = 0
top: 0x0001 = 1
accel_offset: 0x00000000 = 0
aux1:
accept: 0x00000000 = 0
accept_eod: 0x00000000 = 0
top: 0x0001 = 1
accel_offset: 0x00000000 = 0
aux2:
accept: 0x000001b0 = 432
accept_eod: 0x00000000 = 0
top: 0x0002 = 2
accel_offset: 0x00000000 = 0
report lists
00000210 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
最后是report list区, 即我之前算法中的match id区, 存储模式串的id,
这里存储的是多个report_list结构体:
struct report_list {
u32 count;
ReportID report[];
};
第1个32位整数为 0x00000001 = 1, ReportID(32位整数)是0, 对应模式串/a/.
这个report_list相对于整个NFA的偏移 = 64(NFA) + 304(mcclellan) + 16(succ_table) + 48(aux) = 432, 正好对应状态2的aux->accept
本作品采用知识共享署名-非商业性使用-禁止演绎 3.0 未本地化版本许可协议进行许可。