How to Run WMT14 EN-DE Translation with fairseq - "Guide to Using Machine Translation Model Tools"

Preface

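This article uses a WMT 2014 English-German experiment, training a Transformer-based machine translation model, to show how to use the fairseq toolkit.

Tools and scripts to install beforehand: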

1. fairseq

The fairseq release that pip installs directly (0.9.0) dates from December 2019. To use newer features, we install the latest version from GitHub instead:

git clone https://github.com/pytorch/fairseq.git 
cd fairseq
pip install --editable ./

2. mosesdecoder

git clone https://github.com/moses-smt/mosesdecoder.git

3. subword-nmt

git clone https://github.com/rsennrich/subword-nmt.git

4. Evaluation scripts

multi-bleu.perl, located under mosesdecoder/scripts/generic/
sacreBLEU

1. Data preparation

Download via script:

Use the script provided by fairseq to download the data:

bash fairseq/examples/translation/prepare-wmt14en2de.sh

This script downloads the WMT English-German parallel data, then tokenizes it and applies BPE. The result is written to a wmt17_en_de directory:

wmt17_en_de
├── code
├── test.de
├── test.en
├── tmp
├── train.de
├── train.en
├── valid.de
└── valid.en
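
Before moving on, it can help to confirm that every split exists in both languages and that each pair has the same number of lines. The following is a minimal sketch, assuming the layout listed above; the helper name check_parallel_data is hypothetical and is not part of fairseq.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")
LANGS = ("en", "de")


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_parallel_data(data_dir):
    """Return {split: line_count}; raise if a file is missing or a pair is misaligned."""
    data_dir = Path(data_dir)
    counts = {}
    for split in SPLITS:
        sizes = []
        for lang in LANGS:
            path = data_dir / f"{split}.{lang}"
            if not path.is_file():
                raise FileNotFoundError(path)
            sizes.append(count_lines(path))
        # A parallel corpus must have exactly one target line per source line.
        if len(set(sizes)) != 1:
            raise ValueError(f"{split}: line counts differ across languages: {sizes}")
        counts[split] = sizes[0]
    return counts
```

Calling check_parallel_data("wmt17_en_de") returns the line count of each split, or raises if something is missing or misaligned.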

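Manual download:

1.1 Downloading and processing the data

Taking de-en as the example, download the parallel corpora.
Download the training data from http://www.statmt.org/wmt14/translation-task.html
It consists of three parts: Europarl v7, the Common Crawl corpus, and News Commentary. Concatenate the de-en.en files of the three parts into train_raw.en (4,520,620 lines) and the de-en.de files into train_raw.de (4,520,620 lines).
Clean the training data with the command below, where $mosesdecoder is the directory containing clean-corpus-n.perl: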

perl $mosesdecoder/clean-corpus-n.perl -ratio 1.5 train_raw en de train 1 250

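The command drops sentence pairs in which either side is shorter than 1 or longer than 250 tokens, or whose length ratio exceeds 1.5. It produces the cleaned training data train.en (3,907,123 lines) and train.de (3,907,123 lines).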
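
To make the effect of the arguments (-ratio 1.5, minimum 1, maximum 250) concrete, here is a rough Python sketch of the length and ratio filter. It only approximates those two checks and is not a port of clean-corpus-n.perl, which does more; the function names keep_pair and clean_corpus are hypothetical.

```python
def keep_pair(src, tgt, min_len=1, max_len=250, max_ratio=1.5):
    """Decide whether a sentence pair survives the length and ratio filter."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    shorter, longer = sorted((src_len, tgt_len))
    # Reject empty, too-short, or too-long sides.
    if shorter == 0 or shorter < min_len or longer > max_len:
        return False
    # Reject pairs whose lengths differ by more than the allowed ratio.
    return longer / shorter <= max_ratio


def clean_corpus(src_lines, tgt_lines, **limits):
    """Return the (source, target) pairs that pass keep_pair."""
    return [(s, t) for s, t in zip(src_lines, tgt_lines) if keep_pair(s, t, **limits)]
```

With these defaults a 3-token sentence paired with a 2-token sentence is kept (ratio exactly 1.5), while a 4-token sentence paired with a 2-token sentence is dropped.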

Download the development data from http://data.statmt.org/wmt17/translation-task/dev.tgz
Use dev/newstest2013-src.en.sgm and dev/newstest2013-src.de.sgm as the development set, extracting the plain text from the SGM markup:

grep '<seg id' newstest2013-src.en.sgm | sed -e 's/<seg id="[0-9]*">\s*//g' | sed -e 's/\s*<\/seg>\s*//g' | sed -e "s/\’/\'/g" > valid.en

Process the de file the same way. This produces two files:
valid.en (3,000 lines)
valid.de (3,000 lines)

Download the test data from http://statmt.org/wmt14/test-full.tgz
Taking test-full/newstest2014-deen-src.en.sgm as the example:

grep '<seg id' newstest2014-deen-src.en.sgm | sed -e 's/<seg id="[0-9]*">\s*//g' | sed -e 's/\s*<\/seg>\s*//g' | sed -e "s/\’/\'/g" > test.en

Process the de file the same way. This produces two files:
test.en (3,003 lines)
test.de (3,003 lines)
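
The grep and sed pipeline used for the development and test sets keeps only the text inside each <seg> element and replaces typographic apostrophes with ASCII ones. A minimal Python sketch of the same idea follows, assuming each segment sits on a single line as in those commands; extract_segments is a hypothetical helper name.

```python
import re

# Matches one segment per line and captures its text without surrounding spaces.
SEG_PATTERN = re.compile(r'<seg id="[0-9]+">\s*(.*?)\s*</seg>')


def extract_segments(sgm_lines):
    """Return the plain-text content of every <seg> line, in file order."""
    segments = []
    for line in sgm_lines:
        match = SEG_PATTERN.search(line)
        if match:
            # Normalize the right single quotation mark to an ASCII apostrophe.
            segments.append(match.group(1).replace("\u2019", "'"))
    return segments
```

For example, writing "\n".join(extract_segments(open("newstest2014-deen-src.en.sgm", encoding="utf-8"))) to test.en would play the role of the shell command above.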

1.2 Tokenization

$moses is mosesdecoder/scripts/tokenizer.
$file1 through $file6 are train.en, train.de, valid.en, valid.de, test.en and test.de respectively.

cat $file1 | perl $moses/normalize-punctuation.perl -l en | perl $moses/remove-non-printing-char.perl | perl $moses/tokenizer.perl -threads 8 -a -l en > train.tok.en

cat $file2 | perl $moses/normalize-punctuation.perl -l de | perl $moses/remove-non-printing-char.perl | perl $moses/tokenizer.perl -threads 8 -a -l de > train.tok.de

cat $file3 | perl $moses/normalize-punctuation.perl -l en | perl $moses/remove-non-printing-char.perl | perl $moses/tokenizer.perl -threads 8 -a -l en > valid.tok.en

cat $file4 | perl $moses/normalize-punctuation.perl -l de | perl $moses/remove-non-printing-char.perl | perl $moses/tokenizer.perl -threads 8 -a -l de > valid.tok.de

cat $file5 | perl $moses/normalize-punctuation.perl -l en | perl $moses/remove-non-printing-char.perl | perl $moses/tokenizer.perl -threads 8 -a -l en > test.tok.en

cat $file6 | perl $moses/normalize-punctuation.perl -l de | perl $moses/remove-non-printing-char.perl | perl $moses/tokenizer.perl -threads 8 -a -l de > test.tok.de

After these steps, the files train.tok.en, train.tok.de, valid.tok.en, valid.tok.de, test.tok.en and test.tok.de are created next to train.en and the other inputs, assuming the commands are run from that directory.
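
Since the six commands differ only in the file name and the language code, they can be generated in a loop. The sketch below builds the command strings for the same three-stage pipeline (normalize punctuation, remove non-printing characters, tokenize). It is an illustration under assumptions: it passes -l to normalize-punctuation.perl, writes the output with a plain > redirect, and the function names are hypothetical.

```python
import shlex


def tokenize_command(src, dst, lang, moses_dir, threads=8):
    """Build the shell pipeline that tokenizes one file with the Moses scripts."""
    q = shlex.quote
    return (
        f"perl {q(moses_dir + '/normalize-punctuation.perl')} -l {q(lang)} < {q(src)}"
        f" | perl {q(moses_dir + '/remove-non-printing-char.perl')}"
        f" | perl {q(moses_dir + '/tokenizer.perl')} -threads {threads} -a -l {q(lang)}"
        f" > {q(dst)}"
    )


def all_tokenize_commands(moses_dir="mosesdecoder/scripts/tokenizer"):
    """Return one command per split and language, e.g. train.en -> train.tok.en."""
    return [
        tokenize_command(f"{split}.{lang}", f"{split}.tok.{lang}", lang, moses_dir)
        for split in ("train", "valid", "test")
        for lang in ("en", "de")
    ]
```

Each string can be printed for review or run with subprocess.run(command, shell=True, check=True).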

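1.3 BPE

The key parameter of BPE is bpe_size, the number of merge operations to learn, which roughly determines the size of the subword vocabulary. The right value depends on the dataset; here I set $bpe_size = 30000.
Because English and German are closely related, a single joint BPE code file is learned for both languages.
$subword is subword-nmt/subword_nmt.

cat train.tok.en train.tok.de | python $subword/learn_bpe.py -s $bpe_size > train.bpe

Once the BPE codes have been learned, apply them to each file: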

python $subword/apply_bpe.py -c train.bpe < $data_dir/train.tok.en > new_bpe/train.bpe.en

python $subword/apply_bpe.py -c train.bpe < $data_dir/train.tok.de > new_bpe/train.bpe.de

python $subword/apply_bpe.py -c train.bpe < $data_dir/valid.tok.en > new_bpe/valid.bpe.en

python $subword/apply_bpe.py -c train.bpe < $data_dir/valid.tok.de > new_bpe/valid.bpe.de

python $subword/apply_bpe.py -c train.bpe < $data_dir/test.tok.en > new_bpe/test.bpe.en

python $subword/apply_bpe.py -c train.bpe < $data_dir/test.tok.de > new_bpe/test.bpe.de

After these steps, the new_bpe directory contains six files: train.bpe.en, train.bpe.de, valid.bpe.en, valid.bpe.de, test.bpe.en and test.bpe.de.
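
To show what a "merge operation" means, here is a toy version of the BPE learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It follows the textbook algorithm on a four-word vocabulary and is not the subword-nmt implementation; the vocabulary, the </w> end-of-word marker and the function names are illustrative assumptions.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the learned merges (in order) and the vocabulary after merging."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first one wins ties
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Words are pre-split into characters; the values are word frequencies.
toy_vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, merged_vocab = learn_bpe(toy_vocab, num_merges=3)
```

After three merges the suffix "est</w>" has become a single symbol. With 30000 merges on real data, frequent words end up as single units while rare words stay split into smaller pieces.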

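2. Data preprocessing

Binarizing the data:

Use the fairseq-preprocess command to convert the text data into binary files.
$data is the directory holding the data prepared in step 1. The prefixes below match the file names produced by the download script (train.de, train.en and so on); for the files produced by the manual steps, use $data/train.bpe, $data/valid.bpe and $data/test.bpe instead.

fairseq-preprocess --source-lang de --target-lang en --trainpref $data/train --validpref $data/valid --testpref $data/test

The preprocessing command first builds the vocabularies from the training text. By default, every word that occurs is kept, the words are sorted by frequency, and this sorted list becomes the final vocabulary. The output is saved to data-bin under the current directory by default (fairseq/data-bin when run from the fairseq repository).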
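
The vocabulary construction described above can be pictured with a few lines of Python. This is a simplified sketch, not fairseq's Dictionary class: it ignores special symbols, and the alphabetical tie-break is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def build_vocab(lines):
    """Return (token, count) pairs sorted by descending frequency."""
    counts = Counter(token for line in lines for token in line.split())
    # Most frequent first; ties are broken alphabetically (an assumption of this sketch).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


vocab = build_vocab(["das ist gut", "das ist sehr gut", "das Haus"])
```

Because the input has already been through BPE, the "words" counted at this stage are subword units, so the vocabulary size stays close to the number of merge operations chosen earlier.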

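3. Model training

$gpu_id is the index of the GPU to use. Trained checkpoints are saved to the checkpoints directory under the current directory by default (fairseq/checkpoints when run from the fairseq repository). The command below redirects its output to log/translate.log, so the log directory must exist beforehand.
fairseq-train offers a large number of options for customizing training. The main groups are data, model, optimization, training (distributed and multi-GPU), logging, and checkpointing. For the full list of options see https://fairseq.readthedocs.io/en/latest/command_line_tools.html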

CUDA_VISIBLE_DEVICES=$gpu_id fairseq-train data-bin --arch transformer --max-tokens 4096 --max-update 30000 --optimizer adam --lr-scheduler inverse_sqrt --lr 0.0007 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --no-progress-bar --save-interval-updates 1000 1>log/translate.log 2>&1 &
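
The command above selects the inverse_sqrt learning-rate scheduler with a peak rate of 0.0007. As a rough illustration of the shape of such a schedule (linear warmup, then decay proportional to the inverse square root of the update number), consider the sketch below. The warmup length of 4000 updates is an assumption for illustration only, since the command does not set --warmup-updates, and the function is not fairseq code.

```python
def inverse_sqrt_lr(update, peak_lr=0.0007, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update for a warmup + inverse-square-root schedule."""
    if warmup_updates < 1:
        raise ValueError("this sketch needs at least one warmup update")
    if update < warmup_updates:
        # Linear ramp from warmup_init_lr up to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * update / warmup_updates
    # After warmup: peak_lr * sqrt(warmup_updates / update).
    return peak_lr * (warmup_updates / update) ** 0.5
```

Under these assumptions the rate reaches 0.0007 at update 4000 and has halved by update 16000, because quadrupling the update count halves the inverse square root.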

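4. Decoding

4.1 Decoding with fairseq-generate

$data_bin is the directory holding the binarized data
$checkpoints is the directory holding the training checkpoints
$res is the directory where the decoding output is saved

fairseq-generate $data_bin --path $checkpoints/checkpoint_best.pt --batch-size 128 --beam 8 > $res/predict_test.txt

By default this command decodes the test split of the preprocessed data. Use --gen-subset to decode another split; for example, --gen-subset train translates the entire training set. Add --quiet to show only the translation progress and the final score. Add --remove-bpe to strip the BPE markers (@@) at generation time.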

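4.2 Interactive decoding with fairseq-interactive

$data is the directory holding the data to decode
$checkpoints is the directory holding the training checkpoints
fairseq-interactive decodes interactively, one sentence at a time, and takes essentially the same options as fairseq-generate. The input must already be tokenized and BPE-encoded in the same way as the training data. For example, to translate the sentences in test.de line by line: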

cat $data/test.de | fairseq-interactive data-bin --path $checkpoints/checkpoint_best.pt --remove-bpe

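5. Post-processing the translations

5.1 Extracting the translations

Hypothesis lines in the fairseq-generate output start with H-<sentence id> and are not in the original sentence order, so sort them by id before dropping the id and score columns. The result goes to a temporary file first, because redirecting straight into the file being read would truncate it:

grep ^H predict_test.txt | sed 's/^H-//' | sort -n -k 1 | cut -f 3- > predict_test.tmp && mv predict_test.tmp predict_test.txt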
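
The same extraction can be written in Python, which makes the line format explicit: each hypothesis line is "H-<id>", a score, and the text, separated by tabs. This is a minimal sketch that assumes exactly that three-column layout; extract_hypotheses is a hypothetical helper name.

```python
def extract_hypotheses(generate_lines):
    """Return hypothesis texts ordered by sentence id."""
    hypotheses = {}
    for line in generate_lines:
        if not line.startswith("H-"):
            continue  # skip source (S-), target (T-) and other line types
        tag, _score, text = line.rstrip("\n").split("\t", 2)
        hypotheses[int(tag[2:])] = text
    # Restore the original corpus order so the output aligns with the reference file.
    return [hypotheses[i] for i in sorted(hypotheses)]
```

Sorting by id matters because the evaluation script compares line N of the output with line N of the reference.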

5.2 Removing the BPE markers

If --remove-bpe was not used during decoding, the BPE markers can be removed after extracting the translations:

sed -r 's/(@@ )|(@@ ?$)//g' < predict_test.txt > predict_test_rs.txt
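
For reference, the sed expression translates directly into a Python regular expression. The sketch below applies the same substitution to a single line; remove_bpe is a hypothetical helper name.

```python
import re

# "@@ " joins a subword to the next one; a trailing "@@" can remain at end of line.
BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line):
    """Glue BPE subwords back into full words."""
    return BPE_MARKER.sub("", line)
```

For instance, remove_bpe("the new@@ est ap@@ ple") gives "the newest apple".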

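5.3 Evaluation

multi-bleu.perl compares tokens directly, so the reference must be tokenized the same way as the output; here that is test.tok.en from step 1.2.

perl mosesdecoder/scripts/generic/multi-bleu.perl test.tok.en < predict_test_rs.txt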
