Introduction to Basic Pre-training

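Basic pre-training here means continued pre-training (continue pretraining): existing model parameters are loaded, and the loaded model is then trained further on the pre-training text you have collected. Because the model is initialized from parameters that have already been trained, continued pre-training needs comparatively little compute and data.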

Environment Setup

Set the environment variables and download the sample dataset:

# The first argument to this script selects the visible GPU id(s), e.g. 0
export CUDA_VISIBLE_DEVICES=$1
cd ./examples/appzoo_tutorials/language_modeling/
if [ ! -f ./train.json ]; then
  wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/language_modeling/train.json
fi
if [ ! -f ./dev.json ]; then
  wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/language_modeling/dev.json
fi
MASTER_ADDR=localhost
MASTER_PORT=6009
GPUS_PER_NODE=1
NNODES=1
NODE_RANK=0
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
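If the single-GPU settings above need to be scaled to more GPUs or nodes, the launcher arguments could be assembled by a small helper. This is only a sketch: `build_distributed_args` is a hypothetical function name, not part of EasyNLP, and it simply reproduces the DISTRIBUTED_ARGS string format used above with this tutorial's values as defaults.

```shell
# Hypothetical helper (not part of EasyNLP): prints launcher arguments in the
# same format as DISTRIBUTED_ARGS above.
# Usage: build_distributed_args GPUS_PER_NODE [NNODES] [NODE_RANK] [MASTER_ADDR] [MASTER_PORT]
build_distributed_args() {
  gpus_per_node="${1:-1}"
  nnodes="${2:-1}"
  node_rank="${3:-0}"
  master_addr="${4:-localhost}"
  master_port="${5:-6009}"
  echo "--nproc_per_node $gpus_per_node --nnodes $nnodes --node_rank $node_rank --master_addr $master_addr --master_port $master_port"
}

# Example: one node with 2 GPUs, keeping the other defaults from this tutorial.
DISTRIBUTED_ARGS="$(build_distributed_args 2)"
echo "$DISTRIBUTED_ARGS"
```

With two processes per node, CUDA_VISIBLE_DEVICES would also need to expose at least two devices.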

Training with Continued Pre-training

python -m torch.distributed.launch $DISTRIBUTED_ARGS main.py \
    --mode=train \
    --worker_gpu=1 \
    --tables=train.json,dev.json \
    --learning_rate=1e-4  \
    --epoch_num=3  \
    --logging_steps=100 \
    --save_checkpoint_steps=100 \
    --sequence_length=128 \
    --train_batch_size=2 \
    --checkpoint_dir=./lm_models \
    --app_name=language_modeling \
    --user_defined_parameters='
        pretrain_model_name_or_path=bert-base-chinese
    '

Fine-tuning the Continued Pre-trained Model on Downstream Tasks

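A model produced by continued pre-training can be used in the same way as the pre-trained model it was loaded from. For example, if continued pre-training started from a BERT model, the saved model can still go through the usual BERT fine-tuning workflow. Simply replace the pytorch_model.bin in the downloaded model's directory under the .easynlp folder with the pytorch_model.bin saved by continued pre-training.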
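The file replacement described above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: `replace_pretrained_weights` is a hypothetical helper, not an EasyNLP command; it assumes pytorch_model.bin sits directly inside the continued pre-training checkpoint directory; and the exact location of the downloaded model under the .easynlp folder is not specified here, so it is passed in as an argument instead of being guessed.

```shell
# Hypothetical helper (not part of EasyNLP): swap the weights of a downloaded
# model for the ones produced by continued pre-training.
# Usage: replace_pretrained_weights CONTINUED_DIR MODEL_DIR
replace_pretrained_weights() {
  continued_dir="$1"   # assumed: the --checkpoint_dir of continued pre-training, e.g. ./lm_models
  model_dir="$2"       # directory of the downloaded model under the .easynlp folder
  src="$continued_dir/pytorch_model.bin"
  dst="$model_dir/pytorch_model.bin"

  if [ ! -f "$src" ]; then
    echo "missing $src" >&2
    return 1
  fi
  if [ ! -d "$model_dir" ]; then
    echo "missing directory $model_dir" >&2
    return 1
  fi
  # Keep a one-time backup of the original weights so the swap can be undone.
  if [ -f "$dst" ] && [ ! -f "$dst.orig" ]; then
    cp "$dst" "$dst.orig"
  fi
  cp "$src" "$dst"
}

# Example call; MODEL_DIR is a placeholder that must point to wherever the
# bert-base-chinese files were actually downloaded on your machine.
# replace_pretrained_weights ./lm_models "$MODEL_DIR"
```

Backing up the original file first means the stock bert-base-chinese weights can be restored by copying pytorch_model.bin.orig back.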

train: with pretrain_model_name_or_path=bert-base-chinese, replace the pytorch_model.bin of the loaded bert-base-chinese model with the pytorch_model.bin produced by continued pre-training.

easynlp \
    --mode=train \
    --worker_gpu=1 \
    --tables=train.tsv,dev.tsv \
    --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
    --first_sequence=sent1 \
    --second_sequence=sent2 \
    --label_name=label \
    --label_enumerate_values=0,1 \
    --checkpoint_dir=./classification_model \
    --learning_rate=3e-5  \
    --epoch_num=3  \
    --random_seed=42 \
    --save_checkpoint_steps=50 \
    --sequence_length=128 \
    --micro_batch_size=32 \
    --app_name=text_classify \
    --user_defined_parameters='
        pretrain_model_name_or_path=bert-base-chinese
    '

evaluate & predict: the workflow is the same as for BERT.

easynlp \
    --mode=evaluate \
    --worker_gpu=1 \
    --tables=dev.tsv \
    --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
    --first_sequence=sent1 \
    --second_sequence=sent2 \
    --label_name=label \
    --label_enumerate_values=0,1 \
    --checkpoint_path=./classification_model \
    --micro_batch_size=32 \
    --sequence_length=128 \
    --app_name=text_classify

easynlp \
    --mode=predict \
    --worker_gpu=1 \
    --tables=dev.tsv \
    --outputs=dev.pred.tsv \
    --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
    --output_schema=predictions,probabilities,logits,output \
    --append_cols=label \
    --first_sequence=sent1 \
    --second_sequence=sent2 \
    --checkpoint_path=./classification_model \
    --micro_batch_size=32 \
    --sequence_length=128 \
    --app_name=text_classify
