- We are releasing code to do “masked LM” and “next sentence prediction” on an arbitrary text corpus. Note that this is not the exact code that was used for the paper (the original code was written in C++, and had some additional complexity), but this code does generate pre-training data as described in the paper.
- Here’s how to run the data generation. The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the “next sentence prediction” task.) Documents are delimited by empty lines. The output is a set of `tf.train.Example`s serialized into `TFRecord` file format.
- You can perform sentence segmentation with an off-the-shelf NLP toolkit such as spaCy (https://spacy.io/); a minimal segmentation sketch is included after the command below. The `create_pretraining_data.py` script will concatenate segments until they reach the maximum sequence length to minimize computational waste from padding (see the script for more details). However, you may want to intentionally add a slight amount of noise to your input data (e.g., randomly truncate 2% of input segments) to make it more robust to non-sentential input during fine-tuning.
- This script stores all of the examples for the entire input file in memory, so for large data files you should shard the input file and call the script multiple times. (You can pass in a file glob to `run_pretraining.py`, e.g., `tf_examples.tf_record*`.)
- The `max_predictions_per_seq` is the maximum number of masked LM predictions per sequence. You should set this to around `max_seq_length` * `masked_lm_prob` (the script doesn’t do that automatically because the exact value needs to be passed to both scripts).

```shell
python create_pretraining_data.py \
--input_file=./sample_text.txt \
--output_file=/tmp/tf_examples.tfrecord \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
  --dupe_factor=5
```
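As a minimal sketch of the sentence segmentation step mentioned above (this is not part of the repository; the file names and the use of spaCy’s `en_core_web_sm` model are assumptions for illustration), the following converts raw documents into the one-sentence-per-line, blank-line-delimited format that `create_pretraining_data.py` expects:

```python
# Sketch: segment raw text into the input format expected by
# create_pretraining_data.py: one sentence per line, with a blank line
# between documents. Assumes spaCy is installed along with the
# en_core_web_sm model (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical input: one raw document per line of raw_documents.txt.
with open("raw_documents.txt", encoding="utf-8") as fin, \
     open("pretraining_input.txt", "w", encoding="utf-8") as fout:
    for raw_doc in fin:
        raw_doc = raw_doc.strip()
        if not raw_doc:
            continue
        for sent in nlp(raw_doc).sents:
            fout.write(sent.text.strip() + "\n")  # one sentence per line
        fout.write("\n")                          # blank line ends the document
```

Note also that the example command sets `--max_predictions_per_seq=20`, which is consistent with the rule of thumb above: 128 * 0.15 = 19.2, so 20 is a sensible value for these flags.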
Here’s how to run the pre-training. Do not include `init_checkpoint` if you are pre-training from scratch. The model configuration (including vocab size) is specified in `bert_config_file`. This demo code only pre-trains for a small number of steps (20), but in practice you will probably want to set `num_train_steps` to 10000 steps or more. The `max_seq_length` and `max_predictions_per_seq` parameters passed to `run_pretraining.py` must be the same as those passed to `create_pretraining_data.py`; a quick way to sanity-check this against the generated data is sketched below, after the sample output.

```shell
python run_pretraining.py \
--input_file=/tmp/tf_examples.tfrecord \
--output_dir=/tmp/pretraining_output \
--do_train=True \
--do_eval=True \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=10 \
  --learning_rate=2e-5
```
This will produce an output like this:

```
***** Eval results *****
global_step = 20
loss = 0.0979674
masked_lm_accuracy = 0.985479
masked_lm_loss = 0.0979328
next_sentence_accuracy = 1.0
next_sentence_loss = 3.45724e-05
```
Note that since our `sample_text.txt` file is very small, this example training will overfit that data in only a few steps and produce unrealistically high accuracy numbers.
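If you want to confirm that the data you generated matches the flags you are about to pass to `run_pretraining.py`, here is a small sketch (not part of the released scripts) that reads one record from `/tmp/tf_examples.tfrecord` and checks the feature lengths written by `create_pretraining_data.py`:

```python
# Sketch: verify that a generated TFRecord matches max_seq_length and
# max_predictions_per_seq before launching run_pretraining.py.
# Uses tf.compat.v1 so it runs under both TF 1.x and TF 2.x.
import tensorflow as tf

max_seq_length = 128
max_predictions_per_seq = 20

record_iter = tf.compat.v1.io.tf_record_iterator("/tmp/tf_examples.tfrecord")
example = tf.train.Example.FromString(next(record_iter))
features = example.features.feature

assert len(features["input_ids"].int64_list.value) == max_seq_length
assert len(features["masked_lm_positions"].int64_list.value) == max_predictions_per_seq
print("Feature lengths match the pre-training flags.")
```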
1. Pre-training tips and caveats

- If using your own vocabulary, make sure to change `vocab_size` in `bert_config.json`. If you use a larger vocabulary without changing this, you will likely get NaNs when training on GPU or TPU due to unchecked out-of-bounds access; a quick consistency check is sketched after this list.
- If your task has a large domain-specific corpus available (e.g., “movie reviews” or “scientific papers”), it will likely be beneficial to run additional steps of pre-training on your corpus, starting from the BERT checkpoint.
- The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).
- Longer sequences are disproportionately expensive because attention is quadratic in the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128.
- The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences: both batches contain 32,768 tokens, yet attention work scales roughly with batch size × sequence length², so 64 × 512² is four times 256 × 128². Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly.
- Note that this does require generating the data twice with different values of `max_seq_length`.
- If you are pre-training from scratch, be prepared that pre-training is computationally expensive, especially on GPUs.
- If you are pre-training from scratch, our recommended recipe is to pre-train a `BERT-Base` on a single preemptible Cloud TPU v2 (https://cloud.google.com/tpu/docs/pricing), which takes about 2 weeks at a cost of about $500 USD (based on the pricing in October 2018). You will have to scale down the batch size when only training on a single Cloud TPU, compared to what was used in the paper. It is recommended to use the largest batch size that fits into TPU memory.
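As a small companion to the `vocab_size` tip above (a sketch only; the file paths are placeholders for your own vocabulary and config), you can check that `bert_config.json` and `vocab.txt` agree before training:

```python
# Sketch: make sure vocab_size in bert_config.json equals the number of
# entries in vocab.txt, to avoid out-of-bounds embedding lookups (NaNs).
import json

vocab_file = "vocab.txt"          # your WordPiece vocabulary, one token per line
config_file = "bert_config.json"  # the model configuration used for pre-training

with open(vocab_file, encoding="utf-8") as f:
    num_tokens = sum(1 for line in f if line.strip())

with open(config_file, encoding="utf-8") as f:
    config = json.load(f)

if config["vocab_size"] != num_tokens:
    raise ValueError("bert_config.json has vocab_size=%d, but vocab.txt has %d entries"
                     % (config["vocab_size"], num_tokens))
print("vocab_size matches:", num_tokens)
```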
2. Pre-training data

We will not be able to release the pre-processed datasets used in the paper.
- For Wikipedia, the recommended pre-processing is to download the latest dump (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), extract the text with `WikiExtractor.py` (https://github.com/attardi/wikiextractor), and then apply any necessary cleanup to convert it into plain text; a hedged cleanup sketch follows this list.
- Unfortunately the researchers who collected the BookCorpus (http://yknzhu.wixsite.com/mbweb) no longer have it available for public download. The Project Gutenberg Dataset (https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) is a somewhat smaller (200M word) collection of older books that are public domain.
- Common Crawl (https://commoncrawl.org/) is another very large collection of text, but you will likely have to do substantial pre-processing and cleanup to extract a usable corpus for pre-training BERT.
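As one hedged sketch of the Wikipedia cleanup step (the `extracted/*/wiki_*` layout and the `<doc ...>`/`</doc>` wrapping are assumptions about WikiExtractor’s default output; the output path is a placeholder):

```python
# Sketch: flatten WikiExtractor output into blank-line-delimited documents,
# the layout expected by create_pretraining_data.py. Sentence segmentation
# (see the earlier spaCy sketch) should still be applied afterwards.
import glob

with open("wiki_corpus.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("extracted/*/wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line.startswith("<doc "):
                    continue                # article header tag
                elif line == "</doc>":
                    out.write("\n")         # blank line ends the document
                elif line:
                    out.write(line + "\n")  # article text, one paragraph per line
```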
3. Learning a new WordPiece vocabulary
This repository does not include code for learning a new WordPiece vocabulary. The reason is that the code used in the paper was implemented in C++ with dependencies on Google’s internal libraries. For English, it is almost always better to just start with our vocabulary and pre-trained models.
- For learning vocabularies of other languages, there are a number of open source options available. However, keep in mind that these are not compatible with our `tokenization.py` library (a SentencePiece sketch follows this list):
  - Google’s SentencePiece library
  - tensor2tensor’s WordPiece generation script
  - Rico Sennrich’s Byte Pair Encoding library
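As a minimal sketch of one of these options (SentencePiece; the corpus path, vocabulary size, and model type below are arbitrary examples), note again that the result is not directly usable with `tokenization.py`:

```python
# Sketch: train a SentencePiece vocabulary on a plain-text corpus. The output
# is a SentencePiece model/vocab, NOT a WordPiece vocab.txt compatible with
# this repository's tokenization.py.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=my_corpus.txt "   # plain text, one sentence per line (assumed path)
    "--model_prefix=my_sp "    # writes my_sp.model and my_sp.vocab
    "--vocab_size=32000 "      # example size only
    "--model_type=unigram"     # SentencePiece's default algorithm
)
```

If you do train a vocabulary of your own by any of these routes, remember to update `vocab_size` in `bert_config.json` accordingly (see the tip above).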