1、Pre-trained models

  • We are releasing the BERT-Base and BERT-Large models from the paper.
  • Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers.
    • Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).
  • Cased means that the true case and accent markers are preserved.
    • When using a cased model, make sure to pass **--do_lower_case=False** to the training scripts. (Or pass **do_lower_case=False** directly to **FullTokenizer** if you’re using your own script; see the sketch after this list.)
  • Each .zip file contains three items:
    • A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
    • A vocab file (vocab.txt) to map WordPiece to word id.
    • A config file (bert_config.json) which specifies the hyperparameters of the model.
  • These models are all released under the same license as the source code (Apache 2.0).
  • For information about the Multilingual and Chinese model, see:
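
  • A minimal loading sketch for an unpacked checkpoint (assumes the repo's tokenization.py and modeling.py are importable; the directory path is a placeholder for wherever the .zip was extracted):

    ```python
    import modeling      # from the BERT repo
    import tokenization  # from the BERT repo

    # Placeholder path: point this at the unpacked BERT-Base, Cased directory.
    BERT_DIR = "cased_L-12_H-768_A-12"

    # For a Cased model, do_lower_case must be False so true case and accents survive.
    tokenizer = tokenization.FullTokenizer(
        vocab_file=BERT_DIR + "/vocab.txt", do_lower_case=False)

    tokens = tokenizer.tokenize("John Smith lives in Montréal.")
    ids = tokenizer.convert_tokens_to_ids(tokens)  # vocab.txt maps WordPiece -> id

    # bert_config.json holds the model hyperparameters (layers, hidden size, heads, ...).
    config = modeling.BertConfig.from_json_file(BERT_DIR + "/bert_config.json")
    print(tokens, ids, config.num_hidden_layers)
    ```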

  • Reference paper:

    • 【2019-08-25】Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

      1) Model downloads

  • The standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher (a minimal distillation sketch follows the table below).

  • Goal: to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
  • This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking). You can download all 24 from:
  • Or download individually from the table below:

    |      | H=128             | H=256             | H=512               | H=768              |
    | ---- | ----------------- | ----------------- | ------------------- | ------------------ |
    | L=2  | 2/128 (BERT-Tiny) | 2/256             | 2/512               | 2/768              |
    | L=4  | 4/128             | 4/256 (BERT-Mini) | 4/512 (BERT-Small)  | 4/768              |
    | L=6  | 6/128             | 6/256             | 6/512               | 6/768              |
    | L=8  | 8/128             | 8/256             | 8/512 (BERT-Medium) | 8/768              |
    | L=10 | 10/128            | 10/256            | 10/512              | 10/768             |
    | L=12 | 12/128            | 12/256            | 12/512              | 12/768 (BERT-Base) |

    • Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.
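
  • A minimal soft-label distillation sketch (a generic TF1-style loss to illustrate the idea, not necessarily the exact recipe from the paper; the temperature value is an arbitrary placeholder):

    ```python
    import tensorflow as tf  # TF1-style, matching the rest of the repo

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """The student is trained to match the teacher's temperature-softened
        output distribution (soft labels) instead of, or alongside, hard labels."""
        teacher_probs = tf.nn.softmax(teacher_logits / temperature)
        student_log_probs = tf.nn.log_softmax(student_logits / temperature)
        # Cross-entropy between teacher soft labels and student predictions.
        return -tf.reduce_mean(
            tf.reduce_sum(teacher_probs * student_log_probs, axis=-1))
    ```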

      2) GLUE scores for the different models

  • Here are the corresponding GLUE scores on the test set:

    | Model       | Score | CoLA | SST-2 | MRPC      | STS-B     | QQP       | MNLI-m | MNLI-mm | QNLI(v2) | RTE  | WNLI | AX   |
    | ----------- | ----- | ---- | ----- | --------- | --------- | --------- | ------ | ------- | -------- | ---- | ---- | ---- |
    | BERT-Tiny   | 64.2  | 0.0  | 83.2  | 81.1/71.1 | 74.3/73.6 | 62.2/83.4 | 70.2   | 70.3    | 81.5     | 57.2 | 62.3 | 21.0 |
    | BERT-Mini   | 65.8  | 0.0  | 85.9  | 81.1/71.8 | 75.4/73.3 | 66.4/86.2 | 74.8   | 74.3    | 84.1     | 57.9 | 62.3 | 26.1 |
    | BERT-Small  | 71.2  | 27.8 | 89.7  | 83.4/76.2 | 78.8/77.0 | 68.1/87.0 | 77.6   | 77.0    | 86.4     | 61.8 | 62.3 | 28.6 |
    | BERT-Medium | 73.5  | 38.0 | 89.6  | 86.6/81.6 | 80.4/78.4 | 69.6/87.9 | 80.0   | 79.1    | 87.7     | 62.2 | 62.3 | 30.5 |

  • For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs (a sweep sketch follows the lists):

    • batch sizes: 8, 16, 32, 64, 128
    • learning rates: 3e-4, 1e-4, 5e-5, 3e-5
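
  • A hedged sweep sketch over that grid, launching run_classifier.py once per setting (flag names follow the README's fine-tuning example; the task and all paths are placeholders):

    ```python
    import itertools
    import subprocess

    BERT_DIR = "uncased_L-12_H-768_A-12"   # placeholder: unpacked model directory
    GLUE_DIR = "glue_data"                 # placeholder: GLUE data directory

    for bs, lr in itertools.product([8, 16, 32, 64, 128], [3e-4, 1e-4, 5e-5, 3e-5]):
        subprocess.run([
            "python", "run_classifier.py",
            "--task_name=MRPC", "--do_train=true", "--do_eval=true",
            "--data_dir=%s/MRPC" % GLUE_DIR,
            "--vocab_file=%s/vocab.txt" % BERT_DIR,
            "--bert_config_file=%s/bert_config.json" % BERT_DIR,
            "--init_checkpoint=%s/bert_model.ckpt" % BERT_DIR,
            "--train_batch_size=%d" % bs,
            "--learning_rate=%g" % lr,
            "--num_train_epochs=4.0",
            "--output_dir=/tmp/sweep_bs%d_lr%g" % (bs, lr),
        ], check=True)
    ```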

      【2019-05-31】Whole Word Masking Models

  • This is a release of several new models which were the result of an improvement in the pre-processing code.

  • In the original pre-processing code, we randomly select WordPiece tokens to mask. The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same. For example:

    ```
    Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head

    Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

    Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head
    ```

  • The training is identical: we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces. (A minimal masking sketch follows this list.)
  • This can be enabled during data generation by passing the flag --do_whole_word_mask=True to create_pretraining_data.py.
  • Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models.
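
  • A minimal illustration of the idea (a hypothetical helper, not the exact logic inside create_pretraining_data.py):

    ```python
    import random

    def whole_word_mask(tokens, mask_prob=0.15, rng=None):
        """Group WordPieces into words (a '##' piece continues the previous
        word), then mask every piece of each sampled word together."""
        rng = rng or random.Random(12345)
        word_spans = []
        for i, tok in enumerate(tokens):
            if tok.startswith("##") and word_spans:
                word_spans[-1].append(i)   # continuation of the previous word
            else:
                word_spans.append([i])     # start of a new word
        num_to_mask = max(1, int(round(len(tokens) * mask_prob)))
        rng.shuffle(word_spans)
        output, masked = list(tokens), 0
        for span in word_spans:
            if masked >= num_to_mask:
                break
            for i in span:                 # mask the whole word at once
                output[i] = "[MASK]"
            masked += len(span)
        return output

    tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
    print(" ".join(whole_word_mask(tokens)))
    ```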

【2019-02-07】TfHub Module

  • BERT has been uploaded to TensorFlow Hub. See run_classifier_with_tfhub.py (in the official GitHub repo) for an example of how to use the TF Hub module, or run an example in the browser on Colab.
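
  • A minimal TF1-style usage sketch (mirrors what run_classifier_with_tfhub.py does internally; the hub URL shown is the uncased BERT-Base module and the sequence length is arbitrary):

    ```python
    import tensorflow as tf       # TF1 graph mode
    import tensorflow_hub as hub

    BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
    MAX_SEQ_LENGTH = 128

    input_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])
    input_mask = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])
    segment_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])

    bert_module = hub.Module(BERT_MODEL_HUB, trainable=True)
    bert_outputs = bert_module(
        inputs=dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids),
        signature="tokens", as_dict=True)

    pooled_output = bert_outputs["pooled_output"]      # [batch, hidden], for classification
    sequence_output = bert_outputs["sequence_output"]  # [batch, seq_len, hidden], per-token
    ```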

  • We uploaded a new multilingual model which does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization), and additionally includes Thai and Mongolian.

  • It is recommended to use this version for developing multilingual models, especially on languages with non-Latin alphabets.
  • This does not require any code changes, and can be downloaded here:

  • We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is currently 1st place on the leaderboard by 3%. See the 【SQuAD 2.0】 section.

【2018-11-05】Third-party PyTorch and Chainer versions of BERT

3、Reference links