- 1、Pre-trained models
- 2、BERT update history (with model download links)
- 3、Reference links
1、Pre-trained models
- We are releasing the BERT-Base and BERT-Large models from the paper.
  - BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  - BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  - BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  - BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  - BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  - BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  - BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  - BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  - BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
- Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers.
  - Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).
- Cased means that the true case and accent markers are preserved.
  - When using a cased model, make sure to pass **--do_lower_case=False** to the training scripts. (Or pass **do_lower_case=False** directly to **FullTokenizer** if you're using your own script.)
- Each .zip file contains three items (a short loading sketch appears at the end of this section):
  - A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
  - A vocab file (vocab.txt) to map WordPiece to word id.
  - A config file (bert_config.json) which specifies the hyperparameters of the model.
- These models are all released under the same license as the source code (Apache 2.0).
For information about the Multilingual and Chinese model, see:
- https://github.com/google-research/bert/blob/master/multilingual.md
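
As a quick orientation for the three files above, here is a minimal loading sketch. It is an assumption, not the official quick-start: it presumes the google-research/bert repo is on the Python path, and the directory name below is a placeholder for wherever you unpacked the .zip.

```python
# Minimal loading sketch (an assumption, not the official quick-start).
import os
import modeling      # modeling.py from the BERT repo
import tokenization  # tokenization.py from the BERT repo

BERT_DIR = "uncased_L-12_H-768_A-12"  # hypothetical path to an unpacked .zip

# bert_config.json -> model hyperparameters (hidden size, layers, heads, ...)
bert_config = modeling.BertConfig.from_json_file(
    os.path.join(BERT_DIR, "bert_config.json"))

# vocab.txt -> WordPiece tokenizer; pass do_lower_case=False for the Cased models
tokenizer = tokenization.FullTokenizer(
    vocab_file=os.path.join(BERT_DIR, "vocab.txt"),
    do_lower_case=True)

# bert_model.ckpt -> pre-trained weights; the training scripts take this path
# via --init_checkpoint
init_checkpoint = os.path.join(BERT_DIR, "bert_model.ckpt")

print(bert_config.hidden_size)                          # 768 for BERT-Base
print(tokenizer.tokenize("John Smith lives in Vancouver."))
```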
2、BERT update history (with model download links)
【2020-03-11】Smaller BERT Models
Reference paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (https://arxiv.org/abs/1908.08962)
The standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher (a toy distillation loss is sketched at the end of this entry).
- Goal: to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
- This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking). You can download all 24 at once, or individually from the table below:
|      | H=128             | H=256             | H=512               | H=768              |
| ---- | ----------------- | ----------------- | ------------------- | ------------------ |
| L=2  | 2/128 (BERT-Tiny) | 2/256             | 2/512               | 2/768              |
| L=4  | 4/128             | 4/256 (BERT-Mini) | 4/512 (BERT-Small)  | 4/768              |
| L=6  | 6/128             | 6/256             | 6/512               | 6/768              |
| L=8  | 8/128             | 8/256             | 8/512 (BERT-Medium) | 8/768              |
| L=10 | 10/128            | 10/256            | 10/512              | 10/768             |
| L=12 | 12/128            | 12/256            | 12/512              | 12/768 (BERT-Base) |

Here are the corresponding GLUE scores on the test set:

| Model       | Score | CoLA | SST-2 | MRPC      | STS-B     | QQP       | MNLI-m | MNLI-mm | QNLI(v2) | RTE  | WNLI | AX   |
| ----------- | ----- | ---- | ----- | --------- | --------- | --------- | ------ | ------- | -------- | ---- | ---- | ---- |
| BERT-Tiny   | 64.2  | 0.0  | 83.2  | 81.1/71.1 | 74.3/73.6 | 62.2/83.4 | 70.2   | 70.3    | 81.5     | 57.2 | 62.3 | 21.0 |
| BERT-Mini   | 65.8  | 0.0  | 85.9  | 81.1/71.8 | 75.4/73.3 | 66.4/86.2 | 74.8   | 74.3    | 84.1     | 57.9 | 62.3 | 26.1 |
| BERT-Small  | 71.2  | 27.8 | 89.7  | 83.4/76.2 | 78.8/77.0 | 68.1/87.0 | 77.6   | 77.0    | 86.4     | 61.8 | 62.3 | 28.6 |
| BERT-Medium | 73.5  | 38.0 | 89.6  | 86.6/81.6 | 80.4/78.4 | 69.6/87.9 | 80.0   | 79.1    | 87.7     | 62.2 | 62.3 | 30.5 |

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
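
The release does not prescribe a specific distillation objective; as a sketch only, one common formulation trains the student to match the teacher's softened output distribution:

```python
# Toy sketch of a distillation loss (an assumption, not the released recipe).
import tensorflow as tf  # TF1-style, matching the rest of the BERT codebase

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
  """Cross-entropy between teacher soft targets and student predictions."""
  # Soft targets from the fine-tuned teacher; no gradient flows into the teacher.
  teacher_probs = tf.stop_gradient(tf.nn.softmax(teacher_logits / temperature))
  student_log_probs = tf.nn.log_softmax(student_logits / temperature)
  # Mean over the batch of the per-example cross-entropy.
  return -tf.reduce_mean(
      tf.reduce_sum(teacher_probs * student_log_probs, axis=-1))
```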
【2019-05-31】Whole Word Masking Models

This is a release of several new models which are the result of an improvement to the pre-processing code.
- In the original pre-processing code, we randomly select WordPiece tokens to mask. The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same. For example:

```
Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head
Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head
```
- The training is identical: we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.
- This can be enabled during data generation by passing the flag --do_whole_word_mask=True to create_pretraining_data.py (a rough sketch of the selection logic appears at the end of this entry).
- Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models.
- We only include BERT-Large models. When using these models, please make it clear in the paper that you are using the Whole Word Masking variant of BERT-Large.
  - BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  - BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
    - https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip

| Model | SQUAD 1.1 F1/EM | Multi NLI Accuracy |
| ---- | ---- | ---- |
| BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05 |
| BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7 | 87.07 |
| BERT-Large, Cased (Original) | 91.5/84.8 | 86.09 |
| BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46 |
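
As referenced above, here is a rough sketch of how whole-word selection differs from per-piece selection. It is an illustration only, not the repo's create_pretraining_data.py; it relies on the "##" continuation prefix to group WordPieces into words.

```python
# Sketch: mask whole words (all of their WordPieces) instead of individual pieces.
import random

def whole_word_mask(tokens, num_to_mask, mask_token="[MASK]"):
  # Build word groups: a "##..." piece always extends the previous group.
  groups = []
  for i, tok in enumerate(tokens):
    if tok.startswith("##") and groups:
      groups[-1].append(i)
    else:
      groups.append([i])
  random.shuffle(groups)
  output = list(tokens)
  masked = 0
  for group in groups:
    if masked >= num_to_mask:
      break
    for i in group:            # mask every piece of the chosen word at once
      output[i] = mask_token
    masked += len(group)
  return output

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(whole_word_mask(tokens, num_to_mask=4))
```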
【2019-02-07】TfHub Module
BERT has been uploaded to TensorFlow Hub. See run_classifier_with_tfhub.py (in the official GitHub repo) for an example of how to use the TF Hub module, or run an example in the browser on Colab; a minimal usage sketch follows the links below.

- TensorFlow Hub: https://tfhub.dev/
- Colab: https://colab.sandbox.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb
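
A minimal sketch of how such a TF Hub module is consumed (TF1 + tensorflow_hub, in the style of run_classifier_with_tfhub.py). The module handle below is an example assumption; check tfhub.dev for the exact URL and version you need.

```python
# Sketch: build a graph around a BERT TF Hub module (TF1 style).
import tensorflow as tf
import tensorflow_hub as hub

BERT_MODULE = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"  # example handle

input_ids = tf.placeholder(tf.int32, shape=[None, 128])
input_mask = tf.placeholder(tf.int32, shape=[None, 128])
segment_ids = tf.placeholder(tf.int32, shape=[None, 128])

bert_module = hub.Module(BERT_MODULE, trainable=True)
bert_outputs = bert_module(
    inputs=dict(input_ids=input_ids,
                input_mask=input_mask,
                segment_ids=segment_ids),
    signature="tokens",
    as_dict=True)

pooled_output = bert_outputs["pooled_output"]      # [batch, hidden], for classification
sequence_output = bert_outputs["sequence_output"]  # [batch, seq_len, hidden], for tagging
```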
【2018-11-23】Un-normalized multilingual model + Thai + Mongolian
We uploaded a new multilingual model which does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization), and additionally includes Thai and Mongolian.
- It is recommended to use this version for developing multilingual models, especially on languages with non-Latin alphabets.
This does not require any code changes, and can be downloaded here:
- BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
【2018-11-15】SOTA SQuAD 2.0 System
We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is currently 1st place on the leaderboard by 3%. See the 【SQuAD 2.0】 section.
【2018-11-05】Third-party PyTorch and Chainer versions of BERT
NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. Sosuke Kobayashi also made a Chainer version of BERT available (thanks!). We were not involved in the creation or maintenance of the PyTorch implementation, so please direct any questions towards the authors of that repository. A brief usage sketch of the PyTorch package follows the links below.
- PyTorch version:https://github.com/huggingface/pytorch-pretrained-BERT
- Chainer version:https://github.com/soskek/bert-chainer
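
For orientation, a minimal sketch of the historical pytorch-pretrained-BERT API referred to above (the package has since been renamed, so treat this as illustrative only):

```python
# Sketch: encode a sentence with the historical pytorch-pretrained-BERT package.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

tokens = ["[CLS]"] + tokenizer.tokenize("John Smith lives in Vancouver.") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # Returns the per-layer hidden states and the pooled [CLS] representation.
    encoded_layers, pooled_output = model(input_ids)
print(len(encoded_layers), pooled_output.shape)
```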
【2018-11-03】Multilingual and Chinese models
We have made two new BERT models available:
- BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
- We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it. However, we did not change the tokenization API (a small tokenization sketch follows below).
- For more details on the Multilingual models, see: https://github.com/google-research/bert/blob/master/multilingual.md
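
A small sketch of the character-level behavior described above, using the repo's tokenization.py with a locally downloaded Multilingual Cased vocab (the path is a placeholder assumption):

```python
# Sketch: tokenization.py splits CJK characters before WordPiece, so Chinese
# text comes out roughly character by character.
import tokenization  # from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",  # hypothetical local path
    do_lower_case=False)  # cased model, per the note in section 1

print(tokenizer.tokenize("机器学习很有趣"))
# Expected: roughly one token per Chinese character, e.g. ['机', '器', '学', ...]
```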
3、Reference links
- Official GitHub repo: https://github.com/google-research/bert
