- 1、Pre-trained models
- 2、BERT release history (with model download links)
- 3、Reference links
1、Pre-trained models
- We are releasing the BERT-Base and BERT-Large models from the paper.
- BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
- Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers.
  - Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).
- Cased means that the true case and accent markers are preserved.
  - When using a cased model, make sure to pass **--do_lower=False** to the training scripts. (Or pass **do_lower_case=False** directly to **FullTokenizer** if you're using your own script; a minimal usage sketch follows this list.)
- Each .zip file contains three items:
  - A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
  - A vocab file (vocab.txt) to map WordPiece to word id.
  - A config file (bert_config.json) which specifies the hyperparameters of the model.
- These models are all released under the same license as the source code (Apache 2.0).
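As a minimal sketch of how these files are typically consumed, assuming the official repo's modeling.py and tokenization.py are importable and a cased checkpoint has been unzipped to ./cased_L-12_H-768_A-12 (the path is illustrative):

```python
import modeling      # modeling.py from the official BERT repo
import tokenization  # tokenization.py from the official BERT repo

# bert_config.json describes the architecture of the released checkpoint.
config = modeling.BertConfig.from_json_file("cased_L-12_H-768_A-12/bert_config.json")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 12 768 12

# vocab.txt drives WordPiece tokenization; for a Cased model keep do_lower_case=False
# so that true case and accent markers are preserved.
tokenizer = tokenization.FullTokenizer(
    vocab_file="cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False,
)
print(tokenizer.tokenize("John Smith lives in Zürich."))
```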
For information about the Multilingual and Chinese model, see:
- https://github.com/google-research/bert/blob/master/multilingual.md
2、BERT release history (with model download links)
【2020-03-11】Smaller BERT Models
Reference paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (https://arxiv.org/abs/1908.08962)
The standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher (a minimal distillation sketch appears after the tables below).
- Goal: to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
- This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking). You can download all 24 at once from the official repo, or download them individually from the table below:

|      | H=128             | H=256             | H=512               | H=768              |
| ---- | ----------------- | ----------------- | ------------------- | ------------------ |
| L=2  | 2/128 (BERT-Tiny) | 2/256             | 2/512               | 2/768              |
| L=4  | 4/128             | 4/256 (BERT-Mini) | 4/512 (BERT-Small)  | 4/768              |
| L=6  | 6/128             | 6/256             | 6/512               | 6/768              |
| L=8  | 8/128             | 8/256             | 8/512 (BERT-Medium) | 8/768              |
| L=10 | 10/128            | 10/256            | 10/512              | 10/768             |
| L=12 | 12/128            | 12/256            | 12/512              | 12/768 (BERT-Base) |

Here are the corresponding GLUE scores on the test set:

| Model       | Score | CoLA | SST-2 | MRPC      | STS-B     | QQP       | MNLI-m | MNLI-mm | QNLI(v2) | RTE  | WNLI | AX   |
| ----------- | ----- | ---- | ----- | --------- | --------- | --------- | ------ | ------- | -------- | ---- | ---- | ---- |
| BERT-Tiny   | 64.2  | 0.0  | 83.2  | 81.1/71.1 | 74.3/73.6 | 62.2/83.4 | 70.2   | 70.3    | 81.5     | 57.2 | 62.3 | 21.0 |
| BERT-Mini   | 65.8  | 0.0  | 85.9  | 81.1/71.8 | 75.4/73.3 | 66.4/86.2 | 74.8   | 74.3    | 84.1     | 57.9 | 62.3 | 26.1 |
| BERT-Small  | 71.2  | 27.8 | 89.7  | 83.4/76.2 | 78.8/77.0 | 68.1/87.0 | 77.6   | 77.0    | 86.4     | 61.8 | 62.3 | 28.6 |
| BERT-Medium | 73.5  | 38.0 | 89.6  | 86.6/81.6 | 80.4/78.4 | 69.6/87.9 | 80.0   | 79.1    | 87.7     | 62.2 | 62.3 | 30.5 |

For each task, we selected the best fine-tuning hyperparameters from the candidate lists given in the official README, and trained for 4 epochs.
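The distillation sketch promised above: a minimal soft-label distillation loss, assuming nothing beyond NumPy. This is an illustration of the general idea (student trained on a teacher's softened predictions), not the exact recipe from the paper; the temperature T and the toy logits are made up for the example.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's softened predictions and the student's."""
    teacher_probs = softmax(teacher_logits, T)
    student_log_probs = np.log(softmax(student_logits, T) + 1e-12)
    return -(teacher_probs * student_log_probs).sum(axis=-1).mean()

# Toy logits for one example: teacher = a fine-tuned BERT-Large-style model,
# student = a smaller model such as BERT-Tiny. Values are illustrative only.
teacher = np.array([[4.0, 1.0, -2.0]])
student = np.array([[2.5, 0.5, -1.0]])
print(distillation_loss(student, teacher))
```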
【2019-05-31】Whole Word Masking Models
This is a release of several new models which were the result of an improvement to the pre-processing code.
- In the original pre-processing code, we randomly select WordPiece tokens to mask. The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same. For example (a minimal masking sketch appears after this list):
```
Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head
Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head
```
- The training is identical: we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too "easy" for words that had been split into multiple WordPieces.
- This can be enabled during data generation by passing the flag --do_whole_word_mask=True to create_pretraining_data.py.
. - Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models.
- We only include BERT-Large models. When using these models, please make it clear in the paper that you are using the Whole Word Masking variant of BERT-Large.
  - BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  - BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
    - https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip

| Model                                    | SQUAD 1.1 F1/EM | Multi NLI Accuracy |
| ---------------------------------------- | --------------- | ------------------ |
| BERT-Large, Uncased (Original)           | 91.0/84.3       | 86.05              |
| BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7       | 87.07              |
| BERT-Large, Cased (Original)             | 91.5/84.8       | 86.09              |
| BERT-Large, Cased (Whole Word Masking)   | 92.9/86.7       | 86.46              |
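The masking sketch promised above: a minimal, self-contained illustration of grouping WordPieces into whole words (continuation pieces start with "##") and masking a whole word at once. This is not the actual logic in create_pretraining_data.py, just a sketch of the idea; the token string and num_to_mask are taken from / chosen for the example above.

```python
import random

def whole_word_mask(tokens, num_to_mask, rng=None):
    """Mask roughly num_to_mask WordPiece tokens, but always whole words at a time."""
    rng = rng or random.Random(12345)
    # Group WordPiece indices into whole-word spans ("##" marks a continuation piece).
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    rng.shuffle(spans)
    masked = list(tokens)
    n_masked = 0
    for span in spans:
        if n_masked >= num_to_mask:
            break
        for i in span:          # mask every piece of the chosen word
            masked[i] = "[MASK]"
        n_masked += len(span)
    return masked

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(whole_word_mask(tokens, num_to_mask=4))
```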
【2019-02-07】TfHub Module
BERT has been uploaded to TensorFlow Hub. See run_classifier_with_tfhub.py (in the official GitHub repo) for an example of how to use the TF Hub module, or run an example in the browser on Colab (a minimal usage sketch follows).
- TensorFlow Hub: https://tfhub.dev/
- Colab: https://colab.sandbox.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb
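The usage sketch promised above, mirroring the TF1-era pattern used in run_classifier_with_tfhub.py. It assumes TensorFlow 1.x and tensorflow_hub are installed; the module handle and sequence length of 128 are illustrative choices, not requirements.

```python
import tensorflow as tf          # assumes tensorflow 1.x
import tensorflow_hub as hub

# Illustrative module handle (uncased BERT-Base) and max sequence length.
BERT_HUB_URL = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
MAX_SEQ_LEN = 128

bert_module = hub.Module(BERT_HUB_URL, trainable=True)
bert_inputs = dict(
    input_ids=tf.placeholder(tf.int32, [None, MAX_SEQ_LEN]),
    input_mask=tf.placeholder(tf.int32, [None, MAX_SEQ_LEN]),
    segment_ids=tf.placeholder(tf.int32, [None, MAX_SEQ_LEN]),
)
bert_outputs = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)

pooled_output = bert_outputs["pooled_output"]      # [batch, hidden] sentence-level vector
sequence_output = bert_outputs["sequence_output"]  # [batch, seq_len, hidden] per-token vectors
```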
【2018-11-23】Un-normalized multilingual model + Thai + Mongolian
We uploaded a new multilingual model which does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization), and additionally includes Thai and Mongolian.
- It is recommended to use this version for developing multilingual models, especially on languages with non-Latin alphabets.
This does not require any code changes, and can be downloaded here:
- BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
【2018-11-15】SOTA SQuAD 2.0 System
We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is currently 1st place on the leaderboard by 3%. See the 【SQuAD 2.0】 section.
【2018-11-05】Third-party PyTorch and Chainer versions of BERT
NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. Sosuke Kobayashi also made a Chainer version of BERT available (Thanks!) We were not involved in the creation or maintenance of the PyTorch implementation so please direct any questions towards the authors of that repository.
- PyTorch version:https://github.com/huggingface/pytorch-pretrained-BERT
- Chainer version:https://github.com/soskek/bert-chainer
【2018-11-03】Multilingual and Chinese models
We have made two new BERT models available:
- BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
- We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it. However, we did not change the tokenization API (a minimal tokenization sketch follows this list).
- For more details about the Multilingual models, see:
  - https://github.com/google-research/bert/blob/master/multilingual.md
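The tokenization sketch promised above: a quick illustration of character-based Chinese tokenization, assuming tokenization.py from the official repo is importable and the Chinese checkpoint has been unzipped to ./chinese_L-12_H-768_A-12 (the path, sample sentence, and expected output are illustrative).

```python
import tokenization  # tokenization.py from the official BERT repo

# vocab.txt comes from the unzipped Chinese checkpoint (path is illustrative).
tokenizer = tokenization.FullTokenizer(
    vocab_file="chinese_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True,
)

# BasicTokenizer splits CJK characters individually, so the output is roughly
# one token per character: ['我', '爱', '自', '然', '语', '言', '处', '理']
print(tokenizer.tokenize("我爱自然语言处理"))
```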
3、Reference links
- Official GitHub repo: https://github.com/google-research/bert