1、BERT Introduction
- BERT (Bidirectional Encoder Representations from Transformers) is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
- BERT is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
- Unsupervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.
- Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional.
- Context-free models such as word2vec or GloVe generate a single “word embedding” representation for each word in the vocabulary, so bank would have the same representation in bank deposit and river bank.
- Contextual models instead generate a representation of each word that is based on the other words in the sentence (see the sketch below).
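As a concrete illustration (not from the BERT repository itself, which is TensorFlow-based), the sketch below uses the Hugging Face `transformers` library to show that BERT's vector for “bank” changes with the surrounding sentence, whereas a context-free lookup table would return one fixed vector in both cases. The model name and sentences are illustrative choices.

```python
# Sketch: contextual embeddings from BERT vs. a single vector per word.
# Assumes the Hugging Face `transformers` and `torch` packages are installed.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return BERT's top-layer vector for the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]           # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)   # first occurrence
    return hidden[position]

v_deposit = bank_vector("I made a bank deposit.")
v_river = bank_vector("We walked along the river bank.")

# A context-free model (word2vec, GloVe) would give cosine similarity 1.0 here,
# because "bank" maps to one fixed vector; BERT's two vectors differ.
print(torch.cosine_similarity(v_deposit, v_river, dim=0).item())
```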
- BERT was built upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit — but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). For example, in the sentence I made a bank deposit the unidirectional representation of bank is only based on I made a but not deposit. Some previous work does combine the representations from separate left-context and right-context models, but only in a “shallow” manner. BERT represents “bank” using both its left and right context — I made a … deposit — starting from the very bottom of a deep neural network, so it is deeply bidirectional.
- Semi-supervised Sequence Learning
  - 《【2015-11-04】Semi-supervised Sequence Learning》
- Generative Pre-Training
  - [https://openai.com/blog/language-unsupervised/](https://openai.com/blog/language-unsupervised/)
- ELMo
  - [https://allenai.org/allennlp/software/elmo](https://allenai.org/allennlp/software/elmo)
- ULMFit
  - not found
- BERT uses a simple approach for this: we mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

```
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
```

- Transformer
  - 【2017.12.06】Attention is all you need
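The masked-word prediction above can be reproduced in a few lines; the sketch below uses the Hugging Face `transformers` fill-mask pipeline (an illustrative substitute, not the original TensorFlow implementation) and feeds one [MASK] at a time.

```python
# Sketch: masked language modeling with a pre-trained BERT.
# Uses the Hugging Face `transformers` pipeline, not the original TF code.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# One mask per call keeps the output simple; "store" and "gallon" should rank highly.
print(unmasker("the man went to the [MASK] . he bought a gallon of milk ."))
print(unmasker("the man went to the store . he bought a [MASK] of milk ."))
```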
- Paper (describes BERT in detail and provides full results on a number of tasks)
  - 【2019-05-24】BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- In order to learn relationships between sentences, we also train on a simple task which can be generated from any monolingual corpus: given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?

```
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence

Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
```
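For this task a sentence pair is packed into one sequence as [CLS] A [SEP] B [SEP], with segment (token type) ids marking which tokens belong to A or B, and the [CLS] position feeds the next-sentence classifier. The sketch below shows this with Hugging Face `transformers` classes (again an illustrative substitute for the original TensorFlow input pipeline).

```python
# Sketch: next sentence prediction with a pre-trained BERT.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

# Packed as: [CLS] sentence A [SEP] sentence B [SEP]; token_type_ids mark A vs. B.
encoding = tokenizer("the man went to the store .", "penguins are flightless .",
                     return_tensors="pt")
print(encoding["token_type_ids"])

with torch.no_grad():
    logits = model(**encoding).logits
# Index 0 = B really follows A (IsNextSentence), index 1 = B is random (NotNextSentence).
print(torch.softmax(logits, dim=-1))
```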
- We then train a large model (12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps), and that’s BERT.
- BookCorpus
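For reference, the two published sizes are BERT-Base (12 layers, hidden size 768, 12 attention heads, ~110M parameters) and BERT-Large (24 layers, hidden size 1024, 16 attention heads, ~340M parameters). The sketch below expresses these with Hugging Face's `BertConfig`; the released TF checkpoints carry the same hyperparameters in a `bert_config.json` file.

```python
# Sketch: the two model sizes described above, using Hugging Face's BertConfig.
from transformers import BertConfig, BertModel

# BERT-Base: 12 layers, hidden size 768, 12 attention heads (~110M parameters).
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=3072)

# BERT-Large: 24 layers, hidden size 1024, 16 attention heads (~340M parameters).
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4096)

# Randomly initialized encoders; pre-training then runs ~1M update steps
# over Wikipedia + BookCorpus before the weights are useful.
for name, cfg in [("BERT-Base", base), ("BERT-Large", large)]:
    model = BertModel(cfg)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```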
- Using BERT has two stages: Pre-training and fine-tuning.
- Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language. Most NLP researchers will never need to pre-train their own model from scratch.
- Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
- SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
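Fine-tuning simply continues training the whole pre-trained network with a small task-specific output layer on top, typically for 2 to 4 epochs at a small learning rate (the paper recommends 5e-5, 3e-5, or 2e-5). The toy sketch below uses Hugging Face `transformers` with made-up two-example data purely for illustration.

```python
# Sketch: fine-tuning BERT for sentence classification on toy data.
# Hyperparameters follow the paper's recommended ranges; the data is fake.
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)      # paper suggests 5e-5, 3e-5, or 2e-5

batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
for epoch in range(3):                              # paper suggests 2-4 epochs
    outputs = model(**batch, labels=labels)         # loss from the classification head
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")
```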
2、BERT Results

- The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on different tasks with almost no task-specific modifications:
- These results were all obtained with almost no task-specific neural network architecture design (a sketch of how the task heads attach to the same encoder follows the tables below).
(a) SQuAD (Stanford Question Answering Dataset) question answering task

- SQuAD v1.1
  - https://rajpurkar.github.io/SQuAD-explorer/

| SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 |
| --- | --- | --- |
| 1st Place Ensemble - BERT | 87.4 | 93.2 |
| 2nd Place Ensemble - nlnet | 86.0 | 91.7 |
| 1st Place Single Model - BERT | 85.1 | 91.8 |
| 2nd Place Single Model - nlnet | 83.5 | 90.1 |
(b) Other NLI (natural language inference) tasks
| System | MultiNLI | Question NLI | SWAG |
| --- | --- | --- | --- |
| BERT | 86.7 | 91.1 | 86.3 |
| OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 |
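As noted above, adapting BERT to these tasks amounts to adding a thin output layer on the same pre-trained encoder. The sketch below shows this with Hugging Face `transformers` head classes (illustrative naming from that library, not the original repo's run_classifier.py / run_squad.py scripts).

```python
# Sketch: the same pre-trained encoder behind different task heads.
# Class names are Hugging Face `transformers` conventions, used here for illustration.
from transformers import (BertForQuestionAnswering,
                          BertForSequenceClassification)

# SQuAD-style extractive QA: two vectors score every token as an answer start/end.
squad_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# NLI-style classification (e.g., MultiNLI's 3 labels): one linear layer over [CLS].
nli_model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=3)

# In both cases the ~110M-parameter encoder is identical; only the small head differs,
# which is why the paper reports these results with almost no task-specific design.
print(type(squad_model.bert) is type(nli_model.bert))
```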
3、Reference Links

- Official GitHub repo: https://github.com/google-research/bert
- Implemented in TensorFlow; the core building block is the Transformer (see the paper 【2017.12.06】Attention is all you need)