1、Introduction to BERT

  • BERT (Bidirectional Encoder Representations from Transformers) is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
  • BERT is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
    • Unsupervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.
  • Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional.
  • BERT was built upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit — but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). For example, in the sentence I made a bank deposit the unidirectional representation of bank is only based on I made a but not deposit. Some previous work does combine the representations from separate left-context and right-context models, but only in a “shallow” manner. BERT represents “bank” using both its left and right context — I made a … deposit — starting from the very bottom of a deep neural network, so it is deeply bidirectional.
    • Semi-supervised Sequence Learning: 【2015-11-04】Semi-supervised Sequence Learning
    • Generative Pre-Training: https://openai.com/blog/language-unsupervised/
    • ELMo: https://allenai.org/allennlp/software/elmo
    • ULMFit: link not found
  • BERT uses a simple approach for this: We mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

    • Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
    • Labels: [MASK1] = store; [MASK2] = gallon
    • Transformer
      • 【2017-12-06】Attention Is All You Need
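    • A minimal sketch of this masking step, assuming whitespace-tokenized input; the released BERT code additionally replaces some selected positions with a random token or leaves them unchanged instead of always using [MASK], which is omitted here, and the function name is illustrative:

      ```python
      import random

      def create_masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
          """Mask out roughly 15% of the tokens and record the labels to predict."""
          masked_tokens = list(tokens)
          labels = {}  # position -> original token the model must predict
          for i, token in enumerate(tokens):
              if random.random() < mask_prob:
                  labels[i] = token
                  masked_tokens[i] = mask_token
          return masked_tokens, labels

      tokens = "the man went to the store . he bought a gallon of milk .".split()
      masked, labels = create_masked_lm_example(tokens)
      print(masked)  # e.g. ['the', 'man', 'went', 'to', 'the', '[MASK]', ...]
      print(labels)  # e.g. {5: 'store', 10: 'gallon'}
      ```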
  • Paper (describes BERT in detail and provides full results on a number of tasks)

    • 【2019-05-24】BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • In order to learn relationships between sentences, we also train on a simple task which can be generated from any monolingual corpus: Given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?

    ```
    Sentence A: the man went to the store .
    Sentence B: he bought a gallon of milk .
    Label: IsNextSentence

    Sentence A: the man went to the store .
    Sentence B: penguins are flightless .
    Label: NotNextSentence
    ```
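  • A minimal sketch of generating such sentence-pair examples from an ordered list of sentences; the 50/50 split matches the description above, but the function and variable names are illustrative, not the exact BERT pre-training code:

    ```python
    import random

    def create_nsp_examples(sentences, num_examples):
        """Build (sentence_a, sentence_b, label) triples from an ordered corpus."""
        examples = []
        for _ in range(num_examples):
            i = random.randrange(len(sentences) - 1)
            sentence_a = sentences[i]
            if random.random() < 0.5:
                sentence_b, label = sentences[i + 1], "IsNextSentence"
            else:
                # May occasionally draw the true next sentence; the released
                # code avoids this by sampling from a different document.
                sentence_b, label = random.choice(sentences), "NotNextSentence"
            examples.append((sentence_a, sentence_b, label))
        return examples

    corpus = [
        "the man went to the store .",
        "he bought a gallon of milk .",
        "penguins are flightless .",
    ]
    for a, b, label in create_nsp_examples(corpus, num_examples=3):
        print(a, "|", b, "|", label)
    ```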

  • We then train a large model (12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps), and that’s BERT.
  • Using BERT has two stages: Pre-training and fine-tuning.
    • Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language. Most NLP researchers will never need to pre-train their own model from scratch.
    • Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
      • SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
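    • A minimal fine-tuning sketch for a sentence-level classification task. It assumes the Hugging Face `transformers` and `torch` packages, which are not what the paper used (the quoted results come from the original TensorFlow code in the BERT repository), so treat this as an illustrative sketch rather than the authors' script:

      ```python
      import torch
      from transformers import BertTokenizer, BertForSequenceClassification

      # Load the pre-trained checkpoint once; fine-tuning starts from these weights.
      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
      model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

      # Toy sentence-level task (SST-2-style polarity labels).
      texts = ["a touching and funny film", "a complete waste of time"]
      labels = torch.tensor([1, 0])
      batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

      optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
      model.train()
      for _ in range(3):  # fine-tuning needs only a few passes over the data
          loss = model(**batch, labels=labels).loss
          loss.backward()
          optimizer.step()
          optimizer.zero_grad()
      ```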
  • The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on different tasks with almost no task-specific modifications (see the input-packing sketch after the list below):

    • sentence-level (e.g., SST-2)
    • sentence-pair-level (e.g., MultiNLI)
    • word-level (e.g., NER)
    • span-level (e.g., SQuAD)
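    • A sketch of how the input is packed for the first two cases above, assuming the Hugging Face `transformers` tokenizer for illustration: a single sentence becomes [CLS] tokens [SEP], while a sentence pair becomes [CLS] A [SEP] B [SEP] with segment ids distinguishing A from B. Word-level tasks read out the per-token outputs, and span-level tasks predict answer start/end positions on top of them.

      ```python
      from transformers import BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

      # Sentence-level task (e.g. SST-2): a single segment.
      single = tokenizer("the movie was great")
      print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
      # ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]']

      # Sentence-pair task (e.g. MultiNLI): two segments separated by [SEP],
      # told apart by token_type_ids (segment embeddings).
      pair = tokenizer("the man went to the store .", "he bought a gallon of milk .")
      print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
      print(pair["token_type_ids"])  # 0s for sentence A, 1s for sentence B
      ```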

2、BERT test results on various tasks

  • These results were all obtained with almost no task-specific neural network architecture design.

(a) SQuAD (Stanford Question Answering Dataset) question answering task

  • SQuAD v1.1: https://rajpurkar.github.io/SQuAD-explorer/

    | SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 |
    | --- | --- | --- |
    | 1st Place Ensemble - BERT | 87.4 | 93.2 |
    | 2nd Place Ensemble - nlnet | 86.0 | 91.7 |
    | 1st Place Single Model - BERT | 85.1 | 91.8 |
    | 2nd Place Single Model - nlnet | 83.5 | 90.1 |
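  • EM and F1 in the table are the standard SQuAD metrics (exact match and token-level F1 against the gold answer). A minimal sketch of how they are typically computed for a single prediction follows; the official evaluation script additionally strips punctuation and articles and takes the maximum over multiple gold answers, which is omitted here:

    ```python
    from collections import Counter

    def exact_match(prediction, gold):
        """EM: 1 if the (lightly normalized) prediction equals the gold answer."""
        return int(prediction.strip().lower() == gold.strip().lower())

    def f1_score(prediction, gold):
        """Token-level F1 between the prediction and the gold answer."""
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(exact_match("a gallon of milk", "gallon of milk"))        # 0
    print(round(f1_score("a gallon of milk", "gallon of milk"), 3))  # 0.857
    ```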

(b) Other NLI (natural language inference) tasks

| System | MultiNLI | Question NLI | SWAG |
| --- | --- | --- | --- |
| BERT | 86.7 | 91.1 | 86.3 |
| OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 |

3、Reference links