- In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtain pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model. This should also mitigate most of the out-of-memory issues.
- As an example, we include the script `extract_features.py`, which can be used like this:

```shell
# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8
```
- This will create a JSON file (one line per line of input) containing the BERT activations from each Transformer layer specified by `layers` (-1 is the final hidden layer of the Transformer, etc.).
- Note that this script will produce very large output files (by default, around 15kb for every input token).
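To give a concrete sense of the output format, here is a minimal sketch of reading the JSONL file back in Python. The key names used below (`features`, `token`, `layers`, `values`) are assumptions about the script's output schema; inspect a line of your own output if they differ.

```python
import json

# Minimal sketch: read the first line of the JSONL output and build one
# feature vector per WordPiece token by concatenating the requested layers.
# The key names ("features", "token", "layers", "values") are assumptions
# about the output schema; check a line of your own output if they differ.
with open("/tmp/output.jsonl") as f:
    example = json.loads(f.readline())

for feature in example["features"]:
    token = feature["token"]  # WordPiece token, e.g. "[CLS]", "jim", "##son"
    # Concatenate layers -1,-2,-3,-4 into a single 4 * hidden_size vector.
    vector = [v for layer in feature["layers"] for v in layer["values"]]
    print(token, len(vector))  # 3072 for BERT-Base (4 x 768)
```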
- If you need to maintain alignment between the original and tokenized words (for projecting training labels), see the Tokenization section.
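As a rough illustration of the kind of bookkeeping that section describes, the sketch below records, for each original word, the index of its first WordPiece in the BERT input. The example words and the exact subword split are illustrative and depend on the vocabulary used.

```python
import tokenization  # this repo's tokenizer module

# Illustrative only: map each original word to the position of its first
# WordPiece so that word-level labels can be projected onto subword tokens.
tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt", do_lower_case=True)  # path is a placeholder

orig_tokens = ["John", "Johanson", "'s", "house"]

bert_tokens = ["[CLS]"]
orig_to_tok_map = []
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# With the BERT-Base vocabulary this yields something like:
# bert_tokens     == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
```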
- Note: You may see a message like `Could not find trained model in model_dir: /tmp/tmpuB5g5c, running initialization to predict.` This message is expected; it just means that we are using the `init_from_checkpoint()` API rather than the saved model API. If you don't specify a checkpoint or specify an invalid checkpoint, this script will complain.