- In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtain pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model. This should also mitigate most of the out-of-memory issues.
- As an example, we include the script `extract_features.py`, which can be used like this:

```shell
# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8
```
- This will create a JSON file (one line per line of input) containing the BERT activations from each Transformer layer specified by `layers` (-1 is the final hidden layer of the Transformer, etc.).
- Note that this script will produce very large output files (by default, around 15kb for every input token).
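To give a concrete sense of the output format, here is a minimal sketch of reading the JSONL file back in Python. The key names used below (`features`, `token`, `layers`, `values`) are assumptions about the script's output schema; inspect a line of your own output if they differ.

```python
import json

# Minimal sketch: read the first line of the JSONL output and build one
# feature vector per WordPiece token by concatenating the requested layers.
# The key names ("features", "token", "layers", "values") are assumptions
# about the output schema; check a line of your own output if they differ.
with open("/tmp/output.jsonl") as f:
    example = json.loads(f.readline())

for feature in example["features"]:
    token = feature["token"]  # WordPiece token, e.g. "[CLS]", "jim", "##son"
    # Concatenate layers -1,-2,-3,-4 into a single 4 * hidden_size vector.
    vector = [v for layer in feature["layers"] for v in layer["values"]]
    print(token, len(vector))  # 3072 for BERT-Base (4 x 768)
```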
- If you need to maintain alignment between the original and tokenized words (for projecting training labels), see the Tokenization section.
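As a rough illustration of the kind of bookkeeping that section describes, the sketch below records, for each original word, the index of its first WordPiece in the BERT input. The example words and the exact subword split are illustrative and depend on the vocabulary used.

```python
import tokenization  # this repo's tokenizer module

# Illustrative only: map each original word to the position of its first
# WordPiece so that word-level labels can be projected onto subword tokens.
tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt", do_lower_case=True)  # path is a placeholder

orig_tokens = ["John", "Johanson", "'s", "house"]

bert_tokens = ["[CLS]"]
orig_to_tok_map = []
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# With the BERT-Base vocabulary this yields something like:
# bert_tokens     == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
```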
- Note: You may see a message like `Could not find trained model in model_dir: /tmp/tmpuB5g5c, running initialization to predict.` This message is expected; it just means that we are using the `init_from_checkpoint()` API rather than the saved model API. If you don't specify a checkpoint or specify an invalid checkpoint, this script will complain.