BERT Embedding

BERTEmbedding is based on keras-bert. The embeddings themselves are wrapped in our simple embedding interface so that they can be used like any other embedding.

BERTEmbedding supports BERT variants such as ERNIE, but they need to be loaded from a TensorFlow checkpoint. If you are interested in using ERNIE, just download the TensorFlow version of ERNIE and load it like a BERT embedding, as sketched below.
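For illustration, here is a minimal sketch of loading an ERNIE checkpoint with BERTEmbedding; the folder path is a placeholder for wherever you extracted the downloaded checkpoint:

```python
import kashgari
from kashgari.embeddings import BERTEmbedding

# Placeholder path: point this at the extracted TensorFlow ERNIE checkpoint folder
ernie_model_folder = '/path/to/tensorflow_ernie'

ernie_embedding = BERTEmbedding(ernie_model_folder,
                                task=kashgari.CLASSIFICATION,
                                sequence_length=128)
```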

!!! tip
    When using a pre-trained embedding, remember to use the same tokenization tool that was used with the embedding model. This allows you to access the full power of the embedding.
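For example, the BERT embedding exposes the matching tokenizer (used in full in the classification example below), so the corpus can be split with the exact vocabulary the checkpoint was trained on:

```python
# `bert_embedding` is a BERTEmbedding instance, constructed as in the example below
tokenizer = bert_embedding.tokenizer
tokens = tokenizer.tokenize("Jim Henson was a puppeteer.")
# -> ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '.', '[SEP]']
```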

```python
kashgari.embeddings.BERTEmbedding(model_folder: str,
                                  layer_nums: int = 4,
                                  trainable: bool = False,
                                  task: str = None,
                                  sequence_length: Union[str, int] = 'auto',
                                  processor: Optional[BaseProcessor] = None)
```

Arguments

- model_folder: path of the checkpoint folder.
- layer_nums: number of layers whose outputs will be concatenated into a single tensor. Defaults to 4, i.e. output the last 4 hidden layers, as the paper suggests.
- trainable: whether the model is trainable. Defaults to False; set it to True to fine-tune this embedding layer during your training (as shown in the sketch after this list).
- task: kashgari.CLASSIFICATION or kashgari.LABELING. The downstream task type. If you only need feature extraction, just set it to kashgari.CLASSIFICATION.
- sequence_length: 'auto', 'variable' or an integer. With 'auto', the sequence length that covers 95% of the corpus is used. With 'variable', the model input shape is set to None so it can handle inputs of various lengths, and the length of the longest sequence in each batch is used as the sequence length. With an integer, say 50, both the input and output sequence length will be set to 50.
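A minimal construction sketch tying these arguments together; the checkpoint path is a placeholder and the values are only examples, not recommended settings:

```python
import kashgari
from kashgari.embeddings import BERTEmbedding

bert_embedding = BERTEmbedding(
    model_folder='/path/to/bert/checkpoint/folder',  # placeholder path
    layer_nums=4,                   # concatenate the last 4 hidden layers
    trainable=True,                 # fine-tune the BERT weights during training
    task=kashgari.CLASSIFICATION,   # downstream task type
    sequence_length=128             # fixed length; 'auto' or 'variable' also work
)
```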

Example Usage - Text Classification

Let’s run a text classification model with BERT.

```python
sentences = [
    "Jim Henson was a puppeteer.",
    "This here's an example of using the BERT tokenizer.",
    "Why did the chicken cross the road?"
]
labels = [
    "class1",
    "class2",
    "class1"
]

########## Load Bert Embedding ##########
import kashgari
from kashgari.embeddings import BERTEmbedding

# bert_model_path should point to the pre-trained BERT checkpoint folder
bert_embedding = BERTEmbedding(bert_model_path,
                               task=kashgari.CLASSIFICATION,
                               sequence_length=128)

tokenizer = bert_embedding.tokenizer
sentences_tokenized = []
for sentence in sentences:
    sentence_tokenized = tokenizer.tokenize(sentence)
    sentences_tokenized.append(sentence_tokenized)
"""
The sentences will be tokenized into:
[
    ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '.', '[SEP]'],
    ['[CLS]', 'this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.', '[SEP]'],
    ['[CLS]', 'why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?', '[SEP]']
]
"""
# Our tokenizer already added the BOS([CLS]) and EOS([SEP]) tokens,
# so we need to disable the default add_bos_eos setting.
bert_embedding.processor.add_bos_eos = False

train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]

########## build model ##########
from kashgari.tasks.classification import CNNLSTMModel
model = CNNLSTMModel(bert_embedding)
########## /build model ##########

model.fit(
    train_x, train_y,
    validate_x, validate_y,
    epochs=3,
    batch_size=32
)

# save model
model.save('path/to/save/model/to')
```
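After training, the model can be used for prediction and reloaded from disk. A hedged sketch, assuming the model's `predict` method and the `kashgari.utils.load_model` helper:

```python
# Predict with the in-memory model; inputs are tokenized the same way as the training data
predictions = model.predict(sentences_tokenized)
print(predictions)  # e.g. ['class1', 'class2', 'class1'] -- actual output depends on training

# Reload the saved model later (assumes the kashgari.utils.load_model helper)
from kashgari.utils import load_model
loaded_model = load_model('path/to/save/model/to')
print(loaded_model.predict(sentences_tokenized))
```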

Use sentence pairs for input

Let's assume the input is the sentence pair "First do it" and "then do it right". First tokenize the sentences using the BERT tokenizer, then join them with a `[SEP]` token:

```python
sentence1 = ['First', 'do', 'it']
sentence2 = ['then', 'do', 'it', 'right']
sample = sentence1 + ["[SEP]"] + sentence2
# Add a special separation token `[SEP]` between the two sentences' tokens
# to generate a new token list:
# ['First', 'do', 'it', '[SEP]', 'then', 'do', 'it', 'right']
train_x = [sample]
```
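The joined token list is then fed to the model like any other tokenized sample; a brief sketch with a hypothetical label:

```python
# Hypothetical label for the pair sample
train_y = ['class1']

# The pair is treated like any other tokenized input
model.fit(train_x, train_y)
```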

Pre-trained models

| Model | Provider | Language | Link | Info |
| --- | --- | --- | --- | --- |
| BERT official | Google | Multi Language | link | |
| ERNIE | Baidu | Chinese | link | Unofficial TensorFlow version |
| Chinese BERT WWM | HIT & iFLYTEK Joint Lab (哈工大讯飞联合实验室) | Chinese | link | Use the TensorFlow version |
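Whichever checkpoint you download, extract the archive and point `model_folder` at the extracted directory. As a rough sketch (the path is a placeholder; the listed files match the standard Google BERT release):

```python
# A TensorFlow BERT checkpoint folder typically contains:
#   bert_config.json
#   vocab.txt
#   bert_model.ckpt.index
#   bert_model.ckpt.meta
#   bert_model.ckpt.data-00000-of-00001
import kashgari
from kashgari.embeddings import BERTEmbedding

embedding = BERTEmbedding('/path/to/extracted/checkpoint',  # placeholder path
                          task=kashgari.LABELING,
                          sequence_length='variable')
```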