1、Introduction

  • (The following describes the current【2022-06-15】version; this information may be updated later.)
  • LASER:Language-Agnostic SEntence Representations
  • Reference paper:【2019-09-25】Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
  • LASER provides an encoder that was trained on 93 languages, written in 23 different alphabets. This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, as well as various minority languages and dialects.
    • 93 languages:Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.
  • All these languages are encoded by the same BiLSTM encoder, and there is no need to specify the input language (but tokenization is language specific). In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.
  • We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.:
    • Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.
  • We also have some evidence that the encoder can generalize to languages not seen during training, provided they belong to a language family that is covered by other training languages.
  • We provide a test set for more than 100 languages (under /LASER/data/tatoeba/v1) based on the Tatoeba corpus.

  • Docker-based installation and usage:

  • Dependency installation(pip install xxx):
    • numpy
    • fastBPE
    • fairseq(installing this package also installs torch automatically)

    • Python 3.6
    • PyTorch 1.0
    • NumPy, tested with 1.15.4
    • Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
    • Faiss, for fast similarity search and bitext mining
    • transliterate 1.10.2, only used for Greek (pip install transliterate)
    • jieba 0.39, Chinese segmenter (pip install jieba)
    • mecab 0.996, Japanese segmenter
    • tokenization from the Moses encoder (installed automatically)
    • FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
    • Fairseq, sequence modeling toolkit (pip install fairseq==0.10.2)
    • Sentencepiece, subword tokenization (installed automatically)

```shell
conda create --force -y --name jy_laser python=3.6
conda activate jy_laser

git clone https://github.com/facebookresearch/LASER.git

# jy: /path/to/LASER
export LASER="/home/huangjiayue/06_LASER/LASER"
echo $LASER

cd LASER

# jy: download encoders from Amazon S3; the corresponding models are saved
# into the ./models directory.
bash ./install_models.sh
:<<!
Downloading networks
 - creating directory /home/huangjiayue/06_LASER/LASER/models
 - bilstm.eparl21.2018-11-19.pt
 - eparl21.fcodes
 - eparl21.fvocab
 - bilstm.93langs.2018-12-26.pt
 - 93langs.fcodes
 - 93langs.fvocab
!

# jy: download the wmt22 models, which may be needed when computing sentence
# embeddings.
bash ./tasks/wmt22/download_models.sh

# jy: download third party software; the relevant tools are saved into the
# ./tools-external directory.
bash ./install_external_tools.sh

# jy: download the data used in the example tasks (see description for each
# task).
```
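After the downloads above, a quick way to confirm which encoder checkpoints actually landed in `./models` is a short Python check. This is only a sketch; `list_encoders` is a hypothetical helper, not part of the LASER code base:

```python
import os
from pathlib import Path


def list_encoders(laser_root):
    """Return the encoder checkpoint (*.pt) file names under <root>/models."""
    return sorted(p.name for p in Path(laser_root, "models").glob("*.pt"))


# Only runs when the LASER environment variable from the steps above is set.
if "LASER" in os.environ:
    print(list_encoders(os.environ["LASER"]))
```

With the downloads above you would expect to see `bilstm.93langs.2018-12-26.pt` and `bilstm.eparl21.2018-11-19.pt` in the output.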

## 2)Usage(Task)
  - (i.e., example applications of multilingual sentence embeddings)
  - Cross-lingual document classification (using the MLDoc corpus)
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/mldoc](https://github.com/facebookresearch/LASER/tree/main/tasks/mldoc)
    - MLDoc corpus:[https://github.com/facebookresearch/MLDoc](https://github.com/facebookresearch/MLDoc)
      - 【2018-00-00】A Corpus for Multilingual Document Classification in Eight Languages
  - WikiMatrix
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix](https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix)
    - 【2019-07-16】WikiMatrix:Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
  - Bitext mining (using the BUCC corpus)
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/bucc](https://github.com/facebookresearch/LASER/tree/main/tasks/bucc)
    - BUCC corpus:[https://comparable.limsi.fr/bucc2018/bucc2018-task.html](https://comparable.limsi.fr/bucc2018/bucc2018-task.html)
    - 【2018-07-15】Filtering and Mining Parallel Data in a Joint Multilingual Space
    - 【2019-08-07】Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
  - Cross-lingual NLI (using the XNLI corpus)
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/xnli](https://github.com/facebookresearch/LASER/tree/main/tasks/xnli)
    - 【2018-11-04】XNLI:Evaluating Cross-lingual Sentence Representations
  - Multilingual similarity search
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/similarity](https://github.com/facebookresearch/LASER/tree/main/tasks/similarity)
    - 【2017-08-08】Learning Joint Multilingual Sentence Representations with Neural Machine Translation
  - Calculate sentence embeddings for arbitrary text files in any of the supported languages
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/embed](https://github.com/facebookresearch/LASER/tree/main/tasks/embed)
  - Librivox S2S:Speech-to-Speech translations automatically mined in Librivox
    - [https://github.com/facebookresearch/LASER/tree/main/tasks/librivox-s2s](https://github.com/facebookresearch/LASER/tree/main/tasks/librivox-s2s)
    - 【2021-10-27】Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
  - CCMatrix
    - [https://github.com/facebookresearch/LASER/blob/main/tasks/CCMatrix](https://github.com/facebookresearch/LASER/blob/main/tasks/CCMatrix)
    - 【2020-05-01】CCMatrix:Mining Billions of High-Quality Parallel Sentences on the WEB
  - **For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.**
# 3、Task
## (1)sentence encoders for WMT '22 shared task - data track
  - Reference:[https://statmt.org/wmt22/large-scale-multilingual-translation-task.html](https://statmt.org/wmt22/large-scale-multilingual-translation-task.html)
  - To download encoders for all 24 supported languages, run the `download_models.sh` script within this directory:
    - `bash ./download_models.sh`
  - This will place all supported models within the directory: `$LASER/models/wmt22`
  - **Note**: encoders for each focus language are in the format: `laser3-xxx`, except for Afrikaans (afr), English (eng), and French (fra), which are all supported by the laser2 model.
  - Available languages are: amh, fuv, hau, ibo, kam, kin, lin, lug, luo, nso, nya, orm, sna, som, ssw, swh, tsn, tso, umb, wol, xho, yor and zul.
  - Once all encoders are downloaded, you can begin embedding texts (see subsection (2) below).
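The note above implies a simple mapping from a language code to its encoder name. A minimal sketch of that convention (the `encoder_name` helper is hypothetical, not part of LASER):

```python
# Naming convention described above: each focus language has a laser3-xxx
# encoder, while afr/eng/fra are all covered by the single laser2 model.
LASER2_LANGS = {"afr", "eng", "fra"}


def encoder_name(lang_iso3):
    # hypothetical helper, not part of the LASER code base
    return "laser2" if lang_iso3 in LASER2_LANGS else f"laser3-{lang_iso3}"


print(encoder_name("hau"))  # laser3-hau
print(encoder_name("eng"))  # laser2
```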
## (2)Sentence embedding representations(calculation of sentence embeddings)
  - Note: the latest code has changed slightly from earlier versions and its full logic has not been written up here yet【undo】; as a workaround, sentence embeddings can be obtained indirectly with the older models via the approach in "(3)multilingual similarity search".
  - Task code directory:
    - `cd /path/to/LASER/tasks/embed/`
  - Tool to calculate sentence embeddings for an arbitrary text file:
    - `bash ./embed.sh <inputFile> <outputEmbeddingFile> [LANGUAGE (ISO3)]`
    - The ISO3 language code corresponding to Chinese should be `zho`?
      - [https://en.wikipedia.org/wiki/ISO_639-3](https://en.wikipedia.org/wiki/ISO_639-3)
  - The input will first be tokenized, and then sentence embeddings will be generated. If a language is specified, then `embed.sh` will look for a language-specific encoder (specified by a three-letter language code). Otherwise it will default to `LASER2`, which covers 93 languages.
    - Reference:【2019-09-25】Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
  - **Examples**
    - In order to encode an input text in any of the 93 languages supported by `LASER2` (e.g. `afr`, `eng`, `fra`):
      - `./embed.sh input_file output_file`
    - To use a language-specific encoder (if available), for example for Wolof, Hausa, or Oromo:
      - `./embed.sh input_file output_file wol`
      - `./embed.sh input_file output_file hau`
      - `./embed.sh input_file output_file orm`
  - The output embeddings are stored as float32 matrices in raw binary format. They can be read in Python by:
```python
import numpy as np
import torch


def get_embeddings(f_name):
    dim = 1024
    # X is a N x 1024 matrix where N is the number of lines in the text file.
    X = np.fromfile(f_name, dtype=np.float32, count=-1)
    X.resize(X.shape[0] // dim, dim)
    # print(X.shape)
    tensor_x = torch.from_numpy(X)
    # jy: output shape: (number of sentence lines in the file, 1024)
    print(tensor_x.shape)
    return tensor_x


"""
f_name = "abst_en.emb"
get_embeddings(f_name)
"""
```
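As a sanity check of the raw-binary layout described above, the sketch below writes a small float32 matrix the same way LASER stores embeddings (a flat stream via `tofile`) and reads it back as `get_embeddings` does; the temporary file name is a placeholder:

```python
import tempfile

import numpy as np

dim = 1024
emb = np.random.rand(3, dim).astype(np.float32)

# Write as a flat raw float32 stream, matching the on-disk embedding format.
with tempfile.NamedTemporaryFile(suffix=".emb", delete=False) as f:
    emb.tofile(f.name)

# Read back and restore the (N, dim) shape by reshaping the flat array.
restored = np.fromfile(f.name, dtype=np.float32).reshape(-1, dim)
assert np.array_equal(emb, restored)
print(restored.shape)  # (3, 1024)
```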

## (3)multilingual similarity search

  • This code shows how to embed an N-way parallel corpus (we use the publicly available newstest2012 from WMT 2012), and how to calculate the similarity search error rate for each language pair.
  • For each sentence in the source language, we calculate the closest sentence in the joint embedding space in the target language. If this sentence has the same index in the file, it is considered correct, otherwise as an error. Therefore, the N-way parallel corpus should not contain duplicates.
  • Simply run `bash ./wmt.sh` to download the data and calculate the sentence embeddings and the similarity search error rate for each language pair.
  • You should get the following similarity search errors:

| | cs | de | en | es | fr | avg |
| --- | --- | --- | --- | --- | --- | --- |
| cs | 0.00% | 0.70% | 0.90% | 0.67% | 0.77% | 0.76% |
| de | 0.83% | 0.00% | 1.17% | 0.90% | 1.03% | 0.98% |
| en | 0.93% | 1.27% | 0.00% | 0.83% | 1.07% | 1.02% |
| es | 0.53% | 0.77% | 0.97% | 0.00% | 0.57% | 0.71% |
| fr | 0.50% | 0.90% | 1.13% | 0.60% | 0.00% | 0.78% |
| avg | 0.70% | 0.91% | 1.04% | 0.75% | 0.86% | 1.06% |
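The index-matching procedure described above can be sketched in a few lines of NumPy. This is a simplified cosine-similarity version for illustration, not the actual `wmt.sh` implementation:

```python
import numpy as np


def similarity_error_rate(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest target sentence
    (by cosine similarity) does not share the same line index."""
    # L2-normalize so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    # For each source sentence, the index of the closest target sentence.
    nearest = np.argmax(src @ tgt.T, axis=1)
    return float(np.mean(nearest != np.arange(len(src_emb))))


# Identical source/target corpora must give a 0% error rate.
emb = np.random.rand(10, 1024).astype(np.float32)
print(similarity_error_rate(emb, emb))  # 0.0
```

This is also why the corpus must not contain duplicates: a duplicate sentence can be the nearest neighbor under a different index and would be counted as an error.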

4、References