# 1. Introduction
- (The information below reflects the repository as of 2022-06-15; it may change in later updates.)
- LASER: Language-Agnostic SEntence Representations
- Reference paper: 【2019-09-25】Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
- Provides an encoder trained on 93 languages, written in 23 different alphabets. This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, as well as various minority languages and dialects.
- 93 languages:Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.
- All these languages are encoded by the same BiLSTM encoder, and there is no need to specify the input language (but tokenization is language specific). In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words from several different languages (see the Python sketch after this list).
- We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.:
- Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.
- We also have some evidence that the encoder can generalize to other languages which have not been seen during training, but which belong to a language family covered by other training languages.
- We provide a test set for more than 100 languages (under `/LASER/data/tatoeba/v1`) based on the Tatoeba corpus.
   - test set for more than 100 languages: https://github.com/facebookresearch/LASER/tree/main/data/tatoeba/v1
   - Tatoeba corpus: https://tatoeba.org/zh-cn
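As a quick way to try the shared embedding space from Python, the sketch below uses the third-party `laserembeddings` pip package (a wrapper around the LASER encoder, not part of this repository; its models must first be fetched with `python -m laserembeddings download-models`). The sentences and the similarity check are purely illustrative.

```python
import numpy as np
from laserembeddings import Laser

# Assumes `pip install laserembeddings` and
# `python -m laserembeddings download-models` have been run beforehand.
laser = Laser()

# One English sentence and its French translation (illustrative examples).
emb = laser.embed_sentences(
    ["The weather is nice today.", "Il fait beau aujourd'hui."],
    lang=["en", "fr"],  # tokenization is language specific
)
# emb is a (2, 1024) float32 array in the joint embedding space.

# Translations of the same sentence should land close together.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print("cosine similarity:", float(emb[0] @ emb[1]))
```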
# 2. Installation & Usage
## 1) Installation
- Installation and usage via Docker is also an option; the steps below set everything up manually.
- Install the dependency packages (`pip install xxx`):
   - numpy
   - fastBPE
   - fairseq (installing this package also installs torch)
- Python 3.6
- PyTorch 1.0
- NumPy, tested with 1.15.4
- Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
- Faiss, for fast similarity search and bitext mining
- transliterate 1.10.2, only used for Greek (pip install transliterate)
- jieba 0.39, Chinese segmenter (pip install jieba)
- mecab 0.996, Japanese segmenter
- tokenization from the Moses encoder (installed automatically)
- FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
- Fairseq, sequence modeling toolkit (`pip install fairseq==0.10.2`)
- Sentencepiece, subword tokenization (installed automatically)
```shell
conda create --force -y --name jy_laser python=3.6
conda activate jy_laser
git clone https://github.com/facebookresearch/LASER.git
# jy: /path/to/LASER
export LASER="/home/huangjiayue/06_LASER/LASER"
echo $LASER
cd LASER
# jy: download encoders from Amazon S3; the models will be downloaded into the ./models directory
bash ./install_models.sh
:<<!
Downloading networks
- creating directory /home/huangjiayue/06_LASER/LASER/models
- bilstm.eparl21.2018-11-19.pt
- eparl21.fcodes
- eparl21.fvocab
- bilstm.93langs.2018-12-26.pt
- 93langs.fcodes
 - 93langs.fvocab
!
# jy: download the wmt22 models; they may be needed when computing sentence embeddings
bash ./tasks/wmt22/download_models.sh
# jy: download third-party software; the tools will be downloaded into the ./tools-external directory
bash ./install_external_tools.sh
# jy: download the data used in the example tasks (see the description for each task)
```
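After the download scripts finish, a quick sanity check like the following confirms that the 93-language encoder files listed above ended up under `$LASER/models` (a minimal sketch; adjust the file names if you downloaded other models).

```python
import os

# Assumes the LASER environment variable was exported as shown above.
laser_dir = os.environ["LASER"]
for name in ["bilstm.93langs.2018-12-26.pt", "93langs.fcodes", "93langs.fvocab"]:
    path = os.path.join(laser_dir, "models", name)
    print(path, "OK" if os.path.isfile(path) else "MISSING")
```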
## 2) Usage (Tasks)
- (Example applications of the multilingual sentence embeddings.)
- Cross-lingual document classification (using the MLDoc corpus)
- [https://github.com/facebookresearch/LASER/tree/main/tasks/mldoc](https://github.com/facebookresearch/LASER/tree/main/tasks/mldoc)
- MLDoc corpus:[https://github.com/facebookresearch/MLDoc](https://github.com/facebookresearch/MLDoc)
- 【2018-00-00】A Corpus for Multilingual Document Classification in Eight Languages
- WikiMatrix
- [https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix](https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix)
- 【2019-07-16】WikiMatrix:Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
- Bitext mining (using the BUCC corpus)
- [https://github.com/facebookresearch/LASER/tree/main/tasks/bucc](https://github.com/facebookresearch/LASER/tree/main/tasks/bucc)
- BUCC corpus:[https://comparable.limsi.fr/bucc2018/bucc2018-task.html](https://comparable.limsi.fr/bucc2018/bucc2018-task.html)
- 【2018-07-15】Filtering and Mining Parallel Data in a Joint Multilingual Space
- 【2019-08-07】Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
- Cross-lingual NLI (using the XNLI corpus)
- [https://github.com/facebookresearch/LASER/tree/main/tasks/xnli](https://github.com/facebookresearch/LASER/tree/main/tasks/xnli)
- 【2018-11-04】XNLI:Evaluating Cross-lingual Sentence Representations
- Multilingual similarity search
- [https://github.com/facebookresearch/LASER/tree/main/tasks/similarity](https://github.com/facebookresearch/LASER/tree/main/tasks/similarity)
- 【2017-08-08】Learning Joint Multilingual Sentence Representations with Neural Machine Translation
- Calculate sentence embeddings for arbitrary text files in any of the supported languages
- [https://github.com/facebookresearch/LASER/tree/main/tasks/embed](https://github.com/facebookresearch/LASER/tree/main/tasks/embed)
- Librivox S2S:Speech-to-Speech translations automatically mined in Librivox
- [https://github.com/facebookresearch/LASER/tree/main/tasks/librivox-s2s](https://github.com/facebookresearch/LASER/tree/main/tasks/librivox-s2s)
- 【2021-10-27】Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
- CCMatrix
- [https://github.com/facebookresearch/LASER/blob/main/tasks/CCMatrix](https://github.com/facebookresearch/LASER/blob/main/tasks/CCMatrix)
- 【2020-05-01】CCMatrix:Mining Billions of High-Quality Parallel Sentences on the WEB
- **For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.**
# 3. Tasks
## (1) Sentence encoders for WMT '22 shared task - data track
- Reference: [https://statmt.org/wmt22/large-scale-multilingual-translation-task.html](https://statmt.org/wmt22/large-scale-multilingual-translation-task.html)
- To download encoders for all 24 supported languages, please run the `download_models.sh` script within this directory:
- `bash ./download_models.sh`
- This will place all supported models within the directory: `$LASER/models/wmt22`
- **Note**: encoders for each focus language are in the format: `laser3-xxx`, except for Afrikaans (afr), English (eng), and French (fra) which are all supported by the laser2 model.
- Available languages are: amh, fuv, hau, ibo, kam, kin, lin, lug, luo, nso, nya, orm, sna, som, ssw, swh, tsn, tso, umb, wol, xho, yor and zul
- Once all encoders are downloaded, you can then begin embedding texts (see subsection (2) below).
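The naming rule above (a dedicated `laser3-xxx` encoder per focus language, with afr/eng/fra handled by `laser2`) can be captured in a small helper. This only illustrates the rule; the actual file names under `$LASER/models/wmt22` may carry extra suffixes.

```python
# Focus languages listed above, each served by its own laser3-xxx encoder.
WMT22_FOCUS_LANGS = {
    "amh", "fuv", "hau", "ibo", "kam", "kin", "lin", "lug", "luo", "nso",
    "nya", "orm", "sna", "som", "ssw", "swh", "tsn", "tso", "umb", "wol",
    "xho", "yor", "zul",
}


def wmt22_encoder_name(lang: str) -> str:
    """Return the encoder family used for a WMT '22 language code (ISO3)."""
    if lang in {"afr", "eng", "fra"}:
        return "laser2"
    if lang in WMT22_FOCUS_LANGS:
        return f"laser3-{lang}"
    raise ValueError(f"{lang} is not covered by the WMT '22 encoders")


print(wmt22_encoder_name("wol"))  # laser3-wol
print(wmt22_encoder_name("eng"))  # laser2
```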
## (2) Calculating sentence embeddings
- Note: the latest code differs slightly from earlier versions and the full workflow has not been written up here yet [TODO]; sentence embeddings can also be obtained indirectly with the older model using the approach in "(3) Multilingual similarity search".
- Task code directory:
- `cd /path/to/LASER/tasks/embed/`
- Tool to calculate sentence embeddings for an arbitrary text file:
- `bash ./embed.sh <inputFile> <outputEmbeddingFile> [LANGUAGE (ISO3)]`
- The ISO3 language code for Chinese is presumably `zho` (to be confirmed).
- [https://en.wikipedia.org/wiki/ISO_639-3](https://en.wikipedia.org/wiki/ISO_639-3)
- The input will first be tokenized, and then sentence embeddings will be generated. If a language is specified, `embed.sh` will look for a language-specific encoder (specified by a three-letter language code); otherwise it will default to `LASER2`, which covers 93 languages.
- Reference: 【2019-09-25】Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
- **Examples**
- In order to encode an input text in any of the 93 languages supported by `LASER2` (e.g. `afr`, `eng`, `fra`):
- `./embed.sh input_file output_file`
- To use a language-specific encoder (if available), such as Wolof, Hausa, or Oromo:
- `./embed.sh input_file output_file wol`
- `./embed.sh input_file output_file hau`
- `./embed.sh input_file output_file orm`
- The output embeddings are stored in float32 matrices in raw binary format. They can be read in Python by:
```python
import numpy as np
import torch


def get_embeddings(f_name):
    dim = 1024
    # X is an N x 1024 matrix, where N is the number of lines in the text file.
    X = np.fromfile(f_name, dtype=np.float32, count=-1)
    X.resize(X.shape[0] // dim, dim)
    # print(X.shape)
    tensor_x = torch.from_numpy(X)
    # jy: the resulting shape is (number of sentence lines in the file, 1024)
    print(tensor_x.shape)
    return tensor_x


if __name__ == "__main__":
    f_name = "abst_en.emb"
    get_embeddings(f_name)
```
## (3) Multilingual similarity search
- This code shows how to embed an N-way parallel corpus (we use the publicly available newstest2012 from WMT 2012) and how to calculate the similarity-search error rate for each language pair.
- For each sentence in the source language, we find the closest sentence in the target language in the joint embedding space. If that sentence has the same index in the file, it is counted as correct; otherwise it is counted as an error (see the sketch after the results table below). Therefore, the N-way parallel corpus should not contain duplicates.
- Simply run the script `bash ./wmt.sh` to download the data, calculate the sentence embeddings, and compute the similarity-search error rate for each language pair.
- You should get the following similarity-search errors:

| | cs | de | en | es | fr | avg |
| --- | --- | --- | --- | --- | --- | --- |
| cs | 0.00% | 0.70% | 0.90% | 0.67% | 0.77% | 0.76% |
| de | 0.83% | 0.00% | 1.17% | 0.90% | 1.03% | 0.98% |
| en | 0.93% | 1.27% | 0.00% | 0.83% | 1.07% | 1.02% |
| es | 0.53% | 0.77% | 0.97% | 0.00% | 0.57% | 0.71% |
| fr | 0.50% | 0.90% | 1.13% | 0.60% | 0.00% | 0.78% |
| avg | 0.70% | 0.91% | 1.04% | 0.75% | 0.86% | 1.06% |
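As an illustration of the error-rate computation described above, here is a minimal NumPy sketch for a single language pair. It assumes two embedding files produced by `embed.sh` for the same N-way parallel corpus (the file names below are placeholders) and uses exhaustive cosine similarity instead of the FAISS-based search the repository relies on for large-scale mining.

```python
import numpy as np


def load_emb(f_name, dim=1024):
    # Raw float32 binary format written by embed.sh (see section (2) above).
    x = np.fromfile(f_name, dtype=np.float32).reshape(-1, dim)
    # L2-normalise so that the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def similarity_search_error(src_emb, tgt_emb):
    # For every source sentence, pick the closest target sentence; it counts
    # as an error unless it sits at the same line index.
    sims = src_emb @ tgt_emb.T  # (N, N) cosine similarities
    nearest = sims.argmax(axis=1)
    return 100.0 * (nearest != np.arange(len(src_emb))).mean()


# Placeholder file names for an N-way parallel corpus such as newstest2012.
src = load_emb("newstest2012.en.emb")
tgt = load_emb("newstest2012.fr.emb")
print(f"en -> fr similarity search error: {similarity_search_error(src, tgt):.2f}%")
```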