0. Background
- Sentence alignment is the task of taking parallel documents (which have been split into sentences) and finding a bipartite graph which matches minimal groups of sentences that are translations of each other.
- Sentence alignment takes the sentences of the source and target documents and locates minimal groups of sentences which are translations of each other. (Figure: a worked example alignment.)
- Sentence-aligned bitext is used to train nearly all machine translation (MT) systems. Alignment errors have been noted to have a small effect on statistical MT performance (【2012-00-00】The impact of sentence alignment errors on phrase-based machine translation performance); however, misaligned sentences have been shown to be much more detrimental to neural MT (NMT) (【2018-07-20】On the Impact of Various Types of Noise on Neural Machine Translation).
- Automatic sentence alignment can be roughly decomposed into two parts:
- A score function which takes one or more adjacent source sentences and one or more adjacent target sentences and returns a score indicating the likelihood that they are translations of each other.
- An alignment algorithm which, using the score function above, takes in two documents and returns a hypothesis alignment.
We improve both parts, presenting:
- A novel scoring function based on normalized cosine distance between multilingual sentence embeddings.
- A novel application of a dynamic programming approximation which makes our algorithm linear in time and space complexity with respect to the number of sentences being aligned.
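In code terms, the two-part decomposition above might look like the following type sketch (all names here are hypothetical, for illustration only; not from the paper):

```python
from typing import Callable, List, Sequence, Tuple

# One alignment unit: (indices of adjacent source sentences,
#                      indices of adjacent target sentences),
# e.g. ([3], [4, 5]) for a 1-2 alignment.
Alignment = List[Tuple[List[int], List[int]]]

# Part 1: a score function over a block of adjacent source sentences and a
# block of adjacent target sentences, indicating how likely they are to be
# mutual translations.
ScoreFn = Callable[[Sequence[str], Sequence[str]], float]

# Part 2: an alignment algorithm that uses the score function to turn two
# documents into a hypothesis alignment.
AlignFn = Callable[[Sequence[str], Sequence[str], ScoreFn], Alignment]
```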
Early sentence aligners use scoring functions based only on the number of words or characters in each sentence, and alignment algorithms based on dynamic programming (DP).
- DP is $O(NM)$ in time complexity, where $N$ and $M$ are the number of sentences in the source and target documents.
- Later work added lexical features and heuristics to speed up search, such as limiting the search space to be near the diagonal.
More recent work introduced scoring methods that use MT to get both documents into the same language (Bleualign) or use pruned phrase tables from a statistical MT system (Coverage-Based). Both methods "anchor" high-probability 1–1 alignments in the search space and then fill in and refine alignments. Locating anchors is $O(NM)$ in time complexity.
1. Method
Assume non-crossing alignments but allow local sentence reordering within an alignment.
Vecalign: a novel bilingual sentence alignment method which is linear in time and space with respect to the number of sentences being aligned and which requires only bilingual sentence embeddings.
- based on similarity of sentence embeddings and a DP approximation which is fast even for long documents.
- A distinct but non-obvious advantage of sentence embeddings is that blocks of sentences can be represented as the average of their sentence embeddings.
- The size of the resulting vector is not dependent on the number of sentence embeddings being averaged, thus the time/space cost of comparing the similarity of blocks of sentences does not depend on the number of sentences being compared.
- We show empirically that average embeddings for blocks of sentences are sufficient to produce approximate alignments, even in low-resource languages. This enables us to approximate DP in linear time and space (see the block-embedding sketch below).
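To make the block-embedding trick concrete, here is a minimal numpy sketch (my own illustration, not the paper's code): averaging keeps the vector size fixed, so comparing blocks costs the same as comparing single sentences.

```python
import numpy as np

def block_embedding(sent_embs: np.ndarray) -> np.ndarray:
    """Represent a block of sentences by the average of their embeddings.
    Input is (num_sentences, dim); output is always (dim,), no matter how
    many sentences the block contains."""
    return sent_embs.mean(axis=0)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dim = 1024  # LASER embeddings are 1024-dimensional
# Comparing a 3-sentence block to a 2-sentence block costs the same as
# comparing two single sentences: one dot product over `dim` values.
src_block = block_embedding(np.random.randn(3, dim))
tgt_block = block_embedding(np.random.randn(2, dim))
print(cosine_sim(src_block, tgt_block))
```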
(1) Bilingual Sentence Embeddings
Use the publicly available LASER multilingual sentence embedding method and model (which is pretrained on 93 languages).
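For reference, one way to obtain such embeddings is the third-party `laserembeddings` package, a pip-installable wrapper around the pretrained LASER model (this is a tooling assumption on my part, not something the paper prescribes):

```python
# pip install laserembeddings
# python -m laserembeddings download-models
from laserembeddings import Laser

laser = Laser()
# Returns a numpy array of shape (num_sentences, 1024), in a space shared
# across all languages the model was pretrained on.
embs = laser.embed_sentences(["Hello, world.", "How are you?"], lang="en")
print(embs.shape)
```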
Cosine similarity is an obvious choice for comparing embeddings but has been noted to be globally inconsistent due to "hubness". References:
- 【2010-10-09】Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
- 【2015-00-00】Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning
- Approaches to address hubness:
- supervised training approach for calibration
- 【2018-11-01】Effective Parallel Corpus Mining using Bilingual Sentence Embeddings
- normalization using nearest neighbors
- 【2019-08-07】Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
- This paper proposes: normalizing instead with randomly selected embeddings, as this has linear complexity.
- Sentence alignment seeks minimal parallel units, but we find that DP with cosine similarity favors many-to-many alignments (e.g., reporting a 3–3 alignment when it should report three 1–1 alignments).
- To remedy this issue, we scale the cost by the number of source and target sentences being considered in a given alignment.
- Our resulting scoring cost function is:

$$c(x, y) = \frac{(1 - \cos(x, y)) \,\mathrm{nSents}(x)\,\mathrm{nSents}(y)}{\sum_{s=1}^{S} \left(1 - \cos(x, y_s)\right) + \sum_{s=1}^{S} \left(1 - \cos(x_s, y)\right)}$$

- $x$, $y$: one or more sequential sentences from the source/target document.
- $\cos(x, y)$: cosine similarity between the embeddings of $x$ and $y$.
- $\mathrm{nSents}(\cdot)$: the number of sentences in $x$ or $y$.
- $x_1, \dots, x_S$ and $y_1, \dots, y_S$: sampled uniformly from the given document.
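A direct numpy rendering of this cost function (a sketch under the definitions above; the variable names are mine):

```python
import numpy as np

def csim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_cost(x_embs: np.ndarray, y_embs: np.ndarray,
                   x_samples: np.ndarray, y_samples: np.ndarray) -> float:
    """Normalized-cosine cost for aligning one block of adjacent source
    sentences (x_embs: (n_x, dim)) to one block of adjacent target sentences
    (y_embs: (n_y, dim)). x_samples / y_samples are (S, dim) embeddings drawn
    uniformly at random from the source / target document, once per document."""
    x = x_embs.mean(axis=0)  # block embedding of the source side
    y = y_embs.mean(axis=0)  # block embedding of the target side
    numer = (1.0 - csim(x, y)) * len(x_embs) * len(y_embs)
    denom = sum(1.0 - csim(x, ys) for ys in y_samples) + \
            sum(1.0 - csim(xs, y) for xs in x_samples)
    return numer / denom
```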
Following standard practice, we model insertions and deletions in DP using a skip cost $c_{\text{skip}}$:
- The raw value of $c_{\text{skip}}$ is only meaningful when compared to other costs, thus we do not expect it to generalize across different languages, normalizations, or resolutions.
- We propose specifying instead a parameter which defines the skip cost in terms of the distribution of 1–1 alignment costs at alignment time (e.g., as a chosen percentile of sampled 1–1 costs; see the sketch below).
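One plausible realization of this idea (the percentile default below is my assumption; the text only says the skip cost is defined relative to the distribution of 1–1 costs):

```python
import numpy as np

def skip_cost(sampled_one_to_one_costs: np.ndarray, pct: float = 20.0) -> float:
    """Define the insertion/deletion cost as the pct-th percentile of 1-1
    alignment costs sampled at alignment time, so it adapts to the language
    pair, normalization, and resolution rather than being a fixed constant."""
    return float(np.percentile(sampled_one_to_one_costs, pct))
```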
Instead of searching all possible sentence alignments via DP, consider first averaging adjacent pairs of sentence embeddings in both the source and target documents, halving the number of embeddings for each document.
- Aligning these vectors via DP (each of which is the average of 2 sentence embeddings) produces an *approximate* sentence alignment, at a cost of $\frac{N}{2}\cdot\frac{M}{2} = \frac{NM}{4}$ comparisons.
- We can then refine this approximate alignment using the original sentence vectors, constraining ourselves to a small window around the approximate alignment.
- At a minimum, we must search a window of size $w$ (default 10) large enough to consider all paths covered by the lower-resolution alignment path, but $w$ can also be increased to allow recovery from small errors in the approximate alignment.
- The length of the refinement path to search is at most $N + M$ (all deletions/insertions), so refining the path requires at most $w(N + M)$ comparisons.
- Thus the full $NM$ comparisons can be approximated by $\frac{NM}{4} + w(N + M)$ comparisons.
- Applied recursively, we can approximate our quadratic $NM$ cost with a sum of linear costs: $w(N+M) + w\left(\frac{N}{2}+\frac{M}{2}\right) + w\left(\frac{N}{4}+\frac{M}{4}\right) + \dots \leq 2w(N+M)$.
- In practice, we compute the full DP alignment once the downsampled sizes are below an acceptably small constant. We also find that vectors for large blocks of sentences become correlated with each other, so we center them around **0**.
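The overall coarse-to-fine recursion can be sketched as follows (illustrative Python, not the released Vecalign implementation; `dp_align` and `refine_within_window` are hypothetical placeholders, the `window` default of 10 comes from the text, and `min_size` stands in for the paper's "acceptably small constant"):

```python
import numpy as np

def downsample(embs: np.ndarray) -> np.ndarray:
    """Average adjacent pairs of embeddings, halving the document length."""
    n = len(embs) // 2 * 2  # drop a trailing odd sentence, if any
    return (embs[0:n:2] + embs[1:n:2]) / 2.0

def dp_align(src: np.ndarray, tgt: np.ndarray):
    """Placeholder: full O(NM) DP alignment, run only on small inputs."""
    raise NotImplementedError

def refine_within_window(src, tgt, coarse_path, window: int):
    """Placeholder: re-run DP at this resolution, restricted to cells within
    `window` of the coarse path upsampled to this resolution; this costs at
    most about window * (len(src) + len(tgt)) comparisons."""
    raise NotImplementedError

def coarse_to_fine_align(src, tgt, window: int = 10, min_size: int = 100):
    """Downsample recursively until the problem is small, solve it with full
    DP, then refine the path one resolution at a time on the way back up."""
    if len(src) <= min_size or len(tgt) <= min_size:
        return dp_align(src, tgt)
    coarse = coarse_to_fine_align(downsample(src), downsample(tgt),
                                  window, min_size)
    return refine_within_window(src, tgt, coarse, window)
```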
- Illustration of this method: 1–1 alignment costs (darker = lower) for the first 88 De lines (x-axis) and 128 Fr lines (y-axis) at 4 different resolutions (window size increased for visualization purposes):
- The red highlight denotes alignment found by DP.
- The algorithm only searches near the path found at previous resolutions, light blue regions are excluded.
- The vertical part of the path in the top left of each plot is due to 36 extra lines being present in the Fr document.
- We consider only insertions, deletions, and 1–1 alignments in all but the final search.
Recursive downsampling and refinement of DP was proposed for dynamic time warping (【2007-00-00】Toward accurate dynamic time warping in linear time and space) but had not previously been applied to sentence alignment. We direct the reader to that work for a more formal analysis showing that the time/space complexity is linear.
2. Experimental Results
On a standard German–French test set, Vecalign outperforms the previous state-of-the-art method (which has quadratic time complexity and requires a machine translation system) by 5 F1 points.
- This indicates that our proposed score function outperforms prior work and that the approximations we make in alignment are sufficiently accurate.
- It substantially outperforms the popular Hunalign toolkit at recovering Bible verse alignments in medium- to low-resource language pairs, and it improves downstream MT quality by 1.7 and 1.6 BLEU in Sinhala→English and Nepali→English, respectively, compared to the Hunalign-based Paracrawl pipeline.
- Our method has state-of-the-art accuracy in high and low resource settings and improves downstream MT quality.
- 【2019-11-03】Vecalign: Improved Sentence Alignment in Linear Time and Space
- 【2020-11-16】Exploiting Sentence Order in Document Alignment