ABSTRACT
    Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents.
    In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged.
    Specifically, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned.
    Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including
    FUNSD (0.7895 → 0.8420),
    CORD (0.9493 → 0.9601),
    SROIE (0.9524 → 0.9781),
    Kleister-NDA (0.834 → 0.852),
    RVL-CDIP (0.9443 → 0.9564),
    and DocVQA (0.7295 → 0.8672).
    The pre-trained LayoutLMv2 model is publicly available at https://aka.ms/layoutlmv2.

    1 INTRODUCTION
    Visually-rich Document Understanding (VrDU) aims to analyze scanned/digital-born business documents (images, PDFs, etc.) so that structured information can be automatically extracted and organized for many business applications. Distinct from conventional information extraction tasks, the VrDU task relies not only on textual information but also on the visual and layout information that is vital for visually-rich documents. For instance, the documents in Figure 1 include a variety of types such as digital forms, receipts, invoices and financial reports. Different types of documents indicate that the text fields of interest are located at different positions within the document, which is often determined by the style and format of each type as well as the document content. Therefore, to accurately recognize the text fields of interest, it is essential to take advantage of the cross-modality nature of visually-rich documents, where the textual, visual and layout information should be jointly modeled and learned end-to-end in a single framework.

    The recent progress of VrDU lies primarily in two directions. The first direction is usually built on the shallow fusion between textual and visual/layout/style information (Yang et al., 2017a; Liu et al., 2019; Sarkhel & Nandi, 2019; Yu et al., 2020; Majumder et al., 2020; Wei et al., 2020; Zhang et al., 2020). These approaches leverage pre-trained NLP and CV models individually and combine the information from multiple modalities for supervised learning. Although good performance has been achieved, these models often need to be re-trained from scratch once the document type is changed. In addition, the domain knowledge of one document type cannot be easily transferred to another document type, so the local invariance in general document layouts (e.g. key-value pairs in a left-right layout, tables in a grid layout, etc.) cannot be fully exploited. To this end, the second direction relies on the deep fusion of textual, visual and layout information from a great number of unlabeled documents in different domains, where pre-training techniques play an important role in learning the cross-modality interaction in an end-to-end fashion (Lockard et al., 2020; Xu et al., 2020). In this way, the pre-trained models absorb cross-modal knowledge from different document types, where the local invariance among these layouts and styles is preserved. Furthermore, when the model needs to be transferred to another domain with different document formats, only a few labeled samples are sufficient to fine-tune the generic model and achieve state-of-the-art accuracy. Therefore, the proposed model in this paper follows the second direction, and we explore how to further improve the pre-training strategies for the VrDU task.

    In this paper, we present an improved version of LayoutLM (Xu et al., 2020), aka LayoutLMv2. LayoutLM is a simple but effective pre-training method of text and layout for the VrDU task. Distinct from previous text-based pre-trained models, LayoutLM uses 2-D position embeddings and image embeddings in addition to the conventional text embeddings. During the pre-training stage, two training objectives are used, which are
    1) a masked visual-language model and
    2) multi-label document classification.
    The model is pre-trained with a great number of unlabeled scanned document images from the IIT-CDIP dataset (Lewis et al., 2006), and achieves very promising results on several downstream tasks. Extending the existing research work, we propose new model architectures and pre-training objectives in the LayoutLMv2 model.
    Different from the vanilla LayoutLM model, where image embeddings are combined in the fine-tuning stage, we integrate the image information in the pre-training stage of LayoutLMv2 by taking advantage of the Transformer architecture to learn the cross-modality interaction between visual and textual information. In addition, inspired by the 1-D relative position representations (Shaw et al., 2018; Raffel et al., 2020; Bao et al., 2020), we propose the spatial-aware self-attention mechanism for LayoutLMv2, which involves a 2-D relative position representation for token pairs. Different from the absolute 2-D position embeddings, the relative position embeddings explicitly provide a broader view for contextual spatial modeling. For the pre-training strategies, we use two new training objectives for LayoutLMv2 in addition to the masked visual-language model. The first is the proposed text-image alignment strategy, which covers text lines in the image and makes predictions on the text side to classify whether a token is covered or not on the image side.
    The second is the text-image matching strategy that is popular in previous vision-language pre-training models (Tan & Bansal, 2019; Lu et al., 2019; Su et al., 2020; Chen et al., 2020; Sun et al., 2019), where some images in the text-image pairs are randomly replaced with another document image to make the model learn whether the image and the OCR texts are correlated. In this way, LayoutLMv2 is more capable of learning contextual textual and visual information and the cross-modal correlation in a single framework, which leads to better VrDU performance. We select 6 publicly available benchmark datasets as the downstream tasks to evaluate the performance of the pre-trained LayoutLMv2 model, which are the FUNSD dataset (Jaume et al., 2019) for form understanding, the CORD dataset (Park et al., 2019) and the SROIE dataset (Huang et al., 2019) for receipt understanding, the Kleister-NDA dataset (Gralinski et al., 2020) for long document understanding with complex layout, the RVL-CDIP dataset (Harley et al., 2015) for document image classification, as well as the DocVQA dataset (Mathew et al., 2020) for visual question answering on document images. Experiment results show that the LayoutLMv2 model outperforms strong baselines including the vanilla LayoutLM and achieves new state-of-the-art results in these downstream VrDU tasks, which substantially benefits a great number of real-world document understanding tasks.

    The contributions of this paper are summarized as follows:

    • We propose a multi-modal Transformer model to integrate the document text, layout and image information in the pre-training stage, which learns the cross-modal interaction end-to-end in a single framework.
    • In addition to the masked visual-language model, we also add text-image matching and text-image alignment as the new pre-training strategies to enforce the alignment among different modalities. Meanwhile, a spatial-aware self-attention mechanism is also integrated into the Transformer architecture.
    • LayoutLMv2 not only outperforms the baseline models on the conventional VrDU tasks, but also achieves new SOTA results on the VQA task for document images, which demonstrates the great potential of multi-modal pre-training for VrDU. The pre-trained LayoutLMv2 model is publicly available at https://aka.ms/layoutlmv2.

    2 APPROACH
    The overall illustration of the proposed LayoutLMv2 is shown in Figure 2. In this section, we will introduce the model architecture and pre-training tasks of the LayoutLMv2.

    2.1 MODEL ARCHITECTURE
    We build an enhanced Transformer architecture for the VrDU tasks, i.e. the multi-modal Transformer as the backbone of LayoutLMv2. The multi-modal Transformer accepts inputs of three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder. The model establishes deep interactions within and between modalities by leveraging the powerful Transformer layers. The model details are introduced as follows, where some dropout and normalization layers are omitted.
    Text Embedding We recognize text and serialize it in a reasonable reading order using off-the-shelf OCR tools and PDF parsers. Following the common practice, we use WordPiece (Wu et al., 2016) to tokenize the text sequence and assign each token to a certain segment si ∈ {[A], [B]}. Then, we add a [CLS] at the beginning of the token sequence and a [SEP] at the end of each text segment. The length of the text sequence is limited to ensure that the length of the final sequence is not greater than the maximum sequence length L. Extra [PAD] tokens are appended after the last [SEP] token to fill the gap if the token sequence is still shorter than L tokens. In this way, we get the input token sequence like
    S = {[CLS], w1, w2, …, [SEP], [PAD], [PAD], …}, |S| = L
    The final text embedding is the sum of three embeddings. The token embedding represents the token itself, the 1D positional embedding represents the token index, and the segment embedding is used to distinguish different text segments. Formally, we have the i-th text embedding
    ti = TokEmb(wi) + PosEmb1D(i) + SegEmb(si), 0 ≤ i < L
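
    As a rough illustration of the text embedding above, the following is a minimal PyTorch sketch; the class name, vocabulary size, hidden size, and number of segments are illustrative assumptions, not the released configuration:

```python
import torch
import torch.nn as nn


class TextEmbedding(nn.Module):
    """Sum of token, 1D positional, and segment embeddings (a minimal sketch;
    all sizes here are illustrative assumptions)."""

    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512, num_segments=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_emb_1d = nn.Embedding(max_len, hidden_size)    # shared with visual tokens
        self.seg_emb = nn.Embedding(num_segments, hidden_size)  # segments [A], [B], [C]

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok_emb(token_ids)
                + self.pos_emb_1d(positions)[None, :, :]
                + self.seg_emb(segment_ids))
```
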
    Visual Embedding We use ResNeXt-FPN (Xie et al., 2016; Lin et al., 2017) architecture as the backbone of the visual encoder. Given a document page image I, it is resized to 224 × 224 then fed into the visual backbone. After that, the output feature map is average-pooled to a fixed size with the width being W and height being H. Next, it is flattened into a visual embedding sequence of length WH. A linear projection layer is then applied to each visual token embedding in order to unify the dimensions. Since the CNN-based visual backbone cannot capture the positional information, we also add a 1D positional embedding to these image token embeddings. The 1D positional embedding is shared with the text embedding layer. For the segment embedding, we attach all visual tokens to the visual segment [C]. The i-th visual embedding can be represented as
    vi = Proj(VisTokEmb(I)i) + PosEmb1D(i) + SegEmb([C]), 0 ≤ i < WH
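
    The visual path can be sketched in the same spirit; here a tiny convolutional stack stands in for the ResNeXt101-FPN backbone, and the grid size, dimensions, and class name are assumptions for illustration only:

```python
import torch
import torch.nn as nn


class VisualEmbedding(nn.Module):
    """Visual token path: backbone -> adaptive average pooling to a grid ->
    flatten to W*H tokens -> linear projection, plus the shared 1D positional
    embedding and the visual segment embedding [C]."""

    def __init__(self, hidden_size=768, grid=7, max_len=512, backbone_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(                  # placeholder for ResNeXt-FPN
            nn.Conv2d(3, backbone_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))
        self.proj = nn.Linear(backbone_dim, hidden_size)
        self.pos_emb_1d = nn.Embedding(max_len, hidden_size)  # shared with the text side
        self.seg_emb = nn.Embedding(3, hidden_size)           # [A]=0, [B]=1, [C]=2

    def forward(self, images):
        # images: (batch, 3, 224, 224)
        feat = self.pool(self.backbone(images))          # (batch, C, grid, grid)
        feat = feat.flatten(2).transpose(1, 2)           # (batch, grid*grid, C)
        tokens = self.proj(feat)                         # (batch, WH, hidden)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        segment_c = torch.full_like(positions, 2)        # all visual tokens belong to [C]
        return tokens + self.pos_emb_1d(positions) + self.seg_emb(segment_c)
```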

    Layout Embedding The layout embedding layer aims to embed the spatial layout information represented by token bounding boxes, in which the corner coordinates and box shape are identified explicitly.
    Following the vanilla LayoutLM, we normalize and discretize all coordinates to integers in the range [0, 1000], and use two embedding layers to embed x-axis features and y-axis features separately.
    Given the normalized bounding box of the i-th text/visual token boxi = (x0, x1, y0, y1, w, h), the layout embedding layer concatenates six bounding box features to construct a token-level layout embedding, aka the 2D positional embedding
    li = Concat(PosEmb2Dx(x0, x1, w), PosEmb2Dy(y0, y1, h)), 0 ≤ i < WH + L
    Note that CNNs perform local transformation, thus the visual token embeddings can be mapped back to image regions one by one with neither overlap nor omission.

    In the view of the layout embedding layer, the visual tokens can be treated as evenly divided grids, so their bounding box coordinates are easy to calculate.

    An empty bounding box boxPAD = (0, 0, 0, 0, 0, 0) is attached to the special tokens [CLS], [SEP] and [PAD].
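
    A minimal sketch of this 2D positional embedding is given below, assuming coordinates have already been normalized to integers in [0, 1000]; the per-feature dimension (hidden_size // 6) and the class name are illustrative choices:

```python
import torch
import torch.nn as nn


class LayoutEmbedding(nn.Module):
    """2D positional embedding: embed (x0, x1, w) with an x-axis table and
    (y0, y1, h) with a y-axis table, then concatenate the six vectors."""

    def __init__(self, hidden_size=768, max_coord=1001):
        super().__init__()
        coord_dim = hidden_size // 6          # so that 6 * coord_dim == hidden_size
        self.x_emb = nn.Embedding(max_coord, coord_dim)
        self.y_emb = nn.Embedding(max_coord, coord_dim)

    def forward(self, boxes):
        # boxes: (batch, seq_len, 6) integer tensor with columns (x0, x1, y0, y1, w, h)
        x0, x1, y0, y1, w, h = boxes.unbind(dim=-1)
        return torch.cat(
            [self.x_emb(x0), self.x_emb(x1), self.x_emb(w),
             self.y_emb(y0), self.y_emb(y1), self.y_emb(h)],
            dim=-1,
        )
```

    For visual tokens, the boxes would come from the evenly divided image grid described above, and boxPAD would be passed for [CLS], [SEP] and [PAD].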

    Multi-modal Encoder with Spatial-Aware Self-Attention Mechanism The encoder concatenates the visual embeddings {v0, …, vWH−1} and text embeddings {t0, …, tL−1} into a unified sequence X and fuses spatial information by adding the layout embeddings to get the first-layer input
    xi(0) = Xi + li, 0 ≤ i < WH + L

    Following the architecture of Transformer, we build our multi-modal encoder with a stack of multi-head self-attention layers followed by a feed-forward network. However, the original self-attention mechanism can only implicitly capture the relationship between the input tokens with the absolute position hints. In order to efficiently model local invariance in the document layout, it is necessary to insert relative position information explicitly. Therefore, we introduce the spatial-aware self-attention mechanism into the self-attention layers. The original self-attention mechanism captures the correlation between query xi and key xj by projecting the two vectors and calculating the attention score
    αij = (1 / √dhead) (xi WQ)(xj WK)⊤

    We jointly model the semantic relative position and spatial relative position as bias terms and explicitly add them to the attention score. Let b(1D), b(2Dx) and b(2Dy) denote the learnable 1D and 2D relative position biases respectively. The biases are different among attention heads but shared in all encoder layers. Assuming (xi, yi) anchors the top left corner coordinates of the i-th bounding box, we obtain the spatial-aware attention score
    α′ij = αij + b(1D)(j − i) + b(2Dx)(xj − xi) + b(2Dy)(yj − yi)

    Finally, the output vectors are represented as the weighted average of all the projected value vectors with respect to the normalized spatial-aware attention scores
    hi = Σj ( exp(α′ij) / Σk exp(α′ik) ) xj WV
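
    A single-head sketch of the spatial-aware attention is shown below; it simplifies the handling of relative offsets to plain clipping and treats the biases as per-head scalar lookups, so it should be read as an illustration of the idea rather than the exact released implementation:

```python
import torch
import torch.nn as nn


class SpatialAwareSelfAttention(nn.Module):
    """One attention head with learnable 1D and 2D relative position biases
    added to the scaled dot-product score (offsets are simply clipped here)."""

    def __init__(self, hidden_size=768, head_dim=64, max_rel_1d=128, max_rel_2d=256):
        super().__init__()
        self.q = nn.Linear(hidden_size, head_dim)
        self.k = nn.Linear(hidden_size, head_dim)
        self.v = nn.Linear(hidden_size, head_dim)
        self.scale = head_dim ** -0.5
        # bias tables indexed by (clipped offset + max offset)
        self.b_1d = nn.Embedding(2 * max_rel_1d + 1, 1)
        self.b_2dx = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.b_2dy = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.max_rel_1d, self.max_rel_2d = max_rel_1d, max_rel_2d

    def _bias(self, table, delta, max_rel):
        idx = delta.clamp(-max_rel, max_rel) + max_rel   # shift into [0, 2*max_rel]
        return table(idx).squeeze(-1)                    # (batch, seq, seq)

    def forward(self, x, pos_1d, box_xy):
        # x: (batch, seq, hidden); pos_1d: (batch, seq) token indices;
        # box_xy: (batch, seq, 2) top-left corner coordinates (integers)
        scores = torch.einsum("bid,bjd->bij", self.q(x), self.k(x)) * self.scale
        scores = scores + self._bias(self.b_1d,
                                     pos_1d[:, None, :] - pos_1d[:, :, None],
                                     self.max_rel_1d)
        scores = scores + self._bias(self.b_2dx,
                                     box_xy[..., 0][:, None, :] - box_xy[..., 0][:, :, None],
                                     self.max_rel_2d)
        scores = scores + self._bias(self.b_2dy,
                                     box_xy[..., 1][:, None, :] - box_xy[..., 1][:, :, None],
                                     self.max_rel_2d)
        attn = scores.softmax(dim=-1)                    # normalized spatial-aware scores
        return torch.einsum("bij,bjd->bid", attn, self.v(x))
```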

    2.2 PRE-TRAINING
    We adopt three self-supervised tasks simultaneously during the pre-training stage, which are described as follows.

    Masked Visual-Language Modeling Similar to the vanilla LayoutLM, we use Masked Visual-Language Modeling (MVLM) to make the model learn better on the language side with cross-modality clues. We randomly mask some text tokens and ask the model to recover the masked tokens. Meanwhile, the layout information remains unchanged, which means the model knows each masked token's location on the page. The output representations of the masked tokens from the encoder are fed into a classifier over the whole vocabulary, driven by a cross-entropy loss. To avoid visual clue leakage, we mask the image regions corresponding to masked tokens on the raw page image before feeding it into the visual encoder. MVLM helps the model capture features of nearby tokens. For instance, a masked blank in a table surrounded by lots of numbers is more likely to be a number. Moreover, given the spatial position of a blank, the model is capable of using the surrounding visual information to help predict the token.
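
    The MVLM input corruption can be sketched as follows for a single example; the 15%/80%/10%/10% rates follow the settings reported in Section 3.2, while the function name, the [MASK] token id, and the use of -100 as an ignore index are assumptions for illustration:

```python
import torch


def apply_mvlm_masking(token_ids, boxes, image, vocab_size,
                       mask_token_id=103, mask_prob=0.15):
    """Corrupt one example for MVLM: select 15% of tokens; replace 80% of them
    with [MASK], 10% with a random token, keep 10%. The layout is untouched,
    but the selected tokens' image regions are blanked so the visual side
    cannot leak the answer. token_ids: (seq,), boxes: (seq, 4) pixel
    (x0, y0, x1, y1), image: (3, H, W)."""
    token_ids, image = token_ids.clone(), image.clone()
    labels = torch.full_like(token_ids, -100)            # -100: ignored by the CE loss
    selected = torch.rand(token_ids.shape) < mask_prob
    labels[selected] = token_ids[selected]

    roll = torch.rand(token_ids.shape)
    token_ids[selected & (roll < 0.8)] = mask_token_id
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)
    token_ids[rand_pos] = torch.randint(vocab_size, (int(rand_pos.sum()),))

    for x0, y0, x1, y1 in boxes[selected].tolist():      # cover regions on the raw page
        image[:, int(y0):int(y1), int(x0):int(x1)] = 0
    return token_ids, labels, image
```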

    Text-Image Alignment In addition to the MVLM, we propose Text-Image Alignment (TIA) as a fine-grained cross-modality alignment task. In the TIA task, some text tokens are randomly selected, and their image regions are covered on the document image.
    We call this operation covering to avoid confusion with the masking operation in MVLM. During the pre-training, a classification layer is built above the encoder outputs. This layer predicts a label for each text token depending on whether it is covered, i.e., [Covered] or [Not Covered], and computes the binary cross-entropy loss. Considering that the input image's resolution is limited, the covering operation is performed at the line level. When MVLM and TIA are performed simultaneously, the TIA losses of the tokens masked in MVLM are not taken into account. This prevents the model from learning the useless but straightforward correspondence from [MASK] to [Covered].
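
    A sketch of the TIA input construction is given below; the helper name and tensor layout are hypothetical, and it assumes line-level boxes in pixel coordinates plus a token-to-line mapping:

```python
import torch


def apply_tia_covering(line_boxes, token_line_ids, image, mvlm_masked,
                       cover_prob=0.15):
    """Cover 15% of text lines on the page image and build token-level TIA
    labels (1 = [Covered], 0 = [Not Covered]); tokens already masked by MVLM
    get label -100 so the trivial [MASK] -> [Covered] shortcut is not learned.
    line_boxes: (num_lines, 4) pixel boxes, token_line_ids: (seq,) line index
    per token, mvlm_masked: (seq,) boolean mask."""
    image = image.clone()
    covered_lines = torch.rand(len(line_boxes)) < cover_prob
    for (x0, y0, x1, y1), covered in zip(line_boxes.tolist(), covered_lines.tolist()):
        if covered:
            image[:, int(y0):int(y1), int(x0):int(x1)] = 0   # cover the whole line

    tia_labels = covered_lines[token_line_ids].long()        # per-token labels
    tia_labels[mvlm_masked] = -100                           # excluded from the TIA loss
    return image, tia_labels
```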

    Text-Image Matching Furthermore, a coarse-grained cross-modality alignment task, Text-Image Matching (TIM), is applied during the pre-training stage. We feed the output representation at [CLS] into a classifier to predict whether the image and text are from the same document page. Regular inputs are positive samples. To construct a negative sample, an image is either replaced by a page image from another document or dropped. To prevent the model from cheating by finding task features, we perform the same masking and covering operations on the images in negative samples. The TIA target labels are all set to [Covered] in negative samples. We apply the binary cross-entropy loss in the optimization process.
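
    The TIM sample construction can be sketched as follows, using the 15% replace / 5% drop rates from Section 3.2; representing a dropped image as a zero tensor is an assumption of this sketch:

```python
import random

import torch


def build_tim_sample(image, other_images, replace_prob=0.15, drop_prob=0.05):
    """Build one TIM training pair: with probability 0.15 swap in a page image
    from another document, with probability 0.05 drop the image (represented
    here as a zero tensor); the label is 1 for a matched pair and 0 otherwise."""
    r = random.random()
    if r < replace_prob:                         # negative: another document's page
        return random.choice(other_images), 0
    if r < replace_prob + drop_prob:             # negative: image dropped
        return torch.zeros_like(image), 0
    return image, 1                              # positive: matched text-image pair
```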

    2.3 FINE-TUNING
    LayoutLMv2 produces representations with fused cross-modality information, which benefits a variety of VrDU tasks. Its output sequence provides representations at the token level. Specifically, the output at [CLS] can be used as the global feature. For many downstream tasks, we only need to build a task-specific head layer over the LayoutLMv2 outputs and fine-tune the whole model using an appropriate loss. In this way, LayoutLMv2 leads to much better VrDU performance by integrating the text, layout, and image information in a single multi-modal framework, which significantly improves the cross-modal correlation compared to the vanilla LayoutLM model.
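
    For sequence labeling tasks such as FUNSD entity labeling, the task-specific head can be as simple as the following sketch; the number of labels and the dropout rate are illustrative assumptions:

```python
import torch.nn as nn


class TokenClassificationHead(nn.Module):
    """Dropout + linear classifier over each token representation, trained with
    cross-entropy; the number of labels and dropout rate are placeholders."""

    def __init__(self, hidden_size=768, num_labels=7, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):
        # sequence_output: (batch, seq_len, hidden) from the LayoutLMv2 encoder
        return self.classifier(self.dropout(sequence_output))
```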

    3 EXPERIMENTS

    3.1 DATA
    In order to pre-train and evaluate LayoutLMv2 models, we select datasets from a wide range of the visually-rich document understanding area. Introductions to the datasets and task definitions, along with descriptions of the required data pre-processing, are presented as follows.

    FUNSD FUNSD (Jaume et al., 2019) is a dataset for form understanding in noisy scanned documents. It contains 199 real, fully annotated, scanned forms where 9,707 semantic entities are annotated above 31,485 words. The 199 samples are split into 149 for training and 50 for testing. The official OCR annotation is directly used with the layout information. The FUNSD dataset is suitable for a variety of tasks, where we focus on semantic entity labeling in this paper. Specifically, the task is assigning to each word a semantic entity label from a set of four predefined categories: question, answer, header or other. The entity-level F1 score is used as the evaluation metric.

    3.2 SETTINGS
    Following the typical pre-training and fine-tuning strategy, we update all parameters and train whole models end-to-end for all the settings.
    Pre-training LayoutLMv2
    We train LayoutLMv2 models with two different parameter sizes.
    We set the hidden size d = 768 in LayoutLMv2BASE and use a 12-layer, 12-head Transformer encoder. In LayoutLMv2LARGE, d = 1024 and the encoder has 24 Transformer layers with 16 heads.
    Visual backbones in the two models use the same ResNeXt101-FPN architecture. The numbers of parameters are 200M and 426M approximately for LayoutLMv2BASE and LayoutLMv2LARGE, respectively.
    The model is initialized from the existing pre-trained model checkpoints. For the encoder along with the text embedding layer, LayoutLMv2 uses the same architecture as UniLMv2 (Bao et al., 2020), thus it is initialized from UniLMv2.
    For the ResNeXt-FPN part in the visual embedding layer, the backbone of a Mask-RCNN (He et al., 2017) model trained on PubLayNet (Zhong et al., 2019) is leveraged. The rest of the parameters in the model are randomly initialized. We pre-train LayoutLMv2 models using the Adam optimizer (Kingma & Ba, 2017; Loshchilov & Hutter, 2019), with a learning rate of 2 × 10−5, weight decay of 1 × 10−2, and (β1, β2) = (0.9, 0.999). The learning rate is linearly warmed up over the first 10% of steps and then linearly decayed. LayoutLMv2BASE is trained with a batch size of 64 for 5 epochs, and LayoutLMv2LARGE is trained with a batch size of 2048 for 20 epochs on the IIT-CDIP dataset.
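
    The optimization setup described above can be sketched as follows; using AdamW with a LambdaLR schedule is an assumption about how the linear warmup and decay would be realized:

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model, total_steps, lr=2e-5,
                                  weight_decay=1e-2, warmup_ratio=0.1):
    """Adam-style optimizer with decoupled weight decay, betas (0.9, 0.999),
    and a learning rate that warms up linearly over the first 10% of steps
    and then decays linearly to zero."""
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999),
                      weight_decay=weight_decay)
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```
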
    During the pre-training, we sample pages from the IIT-CDIP dataset and select a random sliding window of the text sequence if the sample is too long. We set the maximum sequence length L = 512 and assign all text tokens to segment [A]. The output shape of the adaptive pooling layer is set to W = H = 7, so that it transforms the feature map into 49 image tokens. In MVLM, 15% of text tokens are masked, among which 80% are replaced by the special token [MASK], 10% are replaced by a random token sampled from the whole vocabulary, and 10% remain the same. In TIA, 15% of the lines are covered. In TIM, 15% of images are replaced and 5% are dropped.

    Fine-tuning LayoutLMv2 for Visual Question Answering We treat DocVQA as an extractive QA task and build a token-level classifier on top of the text part of the LayoutLMv2 output representations. Question tokens, context tokens and visual tokens are assigned to segments [A], [B] and [C], respectively. In the DocVQA paper, experiment results show that a BERT model fine-tuned on the SQuAD dataset (Rajpurkar et al., 2016) outperforms the original BERT model. Inspired by this, we add an extra setting in which we first fine-tune LayoutLMv2 on a Question Generation (QG) dataset and then on the DocVQA dataset. The QG dataset contains almost one million question-answer pairs generated by a generation model trained on the SQuAD dataset.
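
    A sketch of such an extractive QA head is shown below; the linear start/end classifier and the span-selection helper mirror the standard SQuAD-style setup and are not the released DocVQA fine-tuning code:

```python
import torch
import torch.nn as nn


class ExtractiveQAHead(nn.Module):
    """Linear layer producing start/end logits for each text token."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, text_hidden_states):
        # text_hidden_states: (batch, text_len, hidden) -- visual tokens excluded
        start_logits, end_logits = self.qa_outputs(text_hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)


def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the highest-scoring (start, end) pair with start <= end and a
    bounded answer length, for a single example of shape (text_len,)."""
    n = start_logits.size(0)
    scores = start_logits[:, None] + end_logits[None, :]
    valid = torch.ones(n, n, dtype=torch.bool).triu().tril(max_answer_len - 1)
    scores = scores.masked_fill(~valid, float("-inf"))
    idx = int(scores.argmax())
    return idx // n, idx % n
```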