image.png
Accepted to WACV 2021

Preface

This paper's overall idea is fairly direct, mainly targeting two points:

  • Targeted optimization of detail features
  • Stronger adaptability to low-quality trimaps

The implementation of the first point is quite similar to the recent MODNet, but it is not actually the same.

  • Although both perform detail refinement on a separate high-resolution branch, the specifics differ: this paper does not introduce much of the semantic features emphasized in MODNet, and instead mostly uses shallow features from the semantic branch's encoder.
  • The way the two branches are combined is also different. This paper emphasizes integrating the two branches' predictions, while MODNet emphasizes feature-level interaction and cooperation between branches (mutual supervision constraints that form a self-supervised strategy for adapting to data from new domains, giving better transfer to real data).
  • The directions for improving transfer to real-world data also differ.
    • This paper is still a trimap-based model and depends on a user-provided trimap, but it is more tolerant of trimap quality. Its main means are a trimap augmentation strategy during training and, in the loss design, dedicated supervision of the background region in addition to the supervision of the unknown region.
    • MODNet, in contrast, is trimap-free, so the transfer it targets is more general image inputs. Besides using the IBN strategy in the model design itself (i.e. using Instance Normalization and Batch Normalization together to strengthen domain adaptation), it can also be trained self-supervised on new datasets to further improve transfer.
  • The models' emphases also differ: MODNet stresses speed and can process video, while this paper still looks more like a model built around the dataset.

    image.png

Problem addressed

The encoder-decoder structures commonly used today downsample aggressively at an early stage, which is unfriendly to recovering the detail information that matting badly needs.

We argue that recovering these microscopic details relies on low-level but high-definition texture features. However, these features are downsampled in a very early stage in current encoder-decoder-based models, resulting in the loss of microscopic details.

  • Although some previous encoder-decoder based works have explored restoring the lost details with postprocessing or refinement modules, such as extra convolution layers [Deep image matting] and LSTM units [Disentangled image matting] in a cascading manner, it is very hard to reconstruct already-lost fine-grained details.
  • In addition, some simply stacked structures bring extra difficulties to training, e.g. preventing the network from being trained end-to-end.

To address this issue, we design a deep image matting model that enhances fine-grained details and keeps as many spatial details as possible for image matting.

Proposed method

image.png
Note that the method in this paper is not a trimap-free structure; it is still a trimap-based method.
image.png
The TCP structure, mainly targeting fine texture information

A model that effectively compensates for detail information

  • The model consists of two parallel paths

    • a conventional encoder-decoder Semantic Path
      • the encoder part is built on ResNet-34, and the decoder part is built as a mirror structure of the encoder.
    • an independent downsampling-free Textural Compensate Path (TCP)
      • to extract fine-grained details such as lines and edges in the original image size, which greatly enhances the fineness of prediction
      • to extract pixel-to-pixel high-definition information from features whose size is the same as the original image, aiming to compensate for the pixel-to-pixel features lost to the early downsampling in the encoder-decoder architecture of the Semantic Path.

Textural Compensate Path (TCP)

  • Structure

    • The spatial feature extraction unit, formed by one convolution layer followed by two residual blocks, aims to extract rich pixel-level structural features. This module is downsampling-free, so its output size is H×W. At the same time, intermediate features from the Semantic Path are taken and resized to H×W, the same as the output of the spatial feature extraction unit.
    • Next, these two sets of features are sent to the Feature Fusion Unit (FFU) to leverage the benefits of high-level context. This step provides multi-scale and pretrained information in addition to the pixel-level spatial features.
      • In order to introduce multi-scale features while keeping the parameter count controllable, we borrow intermediate features from the semantic path as multi-scale features. At the same time, to ensure that the TCP focuses on low-level features, the features used for fusion are taken from a very shallow layer: the second layer of the U-Net semantic path. They are first resized to the original image size using nearest interpolation.
      • Since the feature representations in the two paths can also be very different, simply adding features from different paths can be harmful to training. Thus, we multiply the features from the semantic path by a learnable weight w_c to control their influence.
    • Then, the fused features are sent to the feature refinement unit, which consists of 2 convolution layers, generating the output of the TCP.
  • The proposed network takes a 6-channel map as input.

    • formed by the concatenation of the 3-channel RGB image and the corresponding one-hot encoded 3-channel trimap.
    • The input is sent to the semantic path and textural compensate path simultaneously, where each path generates a one-channel output.
    • Then, the tanh of the sum of the two outputs is the output of the network, i.e., the predicted alpha matte.
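The data flow described above can be sketched roughly as follows. This is a minimal numpy-only sketch, not the authors' implementation: the path bodies are stand-in stubs, treating w_c as a plain Python scalar is an assumption, and all function names are mine.

```python
import numpy as np

def nearest_resize(feat, out_h, out_w):
    """Nearest-neighbor resize of a (C, h, w) feature map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows][:, :, cols]

def feature_fusion_unit(tcp_feat, sem_feat, w_c):
    """FFU: upsample shallow semantic features to the TCP's full resolution and
    add them to the TCP features, scaled by the learnable weight w_c."""
    _, h, w = tcp_feat.shape
    return tcp_feat + w_c * nearest_resize(sem_feat, h, w)

def one_hot_trimap(trimap):
    """Encode an (H, W) trimap with labels {0: bg, 1: unknown, 2: fg} as 3 channels."""
    return np.stack([(trimap == k).astype(np.float64) for k in range(3)])

def predict_alpha(rgb, trimap, semantic_path, textural_path):
    """Both paths receive the same 6-channel input (RGB + one-hot trimap) and
    each returns a (1, H, W) map; the predicted alpha matte is the tanh of
    their sum, as stated in the note above."""
    x = np.concatenate([rgb, one_hot_trimap(trimap)], axis=0)  # (6, H, W)
    return np.tanh(semantic_path(x) + textural_path(x))
```

In a real network the two path stubs would be the ResNet-34-based encoder-decoder and the downsampling-free TCP, and w_c would be a trained parameter.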

提升对trimap的稳健性 — Improving robustness to trimaps

The reality faced

Trimap is usually supposed to be provided by users. However, the currently most widely-used dataset, the Adobe Deep Image Matting Dataset, does not provide trimaps for training and requires models to generate trimaps by themselves. In practice, user-provided trimaps might be very coarse, because annotating a trimap is a very bothersome process, especially for unprofessional users.
We have observed that:

  • for a number of images in the Composition-1k testing set, nearly the whole trimap is annotated as “unknown region”, which means the trimap is very coarse and can hardly provide any useful interaction information.

  • In contrast, for the training set, model-generated trimaps are usually based on the ground-truth alpha map, resulting in very high quality. This causes inconsistencies between training and testing.

Here we propose two methods to make the model more robust in handling different kinds of trimaps.

a trimap generation method

image.png
The current trimap generation strategy: threshold the corresponding ground-truth alpha map to determine the initial regions of the trimap's three classes, then dilate the unknown region to enlarge it (i.e., erode the foreground and background).

  • Problems to face here
    • An overly large dilation kernel damages contextual information
    • An overly small one enlarges the inconsistency between trimaps at training and testing time
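The generation strategy above can be sketched in numpy as follows. This is a sketch under assumptions: the exact fg/bg thresholds (alpha == 0 and alpha == 1) are my choice, and a real implementation would likely use cv2.dilate instead of the naive loop here.

```python
import numpy as np

def dilate(mask, k):
    """Binary dilation of an (H, W) boolean mask with a k x k square kernel
    (naive sliding-window OR; cv2.dilate would do this efficiently)."""
    r = k // 2
    padded = np.pad(mask, r)
    out = np.zeros(mask.shape, dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def generate_trimap(alpha_gt, kernel_size):
    """Threshold the ground-truth alpha into fg / bg / unknown, then dilate the
    unknown region (i.e. erode fg and bg). Labels: 0 = bg, 1 = unknown, 2 = fg."""
    unknown = dilate((alpha_gt > 0) & (alpha_gt < 1), kernel_size)
    trimap = np.ones(alpha_gt.shape, dtype=np.uint8)   # default: unknown
    trimap[(alpha_gt >= 1) & ~unknown] = 2             # absolute foreground
    trimap[(alpha_gt <= 0) & ~unknown] = 0             # absolute background
    return trimap
```

A larger kernel_size shrinks the absolute fg/bg regions, which is exactly the trade-off the two bullets above describe.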

To alleviate this problem, a data augmentation strategy targeting the trimap is proposed:

  • When training
    • trimaps are first generated by the process mentioned above with a relatively small erosion kernel size (randomly chosen between 1x1 and 30x30) to keep more contextual information. This trimap is used as part of the input of the semantic path.
    • Next, we apply n extra steps of random morphological operations to the unknown region of the semantic-path trimap to simulate the randomness of noisy trimaps provided by users.
      • Each step is randomly chosen to be a p-iteration erosion or a p-iteration dilation
      • where n and p are random numbers between 0 and 3.
      • For each step, the kernel size is randomly chosen from 1x1 to 30x30 for dilation and from 1x1 to 10x10 for erosion.
      • This noisier trimap is used as the input of the textural compensate path.
  • Then, at inference time, the user-provided trimap is used for both paths.
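The augmentation steps above can be sketched as follows. One point the note leaves open is how pixels that drop out of the unknown region under erosion are re-labeled; as an assumption, this sketch falls back to fg/bg by thresholding the ground-truth alpha (available at training time) at 0.5. All names are mine.

```python
import random

import numpy as np

def morph(mask, k, op):
    """One binary dilation ('dil') or erosion ('ero') of an (H, W) boolean mask
    with a k x k square kernel (naive sliding-window OR / AND)."""
    r = k // 2
    pad_val = op == 'ero'          # erosion pads with True, dilation with False
    padded = np.pad(mask, r, constant_values=pad_val)
    out = np.full(mask.shape, pad_val)
    for dy in range(k):
        for dx in range(k):
            win = padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
            out = (out & win) if op == 'ero' else (out | win)
    return out

def augment_trimap(trimap, alpha_gt, rng=random):
    """Apply n extra random morphological steps to the unknown region (label 1),
    each step a p-iteration erosion or dilation, with n and p random in [0, 3];
    kernel sizes go up to 30x30 for dilation and 10x10 for erosion, as above.
    Pixels leaving the unknown region are re-labeled by thresholding alpha_gt
    at 0.5 -- that fallback is an assumption, not from the paper."""
    unknown = trimap == 1
    for _ in range(rng.randint(0, 3)):               # n steps
        op = rng.choice(['dil', 'ero'])
        k = rng.randint(1, 30 if op == 'dil' else 10)
        for _ in range(rng.randint(0, 3)):           # p iterations per step
            unknown = morph(unknown, k, op)
    return np.where(unknown, 1, np.where(alpha_gt >= 0.5, 2, 0)).astype(np.uint8)
```

The noisier trimap produced this way feeds the textural compensate path, while the semantic path keeps the cleaner one.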

a novel term in the loss function

The loss commonly used for alpha-map prediction is the alpha-prediction loss from [Deep image matting]:
image.png
U denotes the unknown region in the trimap, i.e., this loss only attends to the unknown region.
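This loss can be sketched as the Charbonnier-style absolute difference averaged over unknown pixels (epsilon = 1e-6, as in Deep Image Matting; the trimap encoding with label 1 for unknown is my convention):

```python
import numpy as np

def alpha_prediction_loss(alpha_pred, alpha_gt, trimap, eps=1e-6):
    """Alpha-prediction loss from Deep Image Matting: a Charbonnier-style
    absolute difference, averaged only over the unknown region U."""
    unknown = trimap == 1
    diff = np.sqrt((alpha_pred - alpha_gt) ** 2 + eps ** 2)
    return diff[unknown].sum() / max(unknown.sum(), 1)
```

Pixels outside U contribute nothing, which is exactly the property criticized below.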
But this loss has a problem:

One thing to notice here is that this alpha-prediction loss only considers the unknown region in the trimap and ignores the contents of the absolute foreground and background regions. This characteristic makes the network easier to train, because it reduces the solution space: after prediction, the absolute background and foreground are filled with the value 0 or 1 according to the trimap. However, this brings a significant drawback: lots of contextual information is lost, making it hard for the network to handle the “pure” background inside the unknown region. Some works address this issue by deriving a more accurate trimap [Disentangled image matting]. However, this brings extra complexity to the network design.

Instead, we propose another auxiliary loss, the Background Enhancement Loss. This loss term recognizes the “pure” background inside the unknown region and makes use of these areas to give contextual guidance to the network.
image.png

  • Rbg is the “absolute” background part inside the unknown region
  • Nbg is the number of pixels of Rbg
  • the background threshold in the formula above controls the size of Rbg

The full loss of the network is then a weighted sum of the two loss terms:
image.png
image.png
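Based on the definitions above, the auxiliary term and the full loss can be sketched as follows. This is a sketch under assumptions: Rbg is taken as the unknown pixels whose ground-truth alpha falls below a threshold tau, the penalty pulls the predicted alpha toward 0 with the same Charbonnier form as the alpha-prediction loss, and the weight lam is a hypothetical value, not the paper's.

```python
import numpy as np

def background_enhancement_loss(alpha_pred, alpha_gt, trimap, tau=0.01, eps=1e-6):
    """Background Enhancement Loss sketch: R_bg is the 'pure' background inside
    the unknown region (label 1), taken here as unknown pixels whose ground-truth
    alpha is below tau; the predicted alpha there is pulled toward 0."""
    r_bg = (trimap == 1) & (alpha_gt < tau)
    n_bg = max(r_bg.sum(), 1)                      # N_bg: pixel count of R_bg
    return np.sqrt(alpha_pred[r_bg] ** 2 + eps ** 2).sum() / n_bg

def full_loss(alpha_pred, alpha_gt, trimap, lam=0.5):
    """Weighted sum of the alpha-prediction loss and the background enhancement
    loss; the weight lam is a hypothetical value."""
    unknown = trimap == 1
    l_alpha = (np.sqrt((alpha_pred - alpha_gt) ** 2 + 1e-12)[unknown].sum()
               / max(unknown.sum(), 1))
    return l_alpha + lam * background_enhancement_loss(alpha_pred, alpha_gt, trimap)
```

Unlike the alpha-prediction loss alone, the auxiliary term explicitly supervises background pixels hidden inside the unknown region, which is the contextual guidance the text describes.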
Note that though the dataset we currently use is synthetic, only the already-synthesized images and trimaps are used in training. This makes our network able to work on both synthetic and real-world datasets.
This sentence comes from the original paper, but it is a bit confusing. It seems to say that the robustness to trimaps lets the model perform well on both synthetic and real data.

Experimental results

image.png

Links