https://doi.org/10.1016/j.compeleceng.2020.106755
Abstract
In recent years, tracking models based on the Siamese network have been widely used in the object tracking field, modeling the tracking task as a similarity matching problem that balances tracking speed and accuracy. However, such models lack robustness, discriminative ability, and generalization ability under object deformation and complex background interference. In this paper, an improved fully-convolutional Siamese network is proposed. The Triplet loss function is used as the model objective instead of the logistic loss, and a multi-channel attention mechanism is introduced so that the model pays more attention to tracking-related information, enhancing its discriminative ability. In the offline training phase, an effective data augmentation strategy is used to control the uneven distribution of sample categories and improve the generalization ability of the model. In the tracking phase, a distractor-aware module is used to transfer the general feature representation domain to the specific object domain, further improving discriminative ability. In experiments, results on the VOT2016 tracking benchmark show that our model achieves a significant improvement over the SiamFC tracker on multiple evaluation metrics.
1. Introduction
In visual object tracking, given an arbitrary object in the first frame of a video, the goal is to locate that object in every frame of the video subsequence [1]. Object tracking is one of the most challenging and fundamental problems in computer vision, with applications such as video surveillance, autonomous vehicles, intelligent traffic monitoring, and robotic visual tracking [2]. There are two main reasons why efficient object tracking is difficult. First, the tracked object is affected by practical factors such as scale changes, rapid motion, occlusion, deformation, and background clutter [3]. Second, the tracking task usually provides the target only in the first frame of the video sequence, which limits the tracker's ability to learn the object's features. Therefore, it is difficult to build a high-performance tracking model that is both real-time and accurate [4].
In recent years, the methods successfully applied in the object tracking field fall mainly into the following three categories. The first category comprises tracking models based on correlation filters, which achieve relatively efficient performance in both tracking accuracy and speed. The correlation filter learns to distinguish the foreground object from the surrounding background by solving a ridge regression problem very efficiently [5], and it has proven very successful in tracking tasks. Its efficiency stems mainly from replacing the time-consuming convolution operation with element-wise multiplication in the fast Fourier domain, and from the relatively simple image features it uses [6]. However, an obvious disadvantage of correlation filters is that they are trained only on data from the current video, so only relatively simple models can be learned [7].
The second category is based on convolutional neural networks (CNNs). CNNs have made unprecedented progress in computer vision tasks such as image classification and object detection [8], and they have also significantly improved the state of the art in object tracking, mainly owing to their powerful feature representation ability. There are two main ways to exploit this power in visual tracking. First, a CNN is trained as a feature extractor to provide strong feature representations for the tracking model [9,10]. Second, a CNN is trained as a binary classifier on a large image dataset to separate the foreground object from the surrounding background [11]. However, CNN-based trackers rely on online training to update the network parameters, and because of the large number of parameters, online training is computationally expensive. As a result, most CNN-based trackers run much slower than real time.
The third category comprises tracking models based on the Siamese network. Recently, Siamese-network tracking has drawn much attention in the visual tracking community because it achieves a good balance between accuracy and speed. A Siamese network is a neural network architecture containing two or more identical sub-networks whose parameters are shared. SiamFC [12] models the tracking task as a similarity matching problem and adopts a fully-convolutional network as the sub-network, achieving superior performance in both speed and accuracy. Moreover, an advantage of a fully-convolutional network is that a search image larger than the exemplar can be fed to the network, rather than a search image of the same size, and the similarity of all translated sub-windows on a dense grid is computed in a single evaluation. After SiamFC, many SiamFC-based trackers were proposed. CFNet [13] adds a correlation filter to the SiamFC network to speed up tracking without compromising accuracy. RASNet [14] introduces several types of attention mechanisms to enhance discriminative ability and adaptability without updating the model online. SA-Siam [15] trains two Siamese networks, one for semantic feature extraction and one for appearance feature extraction, since the two types of features complement each other. Building on similarity learning, GOTURN [16] models tracking as a bounding-box regression problem; its network parameters are fixed during online tracking without online updates, so it reaches a tracking speed of 100 FPS, but its accuracy is relatively poor. SINT [17] combines optical flow information to provide richer feature representations; however, optical flow is computationally expensive and SINT runs at only 4 FPS. DSiam [18] enables effective online learning of target appearance variation and background suppression from previous frames, and presents element-wise multi-layer fusion to adaptively integrate network outputs using multi-level deep features. Siamese-RPN [19] and DA-SiamRPN [20] consist of a Siamese sub-network for feature extraction and a region proposal network (RPN) sub-network for bounding-box regression, achieving very strong performance in both speed and accuracy.
In this paper, we propose a real-time object tracker based on an improved fully-convolutional Siamese network; the model architecture is shown in Fig. 1.1. Compared with the traditional Siamese tracking framework, the Triplet loss function is adopted as the training objective. For the same number of inputs, the Triplet loss can be trained on more elements, and a more powerful representation can be obtained through combinations of the original samples. Moreover, a multi-channel attention mechanism is introduced; it is implemented by a feed-forward neural network, which means it can be trained in the offline phase. The attention mechanism makes the model pay more attention to information related to the specific tracked object, which enhances the tracker's discriminative power. In addition, we adopt VGG16 as the backbone CNN for feature extraction instead of the AlexNet used in most SiamFC-based trackers, since VGG-like networks can be pre-trained on large image classification datasets and then adapted from image classification to object tracking. Besides, in the offline training phase, an effective data augmentation strategy is introduced to control the uneven distribution of sample categories: through augmentation techniques (translation, resizing, grayscale, etc.), still images from detection datasets can generate image pairs for training, greatly increasing the number of sample-pair categories and improving the generalization ability of the model. In the tracking phase, a distractor-aware module is used to transfer the general feature representation domain to the specific object domain, thereby improving the model's discriminative ability.
The rest of the paper is organized as follows. We first revisit the fully-convolutional Siamese network in Section 2. Our approach is described in Section 3. The experimental results are presented in Section 4. Finally, Section 5 concludes the paper.
Fig. 1.1. Architecture of the improved fully-convolutional Siamese network. The input is a sample pair (an exemplar image and a search image) of size 127 × 127 and 255 × 255, respectively. The feature map output by the exemplar branch is processed by the channel attention module; the weighted feature map is then cross-correlated with the feature map of the search branch to obtain the score map. In the training phase, the Triplet loss is used as the objective function.

2. Revisiting the fully-convolutional Siamese network for tracking
The core idea of the fully-convolutional Siamese network for tracking is to model the tracking task as a similarity matching problem. The model architecture is shown in Fig. 2.1. Specifically, the exemplar image Z and the search image X are fed into the two branches of the network, and two feature maps are generated by the embedding function φ. A cross-correlation operation is then performed on these two feature maps to obtain a 17 × 17 × 1 score map. The pixels in the red central portion of the score map are positive scores (M positive scores), and the surrounding blue area contains negative scores (N negative scores). During the training phase, the model is trained with a logistic loss function. Next, we describe the details of SiamFC.
Fig. 2.1. Architecture of the fully-convolutional Siamese network for tracking.
2.1 Fully-convolutional Siamese network
The Siamese network structure was first proposed for the face recognition task [21]. The input to the network is a sample pair (X1, X2); X1 and X2 are fed into the two branches of the network (whose weight parameters are shared), producing the corresponding outputs G(X1) and G(X2). The Euclidean distance Ew between G(X1) and G(X2) is then computed: the smaller Ew is, the higher the probability that X1 and X2 belong to the same category, and vice versa. The fully-convolutional Siamese network was later proposed to cast the tracking problem as similarity matching learning in an embedding space. The image patch of the tracked object is typically given in the first frame of the video sequence and is referred to as the exemplar. The goal of tracking is to find, in each subsequent video frame, the image patch most similar to the exemplar in the semantic embedding space; such a patch is called a candidate image (instance). Learning a powerful embedding function is the key to this kind of problem. SiamFC applies a fully-convolutional network as the embedding function. The network input is an image pair (z, x) corresponding to the two branches, where z is the exemplar patch containing the context of the tracked object given in the first frame, and x is a candidate patch in the search image of the current frame. Because the two branches share the same parameters, they can be treated as an identical transformation φ(·) for different inputs. Denoting the outputs of the two branches as φ(z) and φ(x) respectively, the similarity function is defined as
    f(z, x) = φ(z) ∗ φ(x) + b, (1)
where b corresponds to the bias for each candidate image patch, and ∗ denotes the cross-correlation operation. The output of the network is a score map rather than a single score. In the training phase, the score map is divided into positive and negative scores, and the logistic loss is applied to train the model on the score map, formulated as:
Ll(Y, V) = Σxi∈χ wi · log(1 + exp(−yi · vi)), (2)
where Y, V, χ denote the sets of ground-truth labels, similarity scores, and instance inputs, respectively. vi = f(z, xi) is the similarity score of the pair (z, xi), and yi ∈ {+1, −1} is the ground-truth label of a single exemplar-instance pair (z, xi). wi is the balance weight for instance xi, with Σxi∈χ wi = 1 and wi > 0 for all xi ∈ χ; the balance weights are defined according to the numbers of positive and negative instances in SiamFC [22].
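To make Eqs. (1) and (2) concrete, the following NumPy sketch computes a score map by sliding the exemplar embedding over the search embedding and evaluates the weighted logistic loss on it. The feature-map shapes, bias value, and balance weights are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of Eqs. (1)-(2); shapes and weights are illustrative.
import numpy as np

def cross_correlation(phi_z, phi_x):
    """Slide phi_z (e.g. 6x6xC) over phi_x (e.g. 22x22xC) to get a score map."""
    hz, wz, _ = phi_z.shape
    hx, wx, _ = phi_x.shape
    out = np.empty((hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(phi_z * phi_x[i:i + hz, j:j + wz])
    return out  # f(z, x) before the bias term b

def logistic_loss(scores, labels, weights):
    """Eq. (2): weighted logistic loss over the score map.
    labels are +1 (center region) or -1 (background); weights sum to 1."""
    return np.sum(weights * np.log1p(np.exp(-labels * scores)))

# Toy usage with random embeddings.
phi_z = np.random.randn(6, 6, 128)
phi_x = np.random.randn(22, 22, 128)
v = cross_correlation(phi_z, phi_x) + 0.1        # 17x17 score map, assumed b = 0.1
y = -np.ones_like(v); y[6:11, 6:11] = 1          # positives in the center region
w = np.where(y > 0, 0.5 / (y > 0).sum(), 0.5 / (y < 0).sum())
print(logistic_loss(v, y, w))
```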
2.2 Insufficiency of the fully-convolutional Siamese network
Although SiamFC achieves good performance, it has some weaknesses: (1) SiamFC only exploits pairwise sample relationships (exemplar-positive and exemplar-negative pairs) and ignores the potential relationship between positive and negative instances. (2) The uneven distribution of training-sample categories leads to poor generalization. (3) The SiamFC tracker tends to perform poorly when there is intra-class interference around the tracked target.
In this paper, to solve the above main problems of SiamFC, a novel tracking model based on an improved fully-convolutional Siamese network is proposed. The improved model enhances the generalization and discriminative abilities of the tracker, alleviates the imbalance of training-sample categories, and tracks the object more robustly.
3. Improved fully-convolutional Siamese network for tracking
The architecture of the improved fully-convolutional Siamese network is shown in Fig. 1.1. In the training phase, the Triplet loss [23] replaces the logistic loss used in the SiamFC tracking model. The Triplet loss contains more elements, which helps to mine more of the potential relationships among the exemplar, positive instances, and negative instances. Moreover, a channel attention module is introduced that learns the weight coefficients of different feature channels during training, thereby achieving adaptive tracking of different objects. In addition, the large-scale ImageNet detection [24] and COCO detection [25] datasets are introduced; through the data augmentation strategy, the training set is extended to address the imbalanced distribution of object categories. In the online tracking phase, a distractor-aware module is introduced that effectively suppresses distractors in the background and makes the model more robust to the tracked object.
3.1 Triplet loss in the Siamese network
We can split the instance set χ in SiamFC into a positive instance set χp and a negative instance set χn. Considering other exemplar inputs, we can then construct triplets from SiamFC's input, i.e., tuples containing an exemplar, a positive instance, and a negative instance. However, SiamFC only uses pairwise losses and ignores the potential relationship between positive and negative instances. We therefore adopt a Triplet loss to exploit the potential relationships among the inputs. Since the instance set χ can be divided, the similarity score set ν of the Siamese network can likewise be divided into a positive score set νp and a negative score set νn, and we can define the Triplet loss directly on the score pairs (vpi, vnj). We use a matching probability to measure each score pair, defined as follows:
prob(vpi, vnj) = exp(vpi) / (exp(vpi) + exp(vnj)), (3)
The expectation during training is that the positive scores should be as high as possible and the negative scores as low as possible. Therefore, the goal of the Triplet loss is to maximize the joint probability over all score pairs. Taking the logarithm of the joint probability yields the objective function shown in Eq. (4):
Lt(Yp, Yn, V) = −(1/(MN)) Σi=1..M Σj=1..N log prob(vpi, vnj), (4)
Fig. 3.1. Structure of the Triplet loss layer. We generate M × N score pairs from the positive and negative scores in the score map: the red rectangle of size M × N is generated by repeating the M positive scores N times, and the blue rectangle by repeating the N negative scores M times. Finally, the score pairs are used by the Triplet loss to compute the loss of the network.
Fig. 3.2. Channel attention module. The feature map U (W × H × C) is input to the channel attention module, which generates the weight coefficients β (1 × 1 × C); β is then applied to U to obtain the weighted feature map Ũ.
where the balance weight 1/(MN) keeps the loss on the same scale for different numbers of instances, and M and N denote the numbers of positive and negative instances, respectively.
Compared with the logistic loss Ll in Eq. (2), the Triplet loss Lt has the following advantages. First, Lt is a weighted average over M × N variates (the combinations of M exemplar-positive pairs and N exemplar-negative pairs), whereas Ll contains only M + N varied losses (M exemplar-positive pairs plus N exemplar-negative pairs); if M ≥ 2 and N ≥ 2, then M × N ≥ M + N, as shown in Fig. 3.1. A loss containing more varied losses yields a more powerful representation, since it can capture more information between positive and negative instances. Second, Lt is defined on the original score map through combinations of positive and negative scores, so the same input feeds the network and no additional computation is required during training.
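A minimal NumPy sketch of Eqs. (3)-(4): all M × N score pairs are formed by broadcasting, and the loss is the negative mean log matching probability. The values of M and N below are illustrative; note that log prob(vpi, vnj) = −log(1 + exp(vnj − vpi)), which is the numerically stable form used here.

```python
# Minimal sketch of the Triplet loss in Eqs. (3)-(4); M and N are illustrative.
import numpy as np

def triplet_loss(pos_scores, neg_scores):
    """L_t = -(1/MN) * sum_i sum_j log( e^{vp_i} / (e^{vp_i} + e^{vn_j}) )."""
    vp = pos_scores.reshape(-1, 1)          # M x 1
    vn = neg_scores.reshape(1, -1)          # 1 x N
    # log prob(vp_i, vn_j) = -log(1 + exp(vn_j - vp_i))
    log_prob = -np.log1p(np.exp(vn - vp))   # M x N matrix of score pairs
    return -log_prob.mean()                 # mean = (1/MN) * double sum

pos = np.random.randn(13)                   # M = 13 positive scores (illustrative)
neg = np.random.randn(276)                  # N = 276 negative scores (illustrative)
print(triplet_loss(pos, neg))
```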
    3.2 Channel attention
Intuitively, different feature channels contribute differently to different tracked objects. For example, some channels may be very important when tracking certain objects but much less important when tracking others. Ideally, the importance of each feature channel should be adjusted automatically according to the tracked object, i.e., the channel weights of the model should adapt to the target. We therefore design a channel attention module in our model, whose purpose is to learn the weight coefficients of the different feature channels according to the objective function of the model.
In this paper, a feed-forward neural network implements the channel attention module. The input to the module is the output of the embedding layer, i.e., the feature map U (W × H × C) extracted from the exemplar image, where W and H are the width and height of the feature map and C is the number of channels. As shown in Fig. 3.2, the channel attention module consists of a global pooling layer, fully connected layers fc1 and fc2, and a Sigmoid layer. Feeding U into the module produces the channel weight coefficients β (1 × 1 × C) = {β1, β2, …, βC}; the feature map U is then weighted by β to obtain the weighted feature map Ũ (W × H × C). Note that channel attention is involved only in the exemplar branch of the Siamese network, and the module is run only on the first frame, which keeps the running speed high.
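The following NumPy sketch illustrates the structure in Fig. 3.2 (global average pooling, fc1 with ReLU, fc2 with Sigmoid, then channel-wise weighting). The reduction ratio r and the random weights are assumptions for illustration; in the real model the fc-layer weights are learned offline together with the rest of the network.

```python
# Minimal sketch of the channel attention module in Fig. 3.2; r is assumed.
import numpy as np

def channel_attention(U, W1, W2):
    """U: W x H x C feature map; returns the weighted feature map U~."""
    s = U.mean(axis=(0, 1))                       # global average pooling, shape (C,)
    hidden = np.maximum(0, s @ W1)                # fc1 + ReLU, shape (C/r,)
    beta = 1.0 / (1.0 + np.exp(-(hidden @ W2)))   # fc2 + Sigmoid, shape (C,)
    return U * beta                               # broadcast the 1x1xC weights

C, r = 256, 16                                    # illustrative channel count / ratio
U = np.random.randn(6, 6, C)
W1 = np.random.randn(C, C // r) * 0.1             # would be learned in training
W2 = np.random.randn(C // r, C) * 0.1
U_weighted = channel_attention(U, W1, W2)
```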
Fig. 3.3. Sample pairs obtained by data augmentation. The first row shows exemplar images and the second row the corresponding search images; the images are from the COCO dataset.
3.3. Training data augmentation
In object tracking tasks, the most important requirement for end-to-end model training is high-quality training data. However, the SiamFC tracker is trained only on the ILSVRC dataset, which contains only 4000 labeled video sequences. To make the model generalize better, more categories of sample pairs are needed. We therefore greatly expand the sample-pair categories by introducing the large-scale ImageNet detection set and COCO detection set. As shown in Fig. 3.3, through data augmentation techniques (translation, resizing, grayscale, etc.), still images from the detection datasets can generate image pairs for training, greatly increasing the categories of sample pairs. Specifically, a training pair is generated from the same labeled original image: the exemplar image is cropped from the original image centered on the tracked object and resized to 127 × 127, while the original image containing the object undergoes an apparent translation, resizing, and so on, after which the search image is cropped with size fixed to 255 × 255, as sketched below. From Fig. 3.3 we can see that the tracked objects in the pairs generated from still images (e.g., bus, plate, and athlete) never appear in the video datasets, which means more categories of samples are generated. This diversity of sample-pair categories promotes the generalization ability of the tracker.
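A minimal sketch of how one labeled still image can yield an exemplar-search training pair under the augmentations described above. Here `crop_and_resize` is a hypothetical helper (assumed to crop a square region around the box and pad beyond image borders), and the shift and scale ranges are illustrative assumptions.

```python
# Minimal sketch of pair generation from a still image; crop_and_resize is a
# hypothetical helper, and the augmentation ranges are illustrative.
import random

def make_pair(image, box):                        # box = (cx, cy, w, h)
    exemplar = crop_and_resize(image, box, out_size=127)
    dx, dy = random.randint(-32, 32), random.randint(-32, 32)
    scale = random.uniform(0.95, 1.05)            # apparent translation + resizing
    shifted = (box[0] + dx, box[1] + dy, box[2] * scale, box[3] * scale)
    search = crop_and_resize(image, shifted, out_size=255)
    return exemplar, search                       # 127x127 and 255x255 crops
```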
3.4. Distractor-aware module
Even after data augmentation, it remains difficult to transfer the general model to a specific video domain when tracking a particular object, so exploiting context information is very important. For this reason, a distractor-aware module is introduced. Its core idea is that, during tracking, many object candidates are collected from the previous frame, and non-maximum suppression (NMS) is used to select the potential distractors. In the tracking phase, the score map produced by the distractors against the search image is subtracted from the score map we obtain. Specifically, in each frame's detection results, NMS first filters out the set of possible distractors D = {∀di ∈ D, f(z, di) > h ∩ di ≠ zt}, where z denotes the tracked object of the current frame, h denotes a given threshold, di denotes a possible distractor, zt is the selected tracked object in the t-th frame, and the number of elements in the set is |D| = n. After selecting the distractor set in each frame, a distractor-aware objective function is obtained based on Eq. (1), and the proposal set P containing the top-k candidates most similar to the exemplar is re-ranked. Denote the final selected object as q:
q = argmax_{pk∈P} [ f(z, pk) − (â / Σi=1..n ai) · Σi=1..n ai f(di, pk) ], (6)
where the weight factor â controls the overall influence of distractor learning, and the weight factor ai controls the influence of each distractor di. Intuitively, the distractor-aware module works as follows: the candidate proposal pk selected in the current frame should be as similar as possible to the object z, i.e., the response score f(z, pk) should be as high as possible, while at the same time differing as much as possible from the distractors di, i.e., the response scores f(di, pk) should be as small as possible. Note that with the direct calculation in Eq. (6), the computational complexity and memory usage increase by a factor of n. Since the cross-correlation operation in Eq. (1) is a linear operator, we use this property to speed up the calculation, transforming Eq. (6) into Eq. (7) to reduce the computational complexity:

q = argmax_{pk∈P} [ (φ(z) − (â / Σi=1..n ai) · Σi=1..n ai φ(di)) ∗ φ(pk) ], (7)
Through this transformation, the computational complexity of the tracking process is reduced and the tracking speed is accelerated. Here ∗ again denotes the cross-correlation operation.
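A minimal NumPy sketch of the re-ranking in Eq. (7). For simplicity the embeddings are treated as flattened vectors, so cross-correlation reduces to an inner product; the key point it shows is that the weighted distractor term is folded into the template once, instead of running n extra correlations per proposal as in Eq. (6).

```python
# Minimal sketch of Eq. (7); embeddings are flattened vectors for simplicity.
import numpy as np

def select_object(phi_z, phi_distractors, phi_proposals, a_hat, a):
    """phi_*: embedding vectors; a: one weight per distractor; a_hat: global weight."""
    correction = a_hat * sum(ai * phi_di
                             for ai, phi_di in zip(a, phi_distractors)) / np.sum(a)
    template = phi_z - correction                 # (phi(z) - weighted sum) computed once
    scores = [np.sum(template * phi_pk) for phi_pk in phi_proposals]
    return int(np.argmax(scores))                 # index of the selected proposal q
```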
4. Experiments
4.1. Pre-processing training data
During training, the input to the model is a pair of images (exemplar-search), so samples in the original datasets must be pre-processed into image pairs suitable for training. Training data come from both video (the ImageNet Video dataset) and still images (the ImageNet detection and COCO detection datasets). Video pairs are generated by the same method as in [10]: sample pairs are extracted from two different frames of the same video, at most T frames apart, and the exemplar and search images are cropped centered on the tracked object. For the still-image datasets, we use the method described in Section 3.3 to generate a large number of sample pairs with diverse categories. During training, we set the feeding ratio between video image pairs and still image pairs to 1:2 within each batch.
The exemplar and search image sizes are fixed to 127 × 127 and 255 × 255, respectively. To keep all crops square, we add a context margin around the original bounding box following Eq. (8). Assuming the target bounding box size is (w, h) and the context margin is p, the image scaling factor s is computed by Eq. (8). For the exemplar image, A = 127 × 127 and the context margin is p = (w + h)/4, and the final exemplar region is a square centered on the object bounding box. The same holds for the scale of the search image, except that A = 255 × 255.
s(w + 2p) × s(h + 2p) = A, (8)
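A small worked example of Eq. (8) for the exemplar crop; the box size is arbitrary and only illustrates how the scaling factor s is derived.

```python
# Worked example of Eq. (8): s(w + 2p) * s(h + 2p) = A, with p = (w + h)/4.
def exemplar_scale(w, h, A=127 ** 2):
    p = (w + h) / 4                                 # context margin
    crop_side = ((w + 2 * p) * (h + 2 * p)) ** 0.5  # square crop side before scaling
    return A ** 0.5 / crop_side                     # scaling factor s

print(exemplar_scale(50, 80))                       # ~0.98 for a 50x80 box
```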
4.2. Training
Our model is obtained through end-to-end training on large datasets; the model structure is shown in Fig. 1.1. The training process is roughly as follows. First, the pre-processed sample pairs (exemplar-search) are fed into the two branches (φ(z), φ(x)) of the Siamese network, whose weights are shared. After the embedding function in each branch, output feature maps are obtained. The feature map from the exemplar branch is then weighted by the channel attention module to obtain the channel-weighted feature map. Next, a cross-correlation operation is performed on the feature maps of the two branches to obtain the final score map. Finally, the model is trained on the score map using the Triplet loss as the objective function until convergence. The training details are as follows:
(1) VGG16 as backbone network
There are three reasons for choosing VGG16 rather than AlexNet as the backbone for feature extraction. First, we need a CNN without padding layers, which rules out most up-to-date CNNs, such as AlexNet, since they require padding operations. Second, VGG16 has more convolutional layers than AlexNet (which has only 5), and a deeper network provides a stronger feature representation. Third, VGG-like networks can be pre-trained on a classification dataset and then adapted to tracking tasks [26]. The structural details of VGG16 are shown in Table 4.1.
(2) Training details
Stochastic gradient descent (SGD) is applied with the learning rate decayed from 10^−4 to 10^−7 during training. The whole training process spans over 200 epochs, with a mini-batch size of 8 for gradient estimation in each iteration. To handle deep network training well, a Tesla K80 GPU is used to accelerate model training; the CPU is a Core i5-7300HQ quad-core processor, and TensorFlow 1.9.0 is adopted as the deep learning framework.
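A minimal sketch of the stated optimization schedule, assuming the learning rate is decayed geometrically from 10^−4 to 10^−7 over the 200 epochs (the decay shape is an assumption; the paper only gives the endpoints); the training loop body is elided.

```python
# Minimal sketch of the SGD schedule; geometric decay is assumed.
epochs, lr_start, lr_end = 200, 1e-4, 1e-7
decay = (lr_end / lr_start) ** (1.0 / (epochs - 1))   # ~0.9659 per epoch

for epoch in range(epochs):
    lr = lr_start * decay ** epoch                    # 1e-4 at epoch 0, 1e-7 at epoch 199
    # for exemplar, search in loader:                 # mini-batch size 8
    #     ...forward pass, Triplet loss, SGD step with rate lr...
```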

Table 4.1. Structural details of VGG16. VGG16 has 11 convolutional layers and 3 max-pooling layers; all convolutional layers are followed by ReLU except the last one, which generates the outputs.
4.3. Online tracking
In the online tracking phase, the object position is given in the first frame of the video, and the search region in the next frame is extracted centered on that position. Similarly, throughout the video subsequence, the estimated object position in the previous frame serves as the center of the search region in the current frame. The image pair generated from two consecutive frames is fed to the model to obtain a 17 × 17 score map, which is then upsampled from 17 × 17 to 257 × 257 using bicubic interpolation; since the original score map is relatively coarse, the upsampling allows the object to be located more accurately. To handle scale variation, we search over three scales 1.040^{−1, 0, 1}, penalize scale changes at a rate of 0.97, and update the scale with a learning rate of 0.59.
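A minimal sketch of one online tracking step under the settings above: the three scaled score maps are penalized, the best scale is chosen, the winning 17 × 17 map is upsampled to 257 × 257 with cubic interpolation, and the scale estimate is updated with learning rate 0.59. The data layout and the way the 0.97 penalty is applied are assumptions.

```python
# Minimal sketch of the scale search; layout and penalty application assumed.
import numpy as np
from scipy.ndimage import zoom

scales = 1.040 ** np.array([-1, 0, 1])
penalty = np.array([0.97, 1.0, 0.97])              # penalize the non-unit scales

def track_step(score_maps, current_scale, lr=0.59):
    """score_maps: list of three 17x17 maps, one per scaled search crop."""
    peaks = np.array([m.max() for m in score_maps]) * penalty
    best = int(np.argmax(peaks))
    fine = zoom(score_maps[best], 257 / 17, order=3)   # cubic upsampling to 257x257
    dy, dx = np.unravel_index(np.argmax(fine), fine.shape)
    new_scale = (1 - lr) * current_scale + lr * current_scale * scales[best]
    return (dy, dx), new_scale                     # peak location and damped scale
```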
Fig. 4.1. Visualization of the class activation map.
4.4. Visual analysis of the model
(1) Class activation map visualization
The class activation map (CAM) method generates a class-activation heat map for an input image. CAM lets us intuitively see which part of an image leads the network (VGG16) to its final tracking decision. Specifically, a CAM is a two-dimensional score grid associated with a particular tracked object, computed at each position of the input image, indicating how important each location is to the tracking decision. The specific implementation is the Grad-CAM method proposed by Selvaraju et al. [27]. The results are shown in Fig. 4.1: the input image shows where the tracked object is located, and the class-activation heat map is obtained from the CAM visualization. Finally, the heat map is superimposed on the original image, giving a more intuitive view of the tracking decision. Observing the heat map, we find that the response is very high in the upper-left corner, indicating that this position strongly influences the tracking decision, whereas the surrounding area has little or no effect. The superimposed image then verifies that the position of the highest response is the location of the tracked object.
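A minimal sketch of the Grad-CAM computation [27], assuming the backbone activations and the gradients of the tracking score with respect to them are already available: the channel weights are the globally averaged gradients, and the ReLU keeps only locations that contribute positively.

```python
# Minimal Grad-CAM sketch; activations and gradients are assumed given.
import numpy as np

def grad_cam(activations, grads):
    """activations, grads: H x W x K arrays from the chosen conv layer."""
    alpha = grads.mean(axis=(0, 1))               # one weight per channel, shape (K,)
    cam = np.maximum(0, activations @ alpha)      # ReLU(sum_k alpha_k * A^k), H x W
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1] heat map
```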
(2) Visualization of the score map
While the class-activation heat map tells us which part of the image plays a crucial role in the tracking decision, the score map output by the model determines the final localization of the object. Since the original 17 × 17 score map is relatively coarse, it is upsampled from 17 × 17 to 257 × 257, after which the object can be located more accurately. We compare the score maps of the SiamFC tracker and ours on the same frames, as shown in Fig. 4.2. The comparison shows that when there is intra-class interference around the tracked object, our tracker still maintains a high response to the target. For example, in the basketball sequence there are many distractors around the object; our model suppresses these distractors well, whereas SiamFC still responds strongly to them.
Fig. 4.2. Visualization of the score map. The first row shows the video frames in which the object is tracked, the second row the output score maps of the SiamFC model, and the third row the output score maps of our model.

Table 4.2. Evaluation on the VOT2016 benchmark.
4.5. VOT2016 benchmark
Visual Object Tracking (VOT) is a test platform for single-object tracking tasks. Since 2013 it has become one of the three major benchmarks in the single-object tracking field, the other two being OTB and ALOV. There are many versions of the VOT benchmark, e.g., VOT2015 [28], VOT2016 [29], and VOT2017 [30]. VOT2015 and VOT2016 contain the same sequences, but the ground-truth labels in VOT2016 are more accurate than those in VOT2015; in VOT2017, only ten sequences from VOT2016 were replaced with new ones. Here we use the VOT2016 benchmark to evaluate our model. Tracking results on VOT2016 are shown in Fig. 4.3. In the first and second rows of Fig. 4.3, the tracked object undergoes fast motion and intra-class interference, yet our model tracks it accurately without losing the target. In the third and fifth rows, the tracked object undergoes out-of-view and illumination-change challenges, yet our model still tracks the targets accurately and robustly.
The main evaluation metrics of the VOT benchmark are expected average overlap (EAO), robustness (R), and accuracy (A). A good tracker has high accuracy and EAO scores and a very low robustness score. The proposed model is compared with the SiamFC tracker on VOT2016, including a comparison of tracking speed; the evaluation results are shown in Table 4.2. Our tracker outperforms SiamFC in accuracy, robustness, and expected average overlap: the accuracy and EAO scores increase by 0.011 and 0.087, respectively, and the robustness improves by 15% over SiamFC. Although the model does not match SiamFC's tracking speed, real-time speed (>25 FPS) is still achieved. The drop in FPS is mainly attributable to GPU performance and the adoption of the more complex VGG16 network in our model.
Fig. 4.3. Tracking results on the VOT2016 benchmark. The video sequences from the first row to the fifth row are bolt1, basketball, butterfly, ball1, and singer1, respectively.
    5. Conclusion
In this paper, we proposed a real-time object tracker based on an improved fully-convolutional Siamese network. During offline training, a Triplet loss and an attention mechanism are introduced into our model, and a data augmentation strategy is adopted to enrich the categories of sample pairs, which significantly boosts the generalization power of the model. During inference, a distractor-aware module is adopted in the tracking phase to effectively suppress intra-class interfering objects. In the experiments, the score map of the model is visualized, showing that our model significantly suppresses distractors in the background while responding strongly to the foreground object. The model is then tested on the VOT2016 benchmark, where the results show that our proposed model achieves better accuracy, robustness, and EAO than SiamFC. Although the model does not match SiamFC's tracking speed, real-time speed is still achieved.
Declaration of competing interest
None.
Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.compeleceng.2020.106755.