
# How to Understand SiamRPN++?

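### Object Tracking:

• A method that uses the first frame of a video sequence (including the position of the bounding box) to locate the target in the subsequent frames.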

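### Siamese Network Architecture:

• Siamese networks were applied early on in face recognition ("Learning a Similarity Metric Discriminatively, with Application to Face Verification"). The idea is to feed a training sample (known class) and a test sample (unknown class) into two CNNs (which usually share weights) to obtain two feature vectors, and then compute the similarity of the two vectors: the higher the similarity, the more likely the two samples belong to the same class. In that paper, the similarity is measured with the L1 distance.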

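• In object tracking, the earliest work to use this idea is SiamFC (see its network architecture figure). The input z is the ROI of the first frame (the bounding box selected manually in the first frame), and x is a later frame. Each passes through a CNN, which gives two feature maps. One more convolution (cross-correlation) step, with the template feature map acting as the kernel (L = (22-6)/1+1 = 17), yields a final response map of shape 17*17*1. The higher the response at a position in this map, the more likely the target is there. The idea is essentially the face-verification Siamese network above with one extra convolution step. The loss is computed from this response map: $y \in \{-1,+1\}$ is the ground-truth label and v is the value at the corresponding position of the response map. The first expression of the loss is the per-position loss, the second averages the first over all positions, and the third is the optimization objective.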
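The three expressions mentioned above are not reproduced in this text. For reference, this is how the SiamFC paper writes them, where $\mathcal{D}$ is the set of positions in the response map, $u$ indexes one position, and $\theta$ denotes the network parameters (the symbol names follow that paper, not these notes):

$$\ell(y, v) = \log\left(1 + \exp(-yv)\right)$$

$$L(y, v) = \frac{1}{|\mathcal{D}|}\sum_{u \in \mathcal{D}} \ell\left(y[u], v[u]\right)$$

$$\arg\min_{\theta} \; \mathbb{E}_{(z,x,y)} \, L\left(y, f(z, x; \theta)\right)$$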

• However, both Siamese-FC and CFNet are lack of bounding box regression and need to do multi-scale test which makes it less elegant. The main drawback of these real-time trackers is their unsatisfying accuracy and robustness compared to state-of-the-art correlation filter approaches.

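• SiamRPN was proposed to solve these two problems of SiamFC. Its first half, the Siamese Network part, is identical to SiamFC (only the channel counts differ); the difference lies in the RPN part that follows. The feature maps pass through a convolution layer into two branches. In the Classification Branch, the 4*4*(2k*256) feature map is equivalent to 2k convolution kernels of shape 4*4*256, which are convolved with the 20*20*256 feature map to produce a 17*17*2k feature map; the Regression Branch works the same way. The RPN performs multi-task learning. The 17*17*2k map separates target from background: it is split into k groups, each with one positive channel and one negative channel, and is trained with softmax + cross-entropy loss. The 17*17*4k map is likewise split into k groups of four channels each, which predict dx, dy, dw, dh. Here k is the number of anchors generated per position. For how anchor boxes work, see the anchor illustration in the SSD paper.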
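The shape arithmetic above can be checked with a small NumPy sketch. It is an illustration only, not the SiamRPN implementation: the helper names `xcorr_valid` and `up_channel_xcorr` are hypothetical, the features are random, and the channel count is reduced from 256 to 8 so the loops stay fast.

```python
import numpy as np

def xcorr_valid(search, kernel):
    """Valid (no padding, stride 1) cross-correlation.

    search: (C, H, W), kernel: (C, h, w) -> response of shape (H-h+1, W-w+1).
    """
    _, H, W = search.shape
    _, h, w = kernel.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[:, i:i + h, j:j + w] * kernel)
    return out

def up_channel_xcorr(search_feat, template_feat, n_out):
    """Split template_feat (n_out*C, h, w) into n_out kernels of shape
    (C, h, w) and correlate each one with search_feat."""
    C = search_feat.shape[0]
    kernels = template_feat.reshape(n_out, C, *template_feat.shape[1:])
    return np.stack([xcorr_valid(search_feat, kern) for kern in kernels])

rng = np.random.default_rng(0)
k = 5   # number of anchors (illustrative value)
C = 8   # 256 in the text; reduced here to keep the demo fast
search_feat = rng.standard_normal((C, 20, 20))
cls_kernels = rng.standard_normal((2 * k * C, 4, 4))  # classification branch
reg_kernels = rng.standard_normal((4 * k * C, 4, 4))  # regression branch

cls_map = up_channel_xcorr(search_feat, cls_kernels, 2 * k)
reg_map = up_channel_xcorr(search_feat, reg_kernels, 4 * k)
print(cls_map.shape, reg_map.shape)  # (10, 17, 17) (20, 17, 17)
```

The spatial size follows from 20 - 4 + 1 = 17, and the number of output channels equals the number of kernels the template branch is split into (2k for classification, 4k for regression).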

## SiamRPN++

### Abstract & Introduction

However, Siamese trackers still have an accuracy gap compared with state-of-the-art algorithms and they cannot take advantage of features from deep networks, such as ResNet-50 or deeper. (Earlier Siamese trackers could not make use of deeper networks.)

We observe that all these trackers have built their network upon architecture similar to AlexNet [23] and tried several times to train a Siamese tracker with more sophisticated architecture like ResNet [14] yet with no performance gain. (They all use AlexNet, and switching to ResNet did not help.)

Since the target may appear at any position in the search region, the learned feature representation for the target template should stay spatial invariant, and we further theoretically find that, among modern deep architectures, only the zero-padding variant of AlexNet satisfies this spatial invariance restriction. (Only AlexNet without padding layers satisfies this spatial, i.e. translation, invariance.)

By analyzing the Siamese network structure for cross-correlations, we find that its two network branches are highly imbalanced in terms of parameter number; therefore we further propose a depth-wise separable correlation structure which not only greatly reduces the parameter number in the target template branch, but also stabilizes the training procedure of the whole model. (The template branch and the search branch are severely imbalanced in parameter count.)

• This imbalance is already visible in the SiamRPN architecture diagram.
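To make the contrast with the up-channel design concrete, here is a minimal NumPy sketch of a depth-wise cross-correlation, in which each template channel is correlated only with the matching search channel. The function name `depthwise_xcorr`, the 20*20 and 4*4 feature sizes (borrowed from the SiamRPN description) and k = 5 are assumptions for illustration; the layers that follow the correlation in the real model are not shown.

```python
import numpy as np

def depthwise_xcorr(search_feat, template_feat):
    """Channel-by-channel valid cross-correlation.

    search_feat: (C, H, W), template_feat: (C, h, w)
    returns:     (C, H-h+1, W-w+1), one response map per channel.
    """
    C, H, W = search_feat.shape
    _, h, w = template_feat.shape
    out = np.empty((C, H - h + 1, W - w + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search_feat[:, i:i + h, j:j + w]
            # Sum over the spatial axes only, so channels never mix.
            out[:, i, j] = np.sum(window * template_feat, axis=(1, 2))
    return out

rng = np.random.default_rng(0)
C, k = 256, 5  # k anchors, illustrative value
search_feat = rng.standard_normal((C, 20, 20))
template_feat = rng.standard_normal((C, 4, 4))
response = depthwise_xcorr(search_feat, template_feat)
print(response.shape)  # (256, 17, 17)

# Channels the template branch must produce before the correlation:
up_channel_template_channels = 2 * k * C  # SiamRPN classification branch
depthwise_template_channels = C           # depth-wise variant
print(up_channel_template_channels, depthwise_template_channels)  # 2560 256
```

With the depth-wise variant the template branch no longer has to be expanded to 2k*256 (or 4k*256) channels, which is where the reduction in template-branch parameters comes from.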

### Siamese Tracking with Very Deep Networks

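• Analysis of the original Siamese network: the earlier Siamese trackers can be written as
$f(z,x)=\sigma(z)*\sigma(x)+b$

Here $\sigma$ is the shared feature extractor, $*$ denotes cross-correlation, and $b$ is a bias term. This design implies two restrictions: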

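The contracting part and the feature extractor used in Siamese trackers have an intrinsic restriction for strict translation invariance, $f(z, x[\Delta\tau_j]) = f(z, x)[\Delta\tau_j]$, where $[\Delta\tau_j]$ is the translation shift sub window operator, which ensures the efficient training and inference. (Running the computation on only an ROI of x should give the same result as running it on the whole of x and then taking that ROI of the output.)

The contracting part has an intrinsic restriction for structure symmetry, i.e. $f(z, x') = f(x', z)$, which is appropriate for the similarity learning. (Following the idea of similarity measurement, using z as the kernel and using the target region of x as the kernel should give the same result.)

The former is the strict translation invariance restriction; the latter is the structure symmetry (target similarity) restriction.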
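Both restrictions can be checked numerically for the correlation step alone. The sketch below assumes an identity feature extractor (so it says nothing about padding inside a real backbone), uses single-channel random maps, and picks an arbitrary sub-window offset and size; the helper `xcorr2d` is hypothetical.

```python
import numpy as np

def xcorr2d(x, z):
    """Valid, stride-1 cross-correlation of a 2-D map x with a 2-D template z."""
    H, W = x.shape
    h, w = z.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * z)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((22, 22))  # search map
z = rng.standard_normal((6, 6))    # template map

# Translation invariance: correlating a sub-window of x equals taking
# the corresponding sub-window of the full response.
full = xcorr2d(x, z)               # shape (17, 17)
dy, dx, size = 3, 5, 12            # arbitrary sub-window offset and size
sub = xcorr2d(x[dy:dy + size, dx:dx + size], z)
out_size = size - z.shape[0] + 1   # 12 - 6 + 1 = 7
translation_ok = np.allclose(sub, full[dy:dy + out_size, dx:dx + out_size])

# Structure symmetry: for a search patch x' with the template's size,
# swapping the two arguments gives the same response.
x_prime = x[:6, :6]
symmetry_ok = np.allclose(xcorr2d(x_prime, z), xcorr2d(z, x_prime))

print(translation_ok, symmetry_ok)  # True True
```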

#### Spatial Aware Sampling Strategy

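• This strategy of drawing the target's offset from a uniform distribution is called the spatial aware sampling strategy. To confirm that it works, the authors also ran tests on VOT2016 and VOT2018 and found that the EAO metric (roughly, the mean of the per-frame tracking accuracy a, where a is measured by IoU) generally rises as the shift range grows. In these tests, a shift of 64 pixels works well.

• Once the deep network's learned bias toward the center position is removed, a network of any architecture can be used for visual tracking as needed.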
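A minimal sketch of such sampling, assuming the offset is drawn independently per axis from a uniform distribution on [-64, 64] pixels and applied by moving the center of the search crop. The function names and the crop convention are hypothetical and not taken from the authors' code.

```python
import numpy as np

def sample_shift(rng, max_shift=64):
    """Draw a (dx, dy) offset in pixels uniformly from [-max_shift, max_shift]."""
    return rng.uniform(-max_shift, max_shift, size=2)

def shifted_crop_center(target_center, rng, max_shift=64):
    """Return the search-crop center that places the target off-center by a
    random offset, together with that offset."""
    dx, dy = sample_shift(rng, max_shift)
    cx, cy = target_center
    return (cx - dx, cy - dy), (dx, dy)

rng = np.random.default_rng(0)
shifts = np.array([sample_shift(rng) for _ in range(10000)])
print(shifts.min(axis=0), shifts.max(axis=0))  # both stay within [-64, 64]
print(shifts.mean(axis=0))                     # close to zero on average

crop_center, offset = shifted_crop_center((100.0, 100.0), rng)
```

With `max_shift=0` the target always sits at the crop center, which is the setting under which a network can learn the center bias described above.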

#### Transferring SiamRPN to ResNet

Identity block (input and output have the same size, which makes it easy to deepen the network):

The original ResNet has a large stride of 32 pixels, which is not suitable for dense Siamese network prediction. As shown in Fig. 3, we reduce the effective strides at the last two block from 16 pixels and 32 pixels to 8 pixels by modifying the conv4 and conv5 block to have unit spatial stride, and also increase its receptive field by dilated convolutions [27]. An extra 1×1 convolution layer is appended to each of block outputs to reduce the channel to 256.

ResNet's usual total stride is 32, but in tracking the object may move very little between consecutive frames, so the authors change the effective stride of the last two blocks to 8 pixels and append an extra 1*1 convolution layer that reduces the channel count to 256. The outputs of stages 3/4/5 are then fed into SiamRPN modules for Layer-wise Aggregation.

#### Layer-wise Aggregation (Multi-layer Feature Fusion)

In the previous works which only use shallow networks like AlexNet, multi-level features cannot provide very different representations. However, different layers in ResNet are much more meaningful considering that the receptive field varies a lot. Features from earlier layers will mainly focus on low level information such as color, shape, are essential for localization, while lacking of semantic information; Features from latter layers have rich semantic information that can be beneficial during some challenge scenarios like motion blur, huge deformation. The use of this rich hierarchical information is hypothesized to help tracking.

Since the output sizes of the three RPN modules have the same spatial resolution, weighted sum is adopted directly on the RPN output. A weighted-fusion layer combines all the outputs.
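The weighted sum over the three RPN outputs can be sketched as follows. Normalizing the weights with a softmax is an assumption made here so that they sum to one, and the function names, the (2k, 25, 25) output shape and the random inputs are placeholders rather than values from the paper.

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def weighted_fusion(outputs, raw_weights):
    """Weighted sum of same-shaped RPN outputs.

    outputs:     list of L arrays with identical shape
    raw_weights: L raw (learnable) scalars, normalized here by softmax
    """
    weights = softmax(np.asarray(raw_weights, dtype=float))
    stacked = np.stack(outputs)                    # (L, ...)
    return np.tensordot(weights, stacked, axes=1)  # contracts the L axis

rng = np.random.default_rng(0)
k = 5
# Classification outputs of the RPN modules fed by stages 3, 4 and 5.
cls_outputs = [rng.standard_normal((2 * k, 25, 25)) for _ in range(3)]

fused = weighted_fusion(cls_outputs, raw_weights=[0.0, 0.0, 0.0])
print(fused.shape)  # (10, 25, 25)
```

With equal raw weights the fusion reduces to a plain average; training the weights lets the model favor the layers that help most. The same function applies to the regression outputs with their own set of weights.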

#### Experimental Results

The EAO score ranks first, well ahead of the second-place tracker.

