
[SOT] Siamese RPN++ Paper and Code Walkthrough

Master's student in Transportation Engineering, Southeast University

1. Introduction

The paper and code walkthroughs for Siamese FC and Siamese RPN have already been published in the Deep Learning [Object Tracking] column; please check them out if needed.

In the previous post, our analysis of the Siamese RPN network left an open question:

Why does Siamese RPN, proposed in 2018, still use the old AlexNet as the feature extraction backbone of its Siamese network?

We finally find the answer in the Siamese RPN++ paper, which states:

We observe that all these trackers have built their network upon architecture similar to AlexNet and tried several times to train a Siamese tracker with more sophisticated architecture like ResNet yet with no performance gain. Inspired by this observation, we perform an analysis of existing Siamese trackers and find the core reason comes from the destruction of the strict translation invariance.

In other words, the researchers had already run this experiment themselves: simply swapping AlexNet out for ResNet brought no performance gain, and the most likely cause is that strict translation invariance is destroyed by the deeper network.

The authors then note:

Since the target may appear at any position in the search region, the learned feature representation for the target template should stay spatial invariant, and we further theoretically find that, among modern deep architectures, only the zero-padding variant of AlexNet satisfies this spatial invariance restriction.

Since the tracked target may appear anywhere in the search image, the feature representation learned for the target in the template must be spatially invariant. The authors find that only zero-padding networks (such as AlexNet) satisfy this restriction, which rules out modern backbones like ResNet or the lightweight MobileNet.

The most important contribution of this paper is not the SiameseRPN++ network itself, but breaking the restriction on backbone choice for Siamese trackers.

So how do the authors remove this restriction?

we introduce a simple yet effective sampling strategy to break the spatial invariance restriction of the Siamese tracker

Indeed: a simple yet effective sampling strategy breaks the restriction, which will be explained in detail below.

With the padding restriction lifted, the authors also propose a ResNet-based network called Siamese RPN++. The paper puts it this way:

Benefiting from the ResNet architecture, we propose a layer-wise feature aggregation structure for the cross-correlation operation, which helps the tracker to predict the similarity map from features learned at multiple levels.

Clearly, the authors fuse feature maps from multiple levels to obtain more accurate predictions, a technique that appears again and again in detection models such as YOLO.

The improvements go beyond feature fusion. The authors also mention:

By analyzing the Siamese network structure for cross-correlations, we find that its two network branches are highly imbalanced in terms of parameter number; therefore we further propose a depth-wise separable correlation structure which not only greatly reduces the parameter number in the target template branch, but also stabilizes the training procedure of the whole model.

The correlation structure is also improved, reducing computation and stabilizing training; the details follow in the structure analysis.

This post walks through the paper in depth alongside the following code:

https://gitee.com/pako_lazysloth/SiameseX.PyTorch

2. Spatial-Aware Sampling Strategy

You may recall how Siamese FC is trained; we analyzed it in the Siamese FC paper and code walkthrough. During training, Siamese FC crops and pads the search image around the target's location to build the search patch, so throughout training the target always sits at the center of the search image.

(1) shift = 0

What problem does this cause? The authors ran a simulation and arrived at the following conclusion and figure:

In the first simulation with zero shift, the probabilities on the border area are degraded to zero. It shows that a strong center bias is learned despite of the appearances of test targets.

Here shift = 0 means the target is placed exactly at the center with no offset, simulating the training setup of Siamese FC. The result: no matter where the target appears in the test image, and no matter how distinctive its appearance, the prediction is strongly biased toward the image center, a direct consequence of the centered training crops.

(2) shift = 16 and shift = 32

The authors then tried distributing the target uniformly within a range around the center instead of pinning it there: the target is placed uniformly at random within a maximum offset, the shift, from the center.

Two simulations, with shift = 16 and shift = 32, produced the following figures and conclusion:

The other two simulations show that increasing shift ranges will gradually prevent model collapse into this trivial solution. The quantitative results illustrate that the aggregated heatmap of 32-shift is closer to the location distribution of test objects. It proves that the spatial aware sampling strategy effectively alleviate the break of strict translation invariance property caused by the networks with padding

As the shift grows, the high-response region spreads out and follows the actual distribution of test-object locations, instead of collapsing into the large center blob seen at shift = 0. The authors also evaluated on two datasets and confirmed that a nonzero shift improves tracking performance, with shift = 64 performing best, as shown below.

By now you should have a rough idea of the spatial-aware sampling strategy.

I had planned to walk through its implementation, but the strategy does not appear to be implemented in this codebase (or I simply missed it; if you spot it, please let me know in the comments).

So we skip the code for this strategy and move on to the Siamese RPN++ network structure.
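Since the repository does not seem to implement the strategy, here is a minimal sketch of what uniform-shift sampling could look like. The function name and the 255-pixel search size are my own illustrative choices, not the authors' code:

```python
import random

def sample_crop_center(img_size, max_shift):
    """Place the crop center uniformly within +/- max_shift pixels of the
    image center. max_shift=0 reproduces the centered SiamFC-style crop;
    larger values spread the target over the search region."""
    c = img_size / 2
    dx = random.uniform(-max_shift, max_shift)
    dy = random.uniform(-max_shift, max_shift)
    return c + dx, c + dy

print(sample_crop_center(255, 0))   # always the exact center: (127.5, 127.5)
cx, cy = sample_crop_center(255, 64)  # center shifted by up to 64 px each way
```

Cropping the search patch around the sampled center (rather than the target itself) is what makes the target land at varying offsets during training.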

3. Network Structure Analysis

The Siamese RPN++ architecture described in the paper consists of three parts:

  • Feature extraction backbone: a modified ResNet-50
  • Layerwise feature aggregation
  • Depthwise cross correlation

Each of the three parts is detailed below.

(1) Feature extraction backbone

The paper states:

The original ResNet has a large stride of 32 pixels, which is not suitable for dense Siamese network prediction. As shown in Fig.3, we reduce the effective strides at the last two block from 16 pixels and 32 pixels to 8 pixels by modifying the conv4 and conv5 block to have unit spatial stride, and also increase its receptive field by dilated convolutions. An extra 1 x 1 convolution layer is appended to each of block outputs to reduce the channel to 256.

In short, the original ResNet's stride is too large for dense prediction, so the stride-2 convolutions in conv4 and conv5 are changed to stride = 1, and dilated convolutions are used to preserve the receptive field. The code defines the backbone as follows:

import math

import torch.nn as nn


class ResNetPP(nn.Module):
    def __init__(self, block, layers, used_layers):
        self.inplanes = 64
        super(ResNetPP, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=0,  # 3
                               bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)

        self.feature_size = 128 * block.expansion
        self.used_layers = used_layers
        layer3 = True if 3 in used_layers else False
        layer4 = True if 4 in used_layers else False

        if layer3:
            self.layer3 = self._make_layer(block, 256, layers[2],
                                           stride=1, dilation=2)  # 15x15, 7x7
            self.feature_size = (256 + 128) * block.expansion
        else:
            self.layer3 = lambda x: x  # identity

        if layer4:
            self.layer4 = self._make_layer(block, 512, layers[3],
                                           stride=1, dilation=4)  # 7x7, 3x3
            self.feature_size = 512 * block.expansion
        else:
            self.layer4 = lambda x: x  # identity

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, stride=1, dilation=1):
        downsample = None
        dd = dilation
        if stride != 1 or self.inplanes != planes * block.expansion:
            if stride == 1 and dilation == 1:
                downsample = nn.Sequential(
                    nn.Conv2d(self.inplanes, planes * block.expansion,
                              kernel_size=1, stride=stride, bias=False),
                    nn.BatchNorm2d(planes * block.expansion),
                )
            else:
                if dilation > 1:
                    dd = dilation // 2
                    padding = dd
                else:
                    dd = 1
                    padding = 0
                downsample = nn.Sequential(
                    nn.Conv2d(self.inplanes, planes * block.expansion,
                              kernel_size=3, stride=stride, bias=False,
                              padding=padding, dilation=dd),
                    nn.BatchNorm2d(planes * block.expansion),
                )

        layers = []
        layers.append(block(self.inplanes, planes, stride,
                            downsample, dilation=dilation))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, dilation=dilation))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x_ = self.relu(x)
        x = self.maxpool(x_)

        p1 = self.layer1(x)
        p2 = self.layer2(p1)
        p3 = self.layer3(p2)
        p4 = self.layer4(p3)

        out = [x_, p1, p2, p3, p4]
        out = [out[i] for i in self.used_layers]
        if len(out) == 1:
            return out[0]
        else:
            return out

The key part is this snippet:

        if layer3:
            self.layer3 = self._make_layer(block, 256, layers[2],
                                           stride=1, dilation=2)  # 15x15, 7x7
            self.feature_size = (256 + 128) * block.expansion
        else:
            self.layer3 = lambda x: x  # identity

        if layer4:
            self.layer4 = self._make_layer(block, 512, layers[3],
                                           stride=1, dilation=4)  # 7x7, 3x3
            self.feature_size = 512 * block.expansion

So the authors modify the original ResNet-50 by setting stride = 1 in conv4/conv5 and compensating with dilated convolutions.
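The stride/dilation trade-off can be sanity-checked with the standard convolution output-size formula; a small illustrative sketch, using a 15x15 feature map as an example:

```python
def conv_out(size, kernel, stride=1, padding=1, dilation=1):
    """Output spatial size of a conv layer (same formula as nn.Conv2d)."""
    effective = dilation * (kernel - 1) + 1  # extent of the dilated kernel
    return (size + 2 * padding - effective) // stride + 1

# Original ResNet conv4/conv5: a stride-2 3x3 conv halves the feature map.
print(conv_out(15, 3, stride=2, padding=1))              # 8
# SiamRPN++: stride 1 with dilation 2 (and matching padding 2) keeps the
# size, while the dilated kernel covers a 5x5 extent to preserve the
# receptive field.
print(conv_out(15, 3, stride=1, padding=2, dilation=2))  # 15
```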

(2) Layerwise Aggregation

If you know the FPN network from object detection, layerwise aggregation will feel familiar. In general, shallow layers carry low-level information such as color and texture, while deep layers carry semantic information, just as the paper notes:

Features from earlier layers will mainly focus on low level information such as color, shape, are essential for localization, while lacking of semantic information

Fusing features across levels compensates for the weaknesses of both shallow and deep information, which helps single object tracking.

Siamese RPN++ takes the outputs of conv3, conv4, and conv5 of ResNet-50 as inputs, as shown below.

That is, each of the three outputs feeds its own RPN head; they are not fused by stacking or summation. In the code this looks like:

self.features = resnet50(**{'used_layers': [2, 3, 4]})

The modified resnet50 is used as the backbone, returning the layers with ids [2, 3, 4], i.e. the conv3, conv4, and conv5 outputs.

Features are then extracted from the template image and the detection (search) image separately:

        zf = self.features(template)
        xf = self.features(detection)

In fact, zf and xf are not single feature maps but lists, each containing three feature maps.
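The channel counts of those three maps follow from the ResNet-50 Bottleneck expansion factor of 4, and they match what the neck expects as in_channels; a small arithmetic sketch:

```python
# forward() collects out = [x_, p1, p2, p3, p4]; used_layers = [2, 3, 4]
# selects p2, p3, p4, i.e. the conv3/conv4/conv5 outputs.
expansion = 4                            # ResNet-50 Bottleneck expansion
base = {1: 64, 2: 128, 3: 256, 4: 512}   # planes per layer
used_layers = [2, 3, 4]
channels = [base[i] * expansion for i in used_layers]
print(channels)  # [512, 1024, 2048]
```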

After feature extraction, the extracted features need to be adjusted, which the code implements as:

        zf = self.neck(zf)
        xf = self.neck(xf)

where neck is defined as:

        self.neck = AdjustAllLayer(**{'in_channels': [512, 1024, 2048], 'out_channels': [256, 256, 256]})

and AdjustAllLayer as:

class AdjustLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(AdjustLayer, self).__init__()
        self.downsample = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        x = self.downsample(x)
        if x.size(3) < 20:
            l = 4
            r = l + 7
            x = x[:, :, l:r, l:r]
        return x

class AdjustAllLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(AdjustAllLayer, self).__init__()
        self.num = len(out_channels)
        if self.num == 1:
            self.downsample = AdjustLayer(in_channels[0], out_channels[0])
        else:
            for i in range(self.num):
                self.add_module('downsample'+str(i+2),
                                AdjustLayer(in_channels[i], out_channels[i]))

    def forward(self, features):
        if self.num == 1:
            return self.downsample(features)
        else:
            out = []
            for i in range(self.num):
                adj_layer = getattr(self, 'downsample'+str(i+2))
                out.append(adj_layer(features[i]))
            return out

As you can see, AdjustAllLayer applies a 1x1 convolution to each of the three backbone feature maps, bringing every feature map to 256 channels. The paper mentions this as well:

An extra 1 x 1 convolution layer is appended to each of block outputs to reduce the channel to 256

There is an interesting detail in the forward above:

    def forward(self, x):
        x = self.downsample(x)
        if x.size(3) < 20:
            l = 4
            r = l + 7
            x = x[:, :, l:r, l:r]
        return x

Why do this? The paper explains:

Since the paddings of all layers are kept, the spatial size of the template feature increases to 15, which imposes a heavy computational burden on the correlation module. Thus we crop the center 7 x 7 regions as the template feature where each feature cell can still capture the entire target region.

So the template feature map is cropped to reduce the cost of the correlation module. The details in the paper are all reflected in the code, which is why I strongly recommend reading the paper alongside the code; it makes everything much easier to understand.
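The effect of that crop can be checked in isolation. A minimal sketch (the 15x15 size comes from the paper; the batch and channel sizes are illustrative):

```python
import torch

# Template feature after the neck: 15x15 spatial size (all paddings kept).
x = torch.randn(1, 256, 15, 15)

# Same indices as AdjustLayer.forward: keep the center 7x7 region.
l = 4
r = l + 7
cropped = x[:, :, l:r, l:r]
print(cropped.shape)  # torch.Size([1, 256, 7, 7])
```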

After feature extraction and channel reduction to 256, the code proceeds with:

        cls, loc = self.head(zf, xf)

where head is defined as:

        self.head = MultiRPN(**{'anchor_num': 5, 'in_channels': [256, 256, 256], 'weighted': True})

and MultiRPN as:

class MultiRPN(RPN):
    def __init__(self, anchor_num, in_channels, weighted=False):
        super(MultiRPN, self).__init__()
        self.weighted = weighted
        for i in range(len(in_channels)):
            self.add_module('rpn'+str(i+2),
                    DepthwiseRPN(anchor_num, in_channels[i], in_channels[i]))
        if self.weighted:
            self.cls_weight = nn.Parameter(torch.ones(len(in_channels)))
            self.loc_weight = nn.Parameter(torch.ones(len(in_channels)))

    def forward(self, z_fs, x_fs):
        cls = []
        loc = []
        for idx, (z_f, x_f) in enumerate(zip(z_fs, x_fs), start=2):
            rpn = getattr(self, 'rpn'+str(idx))
            c, l = rpn(z_f, x_f)
            cls.append(c)
            loc.append(l)

        if self.weighted:
            cls_weight = F.softmax(self.cls_weight, 0)
            loc_weight = F.softmax(self.loc_weight, 0)

        def avg(lst):
            return sum(lst) / len(lst)

        def weighted_avg(lst, weight):
            s = 0
            for i in range(len(weight)):
                s += lst[i] * weight[i]
            return s

        if self.weighted:
            return weighted_avg(cls, cls_weight), weighted_avg(loc, loc_weight)
        else:
            return avg(cls), avg(loc)

(3) Depthwise Cross Correlation

Compare this with:

(1) Siamese FC, which directly convolves the featurized template with the search region, as shown in (a) of the figure above;

(2) Siamese RPN, which first lifts the template to a higher channel dimension and then convolves it with the search region, as in (b).

Siamese RPN++ instead uses a depth-wise cross correlation layer between the template and the search region, which reduces computation. The implementation:

class DepthwiseXCorr(nn.Module):
    def __init__(self, in_channels, hidden, out_channels, kernel_size=3, hidden_kernel_size=5):
        super(DepthwiseXCorr, self).__init__()
        self.conv_kernel = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=kernel_size, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.conv_search = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=kernel_size, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_channels, kernel_size=1)
        )

    def forward(self, kernel, search):
        kernel = self.conv_kernel(kernel)
        search = self.conv_search(search)
        feature = xcorr_depthwise(search, kernel)
        out = self.head(feature)  # lift the channel dimension
        return out

class DepthwiseRPN(RPN):
    def __init__(self, anchor_num=5, in_channels=256, out_channels=256):
        super(DepthwiseRPN, self).__init__()
        self.cls = DepthwiseXCorr(in_channels, out_channels, 2 * anchor_num)
        self.loc = DepthwiseXCorr(in_channels, out_channels, 4 * anchor_num)

    def forward(self, z_f, x_f):
        cls = self.cls(z_f, x_f)
        loc = self.loc(z_f, x_f)
        return cls, loc

where xcorr_depthwise is defined as:

import torch.nn.functional as F


def xcorr_depthwise(x, kernel):
    """depthwise cross correlation
    """
    batch = kernel.size(0)
    channel = kernel.size(1)
    x = x.view(1, batch*channel, x.size(2), x.size(3))
    kernel = kernel.view(batch*channel, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(x, kernel, groups=batch*channel)
    out = out.view(batch, channel, out.size(2), out.size(3))
    return out

In essence, depthwise cross correlation borrows the idea of grouped convolution, which dramatically reduces computation and is widely used in the MobileNet family of networks.
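As a quick sanity check, the function can be exercised on dummy tensors. It is redefined here so the snippet runs on its own; the 5x5 template and 29x29 search sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(x, kernel):
    """Depthwise cross correlation: each template channel is correlated
    only with its matching search channel (groups = batch * channel)."""
    batch, channel = kernel.size(0), kernel.size(1)
    x = x.view(1, batch * channel, x.size(2), x.size(3))
    kernel = kernel.view(batch * channel, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(x, kernel, groups=batch * channel)
    return out.view(batch, channel, out.size(2), out.size(3))

z = torch.randn(2, 256, 5, 5)    # template branch output (the "kernel")
x = torch.randn(2, 256, 29, 29)  # search branch output
out = xcorr_depthwise(x, z)
print(out.shape)  # torch.Size([2, 256, 25, 25]) -- channels stay at 256
```

Note that the channel count is unchanged by the correlation; only the 1x1 head afterwards maps it to 2k or 4k.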

The code also shows another difference from Siamese RPN: Siamese RPN++ lifts the channel count to 2k or 4k after the correlation (cross correlation), while Siamese RPN does so before it, which saves a large amount of computation. This is implemented in the forward of DepthwiseXCorr:

    def forward(self, kernel, search):
        kernel = self.conv_kernel(kernel)
        search = self.conv_search(search)
        feature = xcorr_depthwise(search, kernel)
        out = self.head(feature)  # lift the channel dimension
        return out

Here self.head is the channel-lifting step (to 2k or 4k), and as you can see it runs after xcorr_depthwise.
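A back-of-envelope parameter count illustrates why lifting channels after the correlation is cheaper. The numbers below are a rough sketch only (3x3 convolutions without bias, k = 5 anchors; BatchNorm and the loc branch are ignored):

```python
k = 5  # anchors per position

# SiamRPN-style UP-Xcorr: the template branch is lifted to 2k*256
# channels with a 3x3 conv BEFORE the correlation.
up_xcorr_cls = 256 * (2 * k * 256) * 3 * 3

# SiamRPN++ DW-Xcorr: both branches stay at 256 channels (3x3 convs);
# a 1x1 bottleneck plus a 1x1 conv lift to 2k AFTER the correlation.
dw_xcorr_cls = 2 * (256 * 256 * 3 * 3) + 256 * 256 + 256 * (2 * k)

print(up_xcorr_cls)  # 5898240
print(dw_xcorr_cls)  # 1247744
```

Under these assumptions the classification branch shrinks by roughly a factor of five, consistent with the paper's claim of a much lighter template branch.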

Finally, the three cls and loc outputs are fused with learned weights, as defined in the forward of MultiRPN:

    def forward(self, z_fs, x_fs):
        cls = []
        loc = []
        for idx, (z_f, x_f) in enumerate(zip(z_fs, x_fs), start=2):
            rpn = getattr(self, 'rpn'+str(idx))
            c, l = rpn(z_f, x_f)
            cls.append(c)
            loc.append(l)

        if self.weighted:
            cls_weight = F.softmax(self.cls_weight, 0)
            loc_weight = F.softmax(self.loc_weight, 0)

        def avg(lst):
            return sum(lst) / len(lst)

        def weighted_avg(lst, weight):
            s = 0
            for i in range(len(weight)):
                s += lst[i] * weight[i]
            return s

        if self.weighted:
            return weighted_avg(cls, cls_weight), weighted_avg(loc, loc_weight)
        else:
            return avg(cls), avg(loc)

That completes the Siamese RPN++ network structure. The whole forward pass can be summarized as:

    def forward(self, template, detection):
        zf = self.features(template)  # ResNet-50 feature extraction
        xf = self.features(detection)

        zf = self.neck(zf)  # reduce channels to 256
        xf = self.neck(xf)

        cls, loc = self.head(zf, xf)  # RPN heads

        return cls, loc

The overall structure diagram is shown below:

In the right-hand branches of the figure:

  • the adj blocks are the 1x1 convolutions reducing channels to 256
  • DW_Conv is the depthwise cross correlation operation
  • Box_Head is the 1x1 convolution lifting channels to 4k
  • Cls_Head is the 1x1 convolution lifting channels to 2k

This concludes the structure analysis of Siamese RPN++.

4. Summary

That wraps up the Siamese RPN++ walkthrough. Going forward I plan to speed up my paper reading and code analysis, and to lay out a learning path through single object tracking (SOT) and multi-object tracking (MOT) to deepen understanding of the field. Thanks for your support!
