RefineDet Paper Review

김상현 · December 20, 2021

Paper title: Single-Shot Refinement Neural Network for Object Detection

Introduction

Object detectors fall into two families: 1-stage detectors and 2-stage detectors. 2-stage detectors are slower, but their two steps yield more accurate classification and localization. 1-stage detectors are fast but less accurate than 2-stage detectors. In particular, the main cause of the lower accuracy of 1-stage detectors is class imbalance.

The authors state that 2-stage methods have the following three advantages over 1-stage methods.

They use a two-stage structure with sampling heuristics to handle class imbalance.
They use a two-step cascade to regress the object box parameters.
They use two-stage features to describe the objects.

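RefineDet is a new detection framework that inherits the advantages of both 1-stage and 2-stage approaches while overcoming their drawbacks. It improves on the 1-stage approach by using two inter-connected modules, the ARM (anchor refinement module) and the ODM (object detection module). Specifically, the ARM (1) identifies and removes negative anchors to reduce the search space for the classifier, and (2) coarsely adjusts the locations and sizes of anchors to provide a better initialization for the subsequent regressor. The ODM takes the refined anchors as input and performs multi-class classification and box regression. In addition, the authors design the TCB (transfer connection block), which transfers features from the ARM to the ODM.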

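The contributions of the paper are as follows.

It introduces a new 1-stage framework for object detection composed of two inter-connected modules (ARM and ODM).
To ensure effectiveness, it designs the TCB, which transfers features from the ARM to the ODM.
RefineDet achieved state-of-the-art results at the time of publication.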

Network Architecture

Figure 1. Architecture

Like SSD, RefineDet is a feed-forward convolutional network that produces a fixed number of bounding boxes, along with scores indicating the presence of different object classes in those boxes. The overall architecture is shown in Figure 1. RefineDet consists of two inter-connected modules, the ARM and the ODM. The ARM aims to remove negative anchors, which reduces the search space for the classifier, and to coarsely adjust the locations and sizes of anchors, which gives the subsequent regressor a better initialization. The ODM aims to regress accurate object locations and to predict multi-class labels based on the refined anchors. The ARM is built by removing the classification layers from a base network (VGG-16 or ResNet-101) pretrained on the ImageNet dataset and adding auxiliary structures. The ODM is built by attaching prediction layers (conv layers) to the outputs of the TCBs.

Transfer Connection Block

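Figure 2. TCB

The TCB (transfer connection block) links the ARM and the ODM. Its structure is shown in Figure 2. The TCB converts features from different layers of the ARM into the form required by the ODM. TCBs are applied to the feature maps in the ARM that are associated with anchors. To improve detection accuracy, the TCB integrates large-scale context by adding high-level features to the transferred features: a deconvolution operation enlarges the high-level feature map so that the two can be summed. A convolution layer is then added after the summation to ensure the discriminability of the features for detection.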
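The data flow described above can be sketched in PyTorch as follows. This is only a minimal illustration, not the authors' implementation: the class name `TransferConnectionBlock`, the 256 output channels, the 3x3 kernels, and the 2x2 stride-2 deconvolution are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class TransferConnectionBlock(nn.Module):
    """Hypothetical TCB sketch: transform ARM features, add upsampled high-level features."""

    def __init__(self, in_channels, out_channels=256, has_top=True):
        super().__init__()
        # Transform the ARM feature map into the form used by the ODM.
        self.pre = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )
        # Deconvolution that doubles the spatial size of the higher-level feature map.
        self.up = (
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=2, stride=2)
            if has_top
            else None
        )
        # Convolution after the summation, for discriminability of the features.
        self.post = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, arm_feat, top_feat=None):
        x = self.pre(arm_feat)
        if self.up is not None and top_feat is not None:
            # Element-wise sum with the enlarged high-level features.
            x = x + self.up(top_feat)
        return self.post(x)
```

The topmost TCB has no higher-level input, so in this sketch it would be created with `has_top=False` and called with the ARM feature map only.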

Two-Step Cascaded Regression

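RefineDet predicts the locations and sizes of objects with a two-step cascaded regression strategy. Whereas conventional 1-stage detectors rely on fixed anchors, RefineDet first uses the ARM to produce refined anchors. The refined anchors are then passed to the ODM, which enables more accurate predictions. Detection of small objects improves in particular. As Figure 1 above shows, the ODM receives the refined anchors generated by the ARM.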
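The two regression steps can be illustrated with a small pure-Python sketch. It assumes the SSD-style center-size box encoding with variances (0.1, 0.2); the function names `decode` and `two_step_regress` are hypothetical and chosen only for this example.

```python
import math


def decode(anchor, offsets, variances=(0.1, 0.2)):
    """Apply predicted offsets to a (cx, cy, w, h) box (assumed SSD-style encoding)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    new_cx = cx + dx * variances[0] * w
    new_cy = cy + dy * variances[0] * h
    new_w = w * math.exp(dw * variances[1])
    new_h = h * math.exp(dh * variances[1])
    return (new_cx, new_cy, new_w, new_h)


def two_step_regress(anchor, arm_offsets, odm_offsets):
    # Step 1: the ARM coarsely adjusts the fixed anchor.
    refined_anchor = decode(anchor, arm_offsets)
    # Step 2: the ODM regresses from the refined anchor, not from the original one.
    final_box = decode(refined_anchor, odm_offsets)
    return refined_anchor, final_box
```

The key point is that the second call to `decode` starts from the ARM output, so the ODM only has to predict a small correction.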

Negative Anchor Filtering

Negative anchor filtering removes well-classified negative anchors early and mitigates the imbalance problem. During training, if the negative confidence of a refined anchor box produced by the ARM is larger than a preset threshold $\theta$ ($\theta = 0.99$ in the paper), that anchor box is discarded when training the ODM. In other words, only hard negative and positive refined anchor boxes are passed on to train the ODM. Inference works the same way: a refined anchor box whose negative confidence exceeds $\theta$ is not used by the ODM for detection.
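The filtering rule itself is simple enough to write down directly. The sketch below is a minimal pure-Python version; the function name and the list-based inputs are assumptions for illustration.

```python
def filter_negative_anchors(refined_anchors, negative_confidences, theta=0.99):
    """Keep only the refined anchors whose negative confidence does not exceed theta."""
    kept = []
    for anchor, neg_conf in zip(refined_anchors, negative_confidences):
        if neg_conf > theta:
            # Well-classified (easy) negative: the ODM never sees it.
            continue
        kept.append(anchor)
    return kept
```

For example, with negative confidences `[0.995, 0.5, 0.99]` only the first anchor is dropped, because only it is strictly larger than the threshold.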

Training and Inference

Data Augmentation

To make the model robust, the data augmentation strategies from SSD are used. That is, the original training images are randomly expanded and cropped, with additional random distortion and flipping. See the SSD paper for details.

Backbone Network

The backbone networks are VGG-16 and ResNet-101, both pretrained on the ImageNet dataset. RefineDet can also work with other common pretrained backbone networks. As in DeepLab-LargeFOV, the authors convert fc6 and fc7 of VGG-16 into the convolution layers conv_fc6 and conv_fc7 by subsampling their parameters (with dilation). Because conv4_3 and conv5_3 have feature scales different from those of the other layers, L2 normalization is used to rescale them. The initial scaling factors are 10 for conv4_3 and 8 for conv5_3, and they are parameters learned through backpropagation. In addition, to capture high-level information and detect objects at multiple scales, extra layers are appended: conv6_1 and conv6_2 for VGG-16, and res6 for ResNet-101.
With the VGG-16 base network, the feature maps of conv4_3, conv5_3, conv_fc7, and conv6_2 are used for prediction; with the ResNet-101 base network, the feature maps of res3b3, res4b22, res5c, and res6 are used.

cf) L2 normalization
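Object detection, semantic segmentation, and similar tasks use multi-scale features to improve accuracy and to detect objects of various sizes. The problem is that the norm of the features varies with the depth of the layer. In RefineDet, for example, the feature maps from conv4_3 and conv5_3 have different norms, which can degrade detection performance. To address this, L2 normalization with a learnable scaling factor $\gamma$ is applied. The L2 norm is computed over the channel dimension at each spatial position, and a separate scaling factor is learned for each channel. The code is as follows. (code source)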

import torch
import torch.nn as nn
import torch.nn.init as init


class L2Norm(nn.Module):
    """L2 normalization over channels with a learnable per-channel scale."""

    def __init__(self, n_channels, scale):
        super().__init__()
        self.n_channels = n_channels
        self.gamma = scale  # initial value of the scaling factor
        self.eps = 1e-10    # avoids division by zero
        self.weight = nn.Parameter(torch.empty(self.n_channels))
        self.reset_parameters()

    def reset_parameters(self):
        # Every channel starts from the same scale; it is then learned by backpropagation.
        init.constant_(self.weight, self.gamma)

    def forward(self, x):
        # L2 norm over the channel dimension at each spatial position: (N, 1, H, W)
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps
        x = torch.div(x, norm)
        # Broadcast the per-channel scale as (1, C, 1, 1)
        out = self.weight.view(1, -1, 1, 1) * x
        return out

Anchors Design and Matching

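Four feature layers are used to handle objects of different scales. Each feature layer is associated with one specific anchor scale and three aspect ratios. The anchor scales across layers are chosen so that anchors of different scales have the same tiling density on the image.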
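As a rough illustration of this design, the sketch below tiles anchors over four feature layers. The concrete numbers are not stated in this review and should be read as assumptions based on my reading of the paper: total strides of 8, 16, 32, and 64 for a 320x320 input, an anchor scale of 4 times the stride, and aspect ratios 0.5, 1.0, and 2.0. The function name `make_anchors` is hypothetical.

```python
def make_anchors(image_size=320, strides=(8, 16, 32, 64), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (cx, cy, w, h) in pixels, one scale and three ratios per layer."""
    anchors = []
    for stride in strides:
        fmap = image_size // stride   # feature map side length at this layer
        scale = 4 * stride            # assumed: one scale per layer, 4 x total stride
        for row in range(fmap):
            for col in range(fmap):
                # Anchor centers sit at the center of each feature map cell.
                cx = (col + 0.5) * stride
                cy = (row + 0.5) * stride
                for ratio in ratios:
                    # ratio is taken as width / height; the area stays scale * scale.
                    w = scale * ratio ** 0.5
                    h = scale / ratio ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors
```

Under these assumptions the four layers have 40x40, 20x20, 10x10, and 5x5 cells, so `len(make_anchors())` is (1600 + 400 + 100 + 25) x 3 = 6375. Because the scale grows in proportion to the stride, every layer tiles the image with the same anchor density relative to anchor size.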
RefineDet matches anchors to ground-truth boxes using the jaccard overlap (= intersection over union). First, each ground-truth box is matched to the anchor box with the highest overlap score; then, anchor boxes are matched to any ground-truth box with which their overlap score is higher than 0.5.
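The two matching rules can be sketched in pure Python as below. Boxes are assumed to be in corner format `(x1, y1, x2, y2)`, and the names `jaccard` and `match_anchors` are hypothetical; this is an illustration of the rule, not the reference implementation.

```python
def jaccard(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gts, threshold=0.5):
    """Return, for each anchor, the index of its matched ground truth or -1 (negative)."""
    assignment = [-1] * len(anchors)
    # Threshold rule: an anchor is matched to its best ground truth if IoU > threshold.
    for a_idx, anchor in enumerate(anchors):
        best_gt, best_iou = -1, threshold
        for g_idx, gt in enumerate(gts):
            iou = jaccard(anchor, gt)
            if iou > best_iou:
                best_gt, best_iou = g_idx, iou
        assignment[a_idx] = best_gt
    # Best-anchor rule: every ground truth keeps its highest-overlap anchor,
    # applied last so that it overrides the threshold rule.
    for g_idx, gt in enumerate(gts):
        best_anchor = max(range(len(anchors)), key=lambda i: jaccard(anchors[i], gt))
        assignment[best_anchor] = g_idx
    return assignment
```

The best-anchor rule guarantees that every ground-truth box gets at least one positive anchor, even when no anchor overlaps it by more than 0.5.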

Hard Negative Mining

After the matching step, most anchor boxes in the ODM are still negatives, even though the ARM has already removed the easy negative anchors. As in SSD, hard negative mining is used to mitigate the foreground-background class imbalance. That is, the negatives with the highest loss values are selected so that the ratio of negatives to positives is at most 3:1.
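A minimal sketch of this selection, assuming per-anchor loss values and a boolean positive mask are already available (the function name and argument layout are hypothetical):

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return a boolean mask selecting all positives and the hardest negatives."""
    num_pos = sum(is_positive)
    max_neg = neg_pos_ratio * num_pos
    # Sort negative anchors by loss, largest (hardest) first.
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: losses[i], reverse=True)
    hard_negatives = set(negatives[:max_neg])
    return [pos or (i in hard_negatives) for i, pos in enumerate(is_positive)]
```

With one positive and five negatives, for instance, only the three negatives with the largest loss contribute to training; the remaining two are ignored.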

Loss Function

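The loss function of RefineDet consists of two parts, the ARM loss and the ODM loss. In the ARM, each anchor is assigned a binary label, and its location and size are regressed at the same time to obtain the refined anchor. The refined anchors whose negative confidence is below the threshold are then passed to the ODM, which predicts the object categories and the accurate locations and sizes.
The loss function is given by the following equation.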

$$
L(\{p_i\}, \{x_i\}, \{c_i\}, \{t_i\}) = \frac{1}{N_{arm}} \Big( \sum_i L_b(p_i, [l_i^* \ge 1]) + \sum_i [l_i^* \ge 1] L_r(x_i, g_i^*) \Big) + \frac{1}{N_{odm}} \Big( \sum_i L_m(c_i, l_i^*) + \sum_i [l_i^* \ge 1] L_r(t_i, g_i^*) \Big)
$$

$i$ : the index of anchor in a mini-batch
$l_i^*$ : ground truth class label of anchor $i$
$g_i^*$ : ground truth location and size of anchor $i$
$p_i, x_i$ : predicted confidence of the anchor $i$ being an object and refined coordinates of the anchor $i$ in the ARM
$c_i, t_i$ : predicted object class and coordinates of the bounding box in the ODM
$N_{arm}, N_{odm}$ : the numbers of positive anchors in the ARM and ODM, respectively

$L_b$ : Binary classification loss
$L_m$ : Multi-class classification loss
$L_r$ : Regression loss (smooth L1 loss)

Smooth L1 loss is explained in detail in Fast R-CNN (review). $[l_i^* \ge 1]$ is an indicator function that is 1 when the anchor is not negative and 0 otherwise. If $N_{arm}$ or $N_{odm}$ is 0, the loss of the corresponding module is set to 0.
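To make the structure of the loss concrete, the following pure-Python sketch computes the loss of a single module (ARM or ODM). It assumes that the per-anchor classification losses ($L_b$ or $L_m$) have already been computed and that the regression targets are plain 4-tuples; the names `smooth_l1` and `module_loss` are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def module_loss(cls_losses, pred_boxes, gt_boxes, labels):
    """Classification + regression loss of one module, normalized by the positives.

    cls_losses[i] is the classification loss of anchor i; labels[i] >= 1 marks a positive.
    """
    num_pos = sum(1 for label in labels if label >= 1)
    if num_pos == 0:
        # No positive anchors: the module contributes no loss.
        return 0.0
    cls_term = sum(cls_losses)
    # The indicator [l_i >= 1] restricts the regression loss to positive anchors.
    reg_term = sum(
        smooth_l1(p - g)
        for pred, gt, label in zip(pred_boxes, gt_boxes, labels)
        if label >= 1
        for p, g in zip(pred, gt)
    )
    return (cls_term + reg_term) / num_pos
```

The total loss is then the sum of `module_loss` evaluated for the ARM (with binary labels and $x_i$) and for the ODM (with class labels and $t_i$).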

Optimization

The paper covers weight initialization, weight decay, the optimizer, and related settings. See the paper for details.

Inference

At inference time, the ARM first generates the refined anchors, and the ODM then outputs the top 400 high-confidence detections per image. Finally, non-maximum suppression is applied with a jaccard overlap threshold of 0.45, and the top 200 high-confidence detections are retained as the final result.
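The post-processing steps can be sketched as a greedy NMS in pure Python. This is a simplified single-class illustration with boxes in corner format `(x1, y1, x2, y2)`; the function names and the exact order in which the top-k cuts are applied are assumptions.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, iou_threshold=0.45, pre_nms_top_k=400, keep_top_k=200):
    """Return indices of the detections kept after top-k selection and greedy NMS."""
    # Keep only the most confident detections before NMS.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_k]
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept[:keep_top_k]
```

In the example of two heavily overlapping boxes and one separate box, the lower-scoring box of the overlapping pair is suppressed and the other two are returned.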

Experiments

The paper reports experiments on several benchmarks (PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO). The results are as follows.

Table 1. Experiments on PASCAL VOC

Table 2. Experiments on MS COCO

These results show that RefineDet achieves both high speed and strong detection accuracy compared with previous methods.
cf) Plus models (e.g., RefineDet320+): use a multi-scale testing strategy for a fair comparison.

Table 3. Ablation study

Table 3 above shows that the components proposed in RefineDet are effective in improving accuracy.

Conclusions

In this paper, the authors introduce RefineDet, a single-shot refinement neural network. RefineDet consists of two inter-connected modules, the ARM and the ODM. It combines the strengths of 2-stage and 1-stage detectors while addressing their weaknesses. The model achieved state-of-the-art detection accuracy at the time, with high efficiency, on several benchmarks.

References

RefineDet paper
ParseNet paper
DeepLab paper
RefineDet PyTorch implementation
