[2021.10.05] Transformer in Object Detection

Seryoung · October 7, 2021

Boostcamp AI Tech Level2 P-stage Object Detection


💡 Advanced Object Detection 1 Lecture
Further Development in 2-Stage Detectors

Transformer

  • Solved the long-range dependency problem in NLP -> also applied to vision

Self Attention
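
As a quick refresher, here is a minimal sketch of scaled dot-product self-attention in PyTorch; the tensor names, shapes, and projection matrices are illustrative only.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, dim); w_q/w_k/w_v: (dim, dim) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # pairwise similarity, scaled
    attn = scores.softmax(dim=-1)                            # attention weights over all positions
    return attn @ v                                          # weighted sum of values

x = torch.randn(1, 49, 64)
w = [torch.randn(64, 64) for _ in range(3)]
print(self_attention(x, *w).shape)                           # torch.Size([1, 49, 64])
```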

Vision Transformer (ViT)

Overview

  1. Flatten 3D -> 2D (split the image into patches)
  2. Apply a learnable embedding to each patch
  3. Add class & positional embeddings
  4. Transformer
  5. Predict (the whole pipeline is sketched below)
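
A rough sketch of steps 1-5 in PyTorch; the module names, patch size, and embedding dimension below are illustrative choices, not the exact ViT configuration.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, num_classes=10):
        super().__init__()
        n = (img // patch) ** 2                                # number of patches
        self.patch = patch
        self.proj = nn.Linear(patch * patch * 3, dim)          # 2. learnable patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # 3. class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))    # 3. positional embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # 4. Transformer encoder
        self.head = nn.Linear(dim, num_classes)                # 5. prediction head

    def forward(self, x):                                      # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # 1. flatten the 3D image into a 2D sequence of P*P*3 patch vectors
        x = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)
        x = torch.cat([self.cls.expand(B, -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                              # classify from the class token

print(TinyViT()(torch.randn(2, 3, 224, 224)).shape)            # torch.Size([2, 10])
```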

End-to-End Object Detection with Transformers (DETR)

Contribution

  • First work to apply a Transformer to object detection
  • Replaces the hand-crafted post-processing step (e.g., NMS) of existing object detectors with the transformer

Architecture

  • CNN backbone -> Transformer (Encoder-Decoder) -> Prediction Heads (see the sketch below)
  • Only the highest-level feature map is used (attention on larger maps would cost too much computation)
  • Flatten 2D
  • Positional embedding
  • Encoder
    • 224 x 224 input
    • 7x7 feature map size
    • 49 feature vectors -> encoder input (the 7x7 map is flattened)
  • Decoder
  • Feed Forward Network (FFN)
  • N outputs, where N > the number of objects in one image
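
A rough sketch of this pipeline; the ResNet-50 backbone, 256-dim features, and 100 object queries below are illustrative assumptions, and the real DETR injects positional encodings inside every attention layer rather than once at the input.

```python
import torch
import torch.nn as nn
import torchvision

class TinyDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, dim=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)   # torchvision >= 0.13 API
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # CNN backbone -> highest-level map
        self.input_proj = nn.Conv2d(2048, dim, kernel_size=1)  # reduce channels to the transformer dim
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.query_embed = nn.Parameter(torch.zeros(num_queries, dim))  # N learnable object queries
        self.pos_embed = nn.Parameter(torch.zeros(49, dim))    # positional embedding, 7x7 map assumed
        self.class_head = nn.Linear(dim, num_classes + 1)      # +1 class for "no object"
        self.bbox_head = nn.Linear(dim, 4)                     # (cx, cy, w, h)

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        f = self.input_proj(self.backbone(x))                  # (B, dim, 7, 7)
        src = f.flatten(2).transpose(1, 2) + self.pos_embed    # flatten 2D: 49 feature vectors
        tgt = self.query_embed.unsqueeze(0).expand(x.size(0), -1, -1)   # decoder input: N queries
        hs = self.transformer(src, tgt)                        # encoder-decoder
        return self.class_head(hs), self.bbox_head(hs).sigmoid()  # FFN heads: N class + box outputs
```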

Train

  • The ground truth is padded with "no object" entries up to the number of predictions N
  • Ground truth and predictions are matched 1:1 (N:N mapping; see the matching sketch below)
  • Each of the N predictions is unique -> no post-processing (NMS) is needed
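
The 1:1 matching between the padded ground truth and the N predictions can be computed with the Hungarian algorithm. A minimal sketch for a single image, assuming a toy matching cost of class probability plus L1 box distance (the actual DETR cost also includes a generalized-IoU term):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Match N predictions to M ground-truth objects 1:1 (one image, M <= N).
    pred_logits: (N, num_classes+1), pred_boxes: (N, 4), gt_labels: (M,), gt_boxes: (M, 4)
    """
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                      # (N, M): negative prob of the GT class
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): L1 box distance
    cost = cost_bbox + cost_class                         # toy cost; DETR also adds a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx   # every GT gets exactly one prediction; the rest count as "no object"
```

Because each ground-truth box is assigned to exactly one prediction, duplicate detections are penalized during training and no NMS is needed at inference.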

Swin Transformer

Problems with ViT

  • Requires a large amount of training data
  • High computational cost
  • Hard to use as a backbone

Solution

  • Designed with a CNN-like structure
  • Window-based attention -> reduced cost

Architecture

  • Patch Partitioning
  • Linear Embedding
  • Swin Transformer Block
    • Window Multi-head Attention
  • Patch Merging

Patch Partition

$(H, W, 3) \to (H/P, W/P, P \times P \times 3)$
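
A minimal sketch of this patch-partition reshape in PyTorch, assuming channel-last tensors and a patch size of P = 4:

```python
import torch

def patch_partition(x, p=4):
    """(B, H, W, 3) -> (B, H/p, W/p, p*p*3): group each p x p patch into one vector."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // p, p, W // p, p, C)
    x = x.permute(0, 1, 3, 2, 4, 5)                 # (B, H/p, W/p, p, p, C)
    return x.reshape(B, H // p, W // p, p * p * C)

x = torch.randn(1, 224, 224, 3)
print(patch_partition(x).shape)                     # torch.Size([1, 56, 56, 48])
```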

Linear Embedding

  • Same embedding method as in ViT
  • The class embedding used in ViT is removed

Swin Transformer Block

  • Attention is applied twice per block (W-MSA, then SW-MSA)

Window Multi-Head Attention (W-MSA)

  • The embeddings are split into windows
  • Self-attention is performed only within each window (sketched below)
  • The computational cost can be controlled by the window size
  • Attention stays inside each window -> limited receptive field
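
A minimal sketch of window partitioning followed by attention restricted to each window, assuming a 7x7 window and a 96-dim embedding; the real Swin block also adds a relative position bias inside the attention.

```python
import torch
import torch.nn as nn

def window_partition(x, m):
    """(B, H, W, C) -> (B * num_windows, m*m, C): attention is computed per window."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // m, m, W // m, m, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, m * m, C)

# W-MSA: self-attention only inside each m x m window -> cost scales with m, not with H*W
attn = nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)
x = torch.randn(1, 56, 56, 96)                      # feature map after linear embedding
windows = window_partition(x, m=7)                  # (64, 49, 96): 8x8 windows of 7x7 tokens
out, _ = attn(windows, windows, windows)            # attention never crosses window borders
```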

Shifted Window Multi-Head Attention (SW-MSA)

  • Performed in the second layer of the transformer block to fix the limited receptive field of W-MSA (sketched below)
  • The leftover regions (A, B, C) created by the shift are moved (cyclic shift)
  • The leftover regions are masked -> no self-attention is computed across them
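A minimal sketch of the cyclic shift that precedes the second (SW-MSA) attention, using torch.roll; the attention mask that keeps the wrapped-around regions from attending to each other is omitted here.

```python
import torch

def shifted_windows(x, m=7):
    """Cyclic shift by m//2 before window partitioning (mask over wrapped regions omitted)."""
    shift = m // 2
    # roll the map so the new windows straddle the borders of the previous (W-MSA) windows
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))   # (B, H, W, C)
    B, H, W, C = shifted.shape
    win = shifted.reshape(B, H // m, m, W // m, m, C).permute(0, 1, 3, 2, 4, 5)
    return win.reshape(-1, m * m, C)   # in real Swin, the wrapped regions are masked in attention

print(shifted_windows(torch.randn(1, 56, 56, 96)).shape)            # torch.Size([64, 49, 96])
```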

Patch Merging

$(H, W, C) \to (H/2, W/2, 4C) \to (H/2, W/2, 2C)$
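
A minimal sketch of patch merging: each 2x2 group of neighboring patches is concatenated (4C channels) and projected back down to 2C with a linear layer; the dimensions below are illustrative.

```python
import torch
import torch.nn as nn

def patch_merging(x, reduction):
    """(B, H, W, C) -> (B, H/2, W/2, 2C): downsample by merging 2x2 groups of patches."""
    x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                   x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)     # (B, H/2, W/2, 4C)
    return reduction(x)                                             # linear projection 4C -> 2C

C = 96
reduction = nn.Linear(4 * C, 2 * C, bias=False)
x = torch.randn(1, 56, 56, C)
print(patch_merging(x, reduction).shape)            # torch.Size([1, 28, 28, 192])
```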

Summary

  • Can be trained with less data
  • Window-based attention -> reduced computation cost
  • CNN-like structure -> can be used as a backbone

