๐Ÿ˜„ Lecture 07. | Training Neural Networks II

๋ฐฑ๊ฑด · January 21, 2022

Stanford University CS231n.ย 

๋ชฉ๋ก ๋ณด๊ธฐ
5/6

๋ณธ ๊ธ€์€ Hierachical Structure์˜ ๊ธ€์“ฐ๊ธฐ ๋ฐฉ์‹์œผ๋กœ, ๊ธ€์˜ ์ „์ฒด์ ์ธ ๋งฅ๋ฝ์„ ํŒŒ์•…ํ•˜๊ธฐ ์‰ฝ๋„๋ก ์ž‘์„ฑ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
๋˜ํ•œ ๋ณธ ๊ธ€์€ CSF(Curation Service for Facilitation)๋กœ ์ธ์šฉ๋œ(์ฐธ์กฐ๋œ) ๋ชจ๋“  ์ถœ์ฒ˜๋Š” ์ƒ๋žตํ•ฉ๋‹ˆ๋‹ค.

[Summary] Stanford University CS231n. Lecture 07. | Training Neural Networks II

1. CONTENTS


1.1 Table

| Status | Lecture | Description | Links |
| --- | --- | --- | --- |
| In progress | Lecture 01 | Introduction to Convolutional Neural Networks for Visual Recognition | video · slide · subtitle |
| In progress | Lecture 02 | Image Classification | video · slide · subtitle |
| In progress | Lecture 03 | Loss Functions and Optimization | video · slide · subtitle |
| Done | Lecture 04 | Introduction to Neural Networks | video · slide · subtitle |
| Done | Lecture 05 | Convolutional Neural Networks | video · slide · subtitle |
| Done | Lecture 06 | Training Neural Networks I | video · slide · subtitle |
| Done | Lecture 07 | Training Neural Networks II | video · slide · subtitle |
| In progress | Lecture 08 | Deep Learning Software | video · slide · subtitle |
| In progress | Lecture 09 | CNN Architectures | video · slide · subtitle |
| In progress | Lecture 10 | Recurrent Neural Networks | video · slide · subtitle |
| In progress | Lecture 11 | Detection and Segmentation | video · slide · subtitle |
| In progress | Lecture 12 | Visualizing and Understanding | video · slide · subtitle |
| In progress | Lecture 13 | Generative Models | video · slide · subtitle |
| In progress | Lecture 14 | Deep Reinforcement Learning | video · slide · subtitle |
| In progress | Lecture 15 | Invited Talk: Song Han, Efficient Methods and Hardware for Deep Learning | video · slide · subtitle |
| In progress | Lecture 16 | Invited Talk: Ian Goodfellow, Adversarial Examples and Adversarial Training | video · slide · subtitle |

2. Flow

2.1 ๋‰ด๋Ÿฐ์˜ ๊ตฌ์กฐ์— ๋Œ€ํ•œ ์ดํ•ด

2.2 ๋‰ด๋Ÿฐ์˜ ์•„ํ‚คํ…์ณ

  • ์„ ํ˜• ํ•จ์ˆ˜์™€ ๋น„์„ ํ˜• ํ•จ์ˆ˜์˜ ํ•ฉ์œผ๋กœ ์ด์–ด์ง
  • ๋ณดํ†ต Sigmoid ํ•จ์ˆ˜์™€ ReLUํ•จ์ˆ˜๋ฅผ ๋งŽ์ด ์‚ฌ์šฉ

2.3 Fully Connected Layer

  • Every neuron in one layer is connected to every neuron in the next layer.

2.4 Convolutional Neural Networks

  • Has layers that preserve the spatial structure of the input
  • A filter is created, and the filter slides over the image computing dot products
  • Each filter captures one feature
  • For example,
    - Using a 3x3x3 filter to extract features from a 5x5x3 image yields a 3x3x1 convolved feature!

    - cf. Convolutional neural networks
    • Convolving a 32x32x3 input volume (image) with a 5x5x3 filter yields a 28x28x1 activation map
    • Here,
      • Stride is how many pixels the filter moves per step; the filter must not slide past the edge of the image
        • Zero-padding is applied during convolution to keep the activation map the same size as the input
        • Pooling is used when we want to forcibly downsample: it reduces the number of parameters without affecting the depth; max pooling is used most often
        • Pooling is usually set up so that every pixel takes part in the computation exactly once (just set the window size and the stride to the same value)
    • Formula that determines the feature map size (the result must be a whole number; a small Python helper follows at the end of this section)
      • The height and width of the feature map should be multiples of the pooling size
      • Input height: H / width: W
        • Filter height: FH / width: FW
        • Stride: S
        • Padding size: P
        • Output height = (H + 2P - FH) / S + 1, and likewise output width = (W + 2P - FW) / S + 1
    • cf. An example CNN configuration
  • Example of a CNN deep learning model
    - Stacking several convolution layers and putting a fully connected layer at the end gives a CNN deep learning model
  • Now let's look at what needs to be considered when training such a neural network
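
As a quick sanity check of the output-size formula above, here is a minimal Python helper (the function name is mine, not from the lecture):

```python
def conv_output_size(h, w, fh, fw, stride=1, pad=0):
    # output size = (input + 2 * padding - filter) / stride + 1
    oh = (h + 2 * pad - fh) / stride + 1
    ow = (w + 2 * pad - fw) / stride + 1
    # the result must be a whole number, otherwise the settings are invalid
    assert oh.is_integer() and ow.is_integer(), "invalid conv hyperparameters"
    return int(oh), int(ow)

# the lecture's example: 32x32 input, 5x5 filter, stride 1, no padding -> 28x28
print(conv_output_size(32, 32, 5, 5))  # (28, 28)
```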

2.5 Things to Consider When Training a Neural Network

2.5.1 What is Neural Network Training?

  • We learned about the Gradient Descent Algorithm as a way to optimize the network parameters.
  • Applying the gradient descent algorithm to all of the data at once is computationally expensive, so the SGD (Stochastic Gradient Descent) Algorithm is used instead
  • SGD draws a sample (a minibatch) and applies the gradient descent algorithm to it (a loop sketch follows below)
  • Things to think about here:

Q1. How should we choose the model?
Q2. What should we watch out for during training?
Q3. How should we evaluate?

2.5.2 Activation Functions

2.5.2.1 Sigmoid Function

  • ์ถœ๋ ฅ์ด 0~1 ์‚ฌ์ด์˜ ๊ฐ’์ด ๋‚˜์˜ค๋„๋ก ํ•˜๋Š” ์„ ํ˜• ํ•จ์ˆ˜
  • ๋‹จ์ 
    - Saturated neurons๊ฐ€ Gradient๊ฐ’์„ 0์œผ๋กœ ๋งŒ๋“ ๋‹ค.
    - ์›์  ์ค‘์‹ฌ์ด ์•„๋‹ˆ๋‹ค.
    - ์ง€์ˆ˜ํ•จ์ˆ˜๊ฐ€ ๊ณ„์‚ฐ๋Ÿ‰์ด ๋งŽ๋‹ค.
    - cfcf
    - Saturate : โ€˜ํฌํ™”โ€™๋ผ๊ณ  ํ•ด์„์„ ํ•˜๋Š”๋ฐ, ์ž…๋ ฅ์ด ๋„ˆ๋ฌด ์ž‘๊ฑฐ๋‚˜ ํด ๊ฒฝ์šฐ ๊ฐ’์ด ๋ณ€ํ•˜์ง€ ์•Š๊ณ  ์ผ์ •ํ•˜๊ฒŒ 1๋กœ ์ˆ˜๋ ดํ•˜๊ฑฐ๋‚˜ 0์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š” ๊ฒƒ์„ ํฌํ™”๋ผ๊ณ  ์ƒ๊ฐํ•˜๊ณ  Gradient์˜ ๊ฐ’์ด 0์ธ ๋ถ€๋ถ„์„ ์˜๋ฏธ
    • Gradient๊ฐ€ 0์ด ๋˜๋Š” ๊ฒƒ์ด ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ด์œ 
      - Chain Rule ๊ณผ์ •์„ ์ƒ๊ฐํ–ˆ์„ ๋•Œ, Global gradient๊ฐ’์ด 0์ด ๋˜๋ฉด ์ฆ‰ ๊ฒฐ๊ณผ ๊ฐ’์ด 0์ด ๋˜๋ฉด local gradient ๊ฐ’๋„ 0์ด ๋œ๋‹ค. ๋”ฐ๋ผ์„œ Input์— ์žˆ๋Š” gradient ๊ฐ’์„ ๊ตฌํ•  ์ˆ˜ ์—†๋‹ค.
      • ์›์  ์ค‘์‹ฌ์ด ์•„๋‹Œ ๊ฒƒ์ด ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ด์œ 
        - output์˜ ๊ฐ’์ด ํ•ญ์ƒ ์–‘์ˆ˜๋ฉด ๋‹ค์Œ input์œผ๋กœ ๋“ค์–ด๊ฐ”์„ ๋•Œ๋„ ํ•ญ์ƒ ์–‘์ˆ˜์ด๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ๋‹ค์Œ layer์—์„œ wห‰\bar w์˜ ๊ฐ’์„ updateํ•  ๋•Œ ํ•ญ์ƒ ๊ฐ™์€ ๋ฐฉํ–ฅ์œผ๋กœ update๊ฐ€ ๋œ๋‹ค. ๋‹ค์Œ ๊ทธ๋ฆผ์˜ ์˜ˆ๋กœ ์„ค๋ช…ํ•˜๋ฉด ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” vector๊ฐ€ ํŒŒ๋ž€์ƒ‰์ผ ๋•Œ, wห‰\bar w์˜ ๊ฐ™์€ ๊ฒฝ์šฐ ์ œ 1์‚ฌ๋ถ„๋ฉด๊ณผ ์ œ 3์‚ฌ๋ถ„๋ฉด์œผ๋กœ update๊ฐ€ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ update๋ฅผ ํ•˜๊ธฐ ํž˜๋“ค๋‹ค.
        
  • Sigmoid์ด ์›์ ์ค‘์‹ฌ์ด ์•„๋‹Œ ๊ฒƒ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ๋‚˜์˜จ ํ•จ์ˆ˜๊ฐ€ ๋ฐ”๋กœ tanh(x)tanh(x)

2.5.2.2 tanh(x)

  • Zero-centered, but when the neuron saturates, the gradient still becomes 0.
  • So a new activation function was needed: ReLU

2.5.2.3 ReLU

  • Characteristics
    - Does not saturate in the (+) region,
    • and it is much faster to compute than sigmoid/tanh because it is a simple element-wise operation
  • Drawback
    - Negative values are set to 0, so only about half of the data is ever activated
  • To address this drawback, Leaky ReLU, the Exponential Linear Unit, and Maxout were introduced

2.5.2.4 Leaky ReLU

2.5.2.5 Exponential Linear Unit

2.5.2.6 Maxout

  • Using this function requires twice as many parameters as the original function (a sketch of all these activations follows below)
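
For reference, a minimal NumPy sketch of the activations discussed in this section (the 0.01 leak slope and alpha=1.0 are common defaults, assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # saturates at 0/1, not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0.0, x)              # cheap; no saturation for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps negatives alive

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, w1, b1, w2, b2):
    # max of two linear functions, hence twice the parameters
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```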

2.5.3 Data Preprocessing

  • Typical preprocessing steps are zero-centering, normalization, PCA, and whitening.
  • Zero-centering and normalization are done so that every dimension lies in the same range and can contribute equally
  • PCA and whitening are more like projecting onto a lower-dimensional space; in image processing these preprocessing steps are usually skipped.
  • For images, basically only the zero-centering step is applied

  • In a real model, the mean computed from the train data is applied identically to the test data (see the sketch below)

2.5.4 Weight Initialization

  • What initial values should we choose to end up with the best model?
  • If the initial values are all 0, every neuron will do exactly the same thing; that is, every gradient will be identical. There is no point in doing this.
  • So the first idea is to initialize with 'small random numbers'.
  • The initial weights are sampled from a standard normal distribution (scaled to be small).
  • This works well in shallow networks, but problems appear when the network gets deep.
  • This is because the deeper the network, the more the activations shrink toward 0, since the weights are too small.
  • What happens if we increase the standard deviation instead?
    - The activation values become extreme, and all of the gradients converge to 0
  • For this initialization problem, a paper proposed 'Xavier initialization': under the assumption that the activation function is linear, the weights are initialized with a formula of the following form (see the sketch below)
  • Using this formula, the variance of the input and output can be matched
  • But when the activation function is ReLU, half of the outputs are killed and half the variance is lost, so this formula no longer holds
  • When the activation function is ReLU, He Initialization is usually used instead
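
A minimal NumPy sketch of the three schemes above, following the form used in the CS231n slides:

```python
import numpy as np

fan_in, fan_out = 512, 256

# small random numbers: fine for shallow nets, activations vanish in deep ones
W_small = 0.01 * np.random.randn(fan_in, fan_out)

# Xavier initialization: matches input/output variance,
# derived assuming a (near-)linear activation such as tanh around 0
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization: the extra factor of 2 compensates for ReLU
# zeroing out half of the activations
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```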

2.5.5 Batch Normalization

  • If we want unit gaussian activations, let's just make them that way directly!

  • We add a step to the model that normalizes the activations using the mean and variance computed over the current batch.

  • This cancels the bad scaling effect that builds up as the weights are multiplied layer after layer.

  • But is forcing everything to a unit gaussian always a good thing? To add some flexibility, the normalized values can be scaled and shifted again with a learned scale and shift, so the normalization can be relaxed where needed.

  • The Batch Normalization algorithm given in the paper is as follows (a forward-pass sketch follows below).
  • Looking at the key properties of Batch Normalization:

    - It can also play the role of regularization (it can help prevent overfitting).
    - It reduces the dependence on weight initialization.
    - At test time the minibatch mean and standard deviation cannot be computed, so a fixed mean and std, obtained as moving averages during training, are used instead.
    - It can improve the training speed.

2.5.6 How to Design the Training Process

  • The first thing to consider is the data preprocessing step.

  • The second is choosing which architecture to use.

  • Then we should check how the loss comes out when the weights are small values (a sanity check; see the snippet at the end of this section).

  • Next, take a small amount of training data and check that the loss actually drops properly.

  • There are many hyperparameters, and the one to consider first is the learning rate.

  • If the cost does not decrease during training, suspect that the learning rate may be too small.

  • Note that when a softmax loss is used, the weights change gradually but the accuracy can jump suddenly; this still means training is heading in the right direction.

  • If the cost grows so large that it diverges, suspect that the learning rate may be too large, and keep adjusting the value.

2.5.7 Hyperparameter Optimization

  • When building a deep learning model, there are a great many hyperparameters to consider.

  • Usually we train on the training set and evaluate on the validation set.

  • If we change a hyperparameter and the updated cost grows to more than 3x the original cost, stop and try a different parameter value.

  • Finding hyperparameter values through repeated trial and error is one method, but when time is short, this way of searching has its limits.

  • So two methods were proposed: Grid Search vs. Random Search.

    - Grid Search selects candidate hyperparameter values at regular intervals within the target search range, records the measured performance for each of them, and then picks the hyperparameter value that performed best.

    - Random Search is similar in overall spirit to Grid Search, but differs in that the candidate hyperparameter values within the search range are selected by random sampling. Compared to Grid Search, Random Search greatly reduces the number of unnecessary runs, and at the same time it can probabilistically explore values that fall between the grid points, so it is known to find the optimal hyperparameter values faster.

Therefore, in practice, random search is known to be the better method.

  • In practice, Hyperparameter Optimization proceeds as follows (see the sketch at the end of this section).

    1. Set a range for each hyperparameter value.
    2. Randomly sample parameter values within the range set in step 1.
    3. Evaluate using the validation set.
    4. Repeat a certain number of times, look at the accuracies, and narrow the hyperparameter range.
  • When choosing hyperparameters, we often judge whether a given hyperparameter is suitable by looking at the loss curve.

  • If the loss curve is flat early on, there is a good chance the initialization is wrong.

  • And if the gap between training accuracy and validation accuracy is large, it is very likely that the model has overfit.

  • If there is no gap, consider increasing the model capacity; it may also mean that the training dataset was too small.

2.5.8 Various Optimization Techniques

  • An overview of various optimization algorithms
    The optimization technique we have learned so far is the SGD Algorithm.
    Briefly, SGD works as follows:

1. Compute the loss of the data inside the mini batch
2. Update in the direction opposite to the gradient.
3. Keep repeating steps 1 and 2.

But the SGD Algorithm has some problems:

  • Loss์˜ ๋ฐฉํ–ฅ์ด ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ๋งŒ ๋น ๋ฅด๊ฒŒ ๋ฐ”๋€Œ๊ณ  ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ๋Š” ๋Š๋ฆฌ๊ฒŒ ๋ฐ”๋€๋‹ค๋ฉด ์–ด๋–ป๊ฒŒ ๋  ๊ฒƒ์ธ๊ฐ€?

์ด๋ ‡๊ฒŒ ๋ถˆ๊ท ํ˜•ํ•œ ๋ฐฉํ–ฅ์ด ์กด์žฌํ•œ๋‹ค๋ฉด SGD๋Š” ์ž˜ ๋™์ž‘ํ•˜์ง€ ์•Š๋Š”๋‹ค.

  • Local minima๋‚˜ saddle point์˜ ๋น ์ง€๋ฉด ์–ด๋–ป๊ฒŒ ๋  ๊ฒƒ์ธ๊ฐ€?

์ตœ์†Ÿ๊ฐ’์ด ๋” ์žˆ๋Š”๋ฐ local minima์— ๋น ์ ธ์„œ ๋‚˜์˜ค์ง€ ๋ชปํ•˜๊ฑฐ๋‚˜,
๊ธฐ์šธ๊ธฐ๊ฐ€ ์™„๋งŒํ•œ ๊ตฌ๊ฐ„์—์„œ update๊ฐ€ ์ž˜ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค.

  • Minibatches์—์„œ gradient์˜ ๊ฐ’์ด ๋…ธ์ด์ฆˆ ๊ฐ’์— ์˜ํ•ด ๋งŽ์ด ๋ณ€ํ•  ์ˆ˜ ์žˆ๋‹ค.

๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๊ผฌ๋ถˆ๊ผฌ๋ถˆํ•œ ํ˜•ํƒœ๋กœ gradient ๊ฐ’์ด update ๋  ์ˆ˜ ์žˆ๋‹ค.

์œ„์™€ ๊ฐ™์€ ๋ฌธ์ œ์ ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ Momentum์ด๋ผ๋Š” ๊ฐœ๋…์„ ๋„์ž…ํ•œ๋‹ค.

Momentum์ด๋ž€ ์ž๊ธฐ๊ฐ€ ๊ฐ€๊ณ ์ž ํ•˜๋Š” ๋ฐฉํ–ฅ์˜ ์†๋„๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ gradient update๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ๋งํ•œ๋‹ค.

  • momentum์„ ์ถ”๊ฐ€ํ•˜๋Š”๋ฐ ๊ธฐ์กด์— ์žˆ๋Š” momentum๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ ์ˆœ์„œ๋ฅผ ๋ฐ”๊พธ์–ด update๋ฅผ ์‹œ์ผœ์ฃผ๋Š” ๋ฐฉ๋ฒ•๋„ ์žˆ๋Š”๋ฐ Nesterov Momentum์ด๋ผ๊ณ  ํ•œ๋‹ค.

  • ์‹์˜ ์˜๋ฏธ๋ฅผ ์ž˜ ์ดํ•ดํ•˜์ง„ ๋ชปํ–ˆ์ง€๋งŒ,
    ๊ฐ•์˜์—์„œ๋Š” ํ˜„์žฌ / ์ด์ „์˜ velocity ๊ฐ„์˜ ์—๋Ÿฌ ๋ณด์ •์ด ์ถ”๊ฐ€๋˜์—ˆ๋‹ค๊ณ  ์„ค๋ช…ํ–ˆ๋‹ค.

๊ธฐ์กด์˜ SGD, SGD+Momentum, Nesterov์˜ ๊ฒฐ๊ณผ๊ฐ’์„ ํ•œ ๋ฒˆ ๋น„๊ตํ•ด๋ณด๋ฉด,

์ข€ ๋” Robustํ•˜๊ฒŒ algorithm์ด ์ž‘๋™ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

A method was also proposed that uses a grad squared term instead of the
velocity term to update the gradient;

this method is called AdaGrad.

AdaGrad was proposed as a way to set the learning rate effectively.

Adding the grad squared term lets the step size be tailored to each individual parameter.

If we keep updating in this way,
progress accelerates along dimensions with small gradients,
and decelerates along dimensions with large gradients.

And as time goes on, the step size keeps shrinking.

One more thing was added to this method: a decay_rate variable that lets
the step speed up or slow down;

this method is called RMSProp.

RMSProp is a method that fixes AdaGrad's drawback.

Unlike AdaGrad, which reflects all past gradients uniformly,
RMSProp gives greater weight to new gradient information when updating (see the sketch below).

์ •๋ง ์ˆ˜๋งŽ์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์ด ์ œ์•ˆ๋˜์—ˆ๋Š”๋ฐ,
์ด์ œ ๋Œ€์ค‘์ ์œผ๋กœ ๋„๋ฆฌ ์“ฐ์ด๊ณ  ์žˆ๋Š” Adam์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด์ž.

Adam์€ ์‰ฝ๊ฒŒ ์ƒ๊ฐํ•˜๋ฉด momentum + adaGrad ๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค.

์ดˆ๊ธฐํ™”๋ฅผ ์ž˜ ํ•ด์ฃผ์–ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, bias correction์„ ์ถ”๊ฐ€ํ•˜์—ฌ
์ดˆ๊ธฐํ™”๊ฐ€ ์ž˜ ๋˜๋„๋ก ์„ค๊ณ„ํ•ด ์ฃผ์—ˆ๋‹ค.

Comparing it with the algorithms above,

- Adam was said to be the most widely used,
but in the example shown here it does seem to take quite a long detour before settling on the update path.

์ตœ์ ํ™” ๊ธฐ๋ฒ•์€ ์ƒํ™ฉ์— ๋”ฐ๋ผ ์ตœ์ ์˜ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ด ๋ชจ๋‘ ๋‹ค๋ฅด๋‹ค!

์ง€๊ธˆ๊นŒ์ง€ ๋ณด์—ฌ์ค€ ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์€
๋ชจ๋‘ Learning rate๋ฅผ hyperparameter๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

Learning rate decay๋„ ์žˆ์ง€๋งŒ
์ฒ˜์Œ์—๋Š” ์—†๋‹ค๊ณ  ์ƒ๊ฐํ•˜๊ณ  ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ์„ค๊ณ„ํ•œ ๋‹ค์Œ,
๋‚˜์ค‘์— ๊ณ ๋ คํ•ด์ฃผ๋„๋ก ํ•˜์ž.

Optimizing via a first-order approximation has the drawback that a single step cannot go far.

A second-order approximation is usually obtained with the Taylor expansion.
Updating this way has the advantage that, in principle, no learning rate needs to be set. (No Hyperparameters!)
But the complexity is far too high.

Approximating with a second-order function is what Quasi-Newton methods do;
they are one family of non-linear optimization methods.

They are widely used because they require less computation than full Newton methods.
The most used algorithms among them are BFGS and L-BFGS.

These algorithms perform well in the full-batch setting,
so they are worth trying when there is little stochasticity in the setting (see the sketch below).

The methods we have covered so far are all
methods used to reduce error during training.
Then what should we do to improve performance on data the model has never seen before?

2.5.9 Regularization

Regularization ๊ธฐ๋ฒ•์„ ์„ค๋ช…ํ•˜๊ธฐ ์ „์—, Model Ensembles์— ๋Œ€ํ•ด์„œ ํ•œ ๋ฒˆ ์ •๋ฆฌํ•˜์ž.

Model Ensembles์€ ๊ฐ„๋‹จํžˆ ๋งํ•˜๋ฉด ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๋กœ train์„ ์‹œํ‚ค๊ณ ,
test๋ฅผ ํ•  ๋•Œ ๊ทธ ๊ฒƒ๋“ค์„ ์งฌ๋ฝ•(?)ํ•ด์„œ ์“ฐ๋Š” ๊ฒƒ์„ ๋งํ•œ๋‹ค.

Test๋ฅผ ํ•  ๋•Œ, parameter vector๋“ค์„ Moving average๊ฐ’์„ ์‚ฌ์šฉํ•˜์—ฌ
test๋ฅผ ํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ์žˆ๋‹ค. (Polyak averaging)

์ง€๊ธˆ๊นŒ์ง€์˜ ๋ฐฉ๋ฒ•๋“ค์€ ๋ชจ๋‘ Test๋ฅผ ํ•˜๋Š”๋ฐ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์„ ์ข€ ๋” robustํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ธฐ๋ฒ•๋“ค์ด๋‹ค.

๊ทธ๋ ‡๋‹ค๋ฉด single-model์˜ ์„ฑ๋Šฅ์„ ์ข‹๊ฒŒ ํ•˜๊ธฐ์œ„ํ•ด์„  ์–ด๋–ค ๋ฐฉ๋ฒ•์„ ์“ธ๊นŒ?

๋‹ต์€ Regularization์ด๋‹ค.

Regularization์€ ๊ฐ„๋‹จํžˆ loss function์„ ๊ตฌํ˜„ํ•  ๋•Œ,
regularization์— ๋Œ€ํ•œ function์„ ์ถ”๊ฐ€ํ•ด์ฃผ๊ธฐ๋„ ํ•œ๋‹ค.

Another technique is called dropout.

Dropout is effective because the prediction comes to use a variety of features, which prevents the model from depending on any one particular feature.
It also lets a single model get an ensemble-like effect.

At test time, we want to average out the randomness introduced during training...

With dropout, this is approximated by scaling the test-time activations down by the keep probability (or, with the inverted form, scaling at training time instead; see the sketch below).

Another regularization method is Data Augmentation.

During training, we can take random patches (crops) of the image and train on them,

flip the image and add it to the train dataset,

or change the brightness and add that to the train dataset for training as well (a sketch follows below).

์ด ์™ธ์—๋„ ๋‹ค์–‘ํ•œ regularization ๋ฐฉ๋ฒ•๋“ค์ด ์กด์žฌํ•œ๋‹ค.

2.5.10 Transfer Learning

์ „์ดํ•™์Šต์€ ๊ฐ„๋‹จํžˆ ๋งํ•˜๋ฉด ์ด๋ฏธ pretrained๋œ ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ์šฐ๋ฆฌ๊ฐ€ ์ด์šฉํ•˜๋Š” ๋ชฉ์ ์— ๋งž๊ฒŒ fine tuningํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋งํ•œ๋‹ค.

Small Dataset์œผ๋กœ ๋‹ค์‹œ training ์‹œํ‚ค๋Š” ๊ฒฝ์šฐ
๋ณดํ†ต์˜ learning rate๋ณด๋‹ค ๋‚ฎ์ถฐ์„œ ๋‹ค์‹œ training์„ ์‹œํ‚จ๋‹ค.

DataSet์ด ์กฐ๊ธˆ ํด ๊ฒฝ์šฐ, ์ข€ ๋” ๋งŽ์€ layer๋“ค์„ train ์‹œํ‚จ๋‹ค.

ํ•œ ๋ฒˆ ๋” ํ‘œ๋กœ ์ •๋ฆฌํ•ด๋ณด๋ฉด, ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

์ „์ดํ•™์Šต์€ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ˆ ๊ผญ ์•Œ์•„๋‘์ž!
