A Recipe for Training Neural Networks: Summary

EuiyeonKim · April 12, 2020

Andrej Karpathy DeepLearning

A summary of the recipe Andrej Karpathy wrote for deep learning developers

Problems with NNs

A neural network

is a leaky abstraction
fails silently

*leaky abstraction: a problem in which the roles and responsibilities of one layer spill unnecessarily into another layer

Recipes

1. Become one with the data

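Look at the samples yourself and understand the distribution → check whether a human can do this task well
Reflecting carefully on the thought process you used to understand the data can give you a feel for which model to use
Use sorting and filtering to inspect outliers and other oddities in the dataset.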
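
A minimal sketch of the sorting and filtering step, using only the standard library. The toy `samples` list, the length-based sort key, and the "empty input" filter are all hypothetical choices made for illustration; a real dataset would call for its own statistics.

```python
from collections import Counter

# Hypothetical toy dataset: (text, label) pairs invented for illustration.
samples = [
    ("good movie", "pos"),
    ("bad", "neg"),
    ("", "neg"),             # empty input: likely a data bug
    ("great " * 50, "pos"),  # unusually long input
    ("not bad at all", "pos"),
]

# Label distribution: reveals class imbalance at a glance.
label_counts = Counter(label for _, label in samples)

# Sort by a simple statistic (text length) and look at both extremes.
by_length = sorted(samples, key=lambda s: len(s[0]))
shortest, longest = by_length[0], by_length[-1]

# Filter for suspicious samples, here inputs that are empty after stripping.
empty = [s for s in samples if not s[0].strip()]

print(label_counts)
print(repr(shortest[0]), len(longest[0]))
print(len(empty))
```

Looking at the extremes of a sorted view is usually where corrupted or mislabeled samples show up first.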

2. Set up the end-to-end skeleton + get dumb baselines

Simplify everything and verify that there are no bugs

Fix random seed

Simplify

Simplify data augmentation, the model architecture, and so on

Add significant digits to your eval

When evaluating, compute the loss over the entire test set, not over a single batch
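
A small sketch of why the full-set loss should be accumulated per example rather than averaged per batch. The helper name `full_dataset_loss` and the loss values are hypothetical; the point is only that a smaller last batch skews a mean of batch means.

```python
def full_dataset_loss(batches):
    """Average per-example loss over every batch, weighted by batch size."""
    total_loss, total_count = 0.0, 0
    for per_example_losses in batches:
        total_loss += sum(per_example_losses)
        total_count += len(per_example_losses)
    return total_loss / total_count

# Hypothetical per-example losses; the last batch is smaller than the first.
batches = [[1.0, 1.0, 1.0, 1.0], [3.0]]

exact = full_dataset_loss(batches)  # (4 * 1.0 + 3.0) / 5
mean_of_batch_means = sum(sum(b) / len(b) for b in batches) / len(batches)

print(exact, mean_of_batch_means)
```

The two numbers differ because the single example in the last batch counts as much as the four examples in the first when batch means are averaged.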

Verify loss init

With a softmax output, the first loss should equal the following.
$-\log\left(\frac{1}{n_{\mathrm{classes}}}\right)$
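
The check can be sketched as follows, assuming a 10-class problem and a freshly initialized model whose logits are all equal (both assumptions are for illustration). `softmax_cross_entropy` is a hand-written helper, not a library function.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

n_classes = 10
expected = -math.log(1.0 / n_classes)  # equals ln(n_classes)

# Uniform logits stand in for the output of a freshly initialized classifier.
initial_loss = softmax_cross_entropy([0.0] * n_classes, target=3)
print(expected, initial_loss)
```

If the first logged training loss is far from this value, the initialization or the loss wiring is worth a second look.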

Init well

If the outputs have a mean of 50, initialize the final layer's bias so that it averages 50.
If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization.
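
A worked sketch of the bias computation, assuming a single sigmoid output whose weights start at (or near) zero so that the bias alone sets the initial prediction. The helper names `logit` and `sigmoid` are defined here for illustration.

```python
import math

def logit(p):
    """Inverse of the sigmoid: the bias b for which sigmoid(b) == p."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

target_prob = 0.1          # the probability quoted in the text above
bias = logit(target_prob)  # ln(0.1 / 0.9), a negative number
print(bias, sigmoid(bias))
```

Starting from this bias, the network does not have to spend its first iterations just learning the base rate.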

Human baseline

Use a metric you can interpret, such as accuracy
If possible, evaluate the data yourself → find out what score a human gets on the metric

Input independent baseline

Over-fit one batch

Over-fit a batch of only a few examples and train until you reach the lowest achievable loss.
After training, visualize the labels and predictions to confirm the model really learned them
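
A minimal sketch of over-fitting one tiny batch, using a hand-written logistic regression so it runs without any framework. The four data points, the learning rate, and the step count are hypothetical; with a real network the same idea applies to a handful of real examples.

```python
import math

# Hypothetical tiny batch: 1-D inputs with binary labels (linearly separable).
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_loss(w, b):
    """Mean binary cross-entropy over the batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

initial_loss = batch_loss(w, b)  # ln(2) when w = b = 0
for _ in range(2000):
    # Gradients of the mean BCE: mean of (p - y) * x and mean of (p - y).
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = batch_loss(w, b)
predictions = [round(sigmoid(w * x + b)) for x in xs]
print(initial_loss, final_loss, predictions)
```

If the loss on such a small batch refuses to approach its minimum, the bug is somewhere in the pipeline rather than in the amount of data.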

Verify decreasing training loss

Increase the model's capacity just a little and check that the training loss goes down as it should

Visualize before net

Always check that the input going into the network is correct!

Visualize prediction dynamics

Visualize how the model's predictions on a small, fixed test batch change over the course of training
The amount of jitter tells you whether the learning rate is appropriate

Use back-prop to chart dependencies

One way to debug this (and other related problems) is to set the loss to be something trivial like the sum of all outputs of example i, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the i-th input. The same strategy can be used to e.g. ensure that your autoregressive model at time t only depends on 1..t-1. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.
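
The dependency check described above relies on back-prop. As a framework-free stand-in, the sketch below swaps in finite differences: perturb input j and record which outputs move. The functions `clean_forward` and `buggy_forward` are hypothetical toy models, the second one mixing examples through a batch mean.

```python
def clean_forward(batch):
    """Each output depends only on its own input."""
    return [2.0 * x + 1.0 for x in batch]

def buggy_forward(batch):
    """Bug: subtracting the batch mean mixes information across examples."""
    mean = sum(batch) / len(batch)
    return [2.0 * (x - mean) for x in batch]

def dependency_matrix(forward, batch, eps=1e-3):
    """deps[i][j] is True if output i changes when input j is perturbed."""
    base = forward(batch)
    deps = [[False] * len(batch) for _ in batch]
    for j in range(len(batch)):
        perturbed = list(batch)
        perturbed[j] += eps
        out = forward(perturbed)
        for i in range(len(batch)):
            deps[i][j] = abs(out[i] - base[i]) > 1e-9
    return deps

def is_diagonal(deps):
    """True if output i depends on input i and on nothing else."""
    n = len(deps)
    return all(deps[i][j] == (i == j) for i in range(n) for j in range(n))

batch = [0.5, -1.0, 2.0]
clean_ok = is_diagonal(dependency_matrix(clean_forward, batch))
buggy_ok = is_diagonal(dependency_matrix(buggy_forward, batch))
print(clean_ok, buggy_ok)
```

With an autograd framework, the same matrix would come from the gradient of output i with respect to each input, which is cheaper and exact.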

Generalize special cases

Write the specific case first, then generalize
The key is to start small, confirm it works, and then scale up

3. Over-fit

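The steps above gave us a dumb model that works

Find a model large enough to over-fit
Regularize it

If the first step (finding a model that over-fits) does not work, that hints at issues, bugs, or misconfiguration outside the model architecture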

Picking model

Start by finding a similar project and copy-pasting it
If something that worked stops working after you modify it, you know the problem is in the part you changed

Adam(3e-4) is safe

In practice, Adam is forgiving of many hyper-parameters, including the learning rate
For ConvNets, a well-tuned SGD almost always slightly outperforms Adam
→ although the better region is narrow and problem specific
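
One property that may contribute to this forgiveness is that Adam normalizes each step by a running estimate of the gradient magnitude. The sketch below is a hand-written scalar Adam update with the commonly used defaults for `beta1`, `beta2`, and `eps` (assumed here) and the 3e-4 learning rate from the heading; it is an illustration, not a replacement for a library optimizer.

```python
import math

def adam_step(param, grad, state, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# Take one step on f(x) = scale * x^2 for two very different gradient scales.
first_steps = []
for scale in (1.0, 1000.0):
    x, state = 1.0, {"t": 0, "m": 0.0, "v": 0.0}
    x_new = adam_step(x, grad=scale * 2 * x, state=state)
    first_steps.append(x - x_new)

print(first_steps)
```

Both first steps come out close to the learning rate even though the raw gradients differ by a factor of 1000, whereas plain SGD would take steps that differ by the same factor.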

Complexify only one at a time

Change only one thing at a time, so that you can tell what went wrong

Do not trust learning rate decay defaults

Where possible, leave learning rate decay out and tune it at the very end

4. Regularize

Now that we have a model that over-fits the training set, let's regularize it
Time to give up some training accuracy in exchange for validation accuracy

Get more data

This is honestly the best approach; failing that, ensembles

Data augmentation

Try aggressive augmentation, such as half-fake data

Creative augmentation

Fake data
Domain randomization, simulation data, inserting data into scenes, even GANs

Pretrain

It rarely hurts

Stick with supervised learning

Smaller input dimensionality

The input dimensionality does not need to be large → e.g. resolution
Remove features that may contain spurious signal. Any added spurious input is just another opportunity to overfit if your dataset is small. Similarly, if low-level details don’t matter much try to input a smaller image.

Smaller model size

As an example, it used to be trendy to use Fully Connected layers at the top of backbones for ImageNet but these have since been replaced with simple average pooling, eliminating a ton of parameters in the process.

Decrease batch size

With BN, a smaller batch gives less exact estimates of the batch mean and std, so the normalization wiggles more → this acts as stronger regularization
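
A rough illustration of that effect using only the standard library: the spread of per-batch mean estimates grows as the batch shrinks. The Gaussian population and the batch sizes 4 and 256 are assumptions chosen for illustration, and no actual batch norm layer is involved.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical activations drawn from a unit Gaussian.
population = [rng.gauss(0.0, 1.0) for _ in range(10000)]

def batch_mean_spread(batch_size, n_batches=200):
    """Standard deviation of the batch mean across many random batches."""
    means = [
        statistics.mean(rng.sample(population, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.stdev(means)

small_batch_spread = batch_mean_spread(4)
large_batch_spread = batch_mean_spread(256)
print(small_batch_spread, large_batch_spread)
```

The noisier statistics of the small batch are what the normalization layer sees at training time, which is where the extra regularization would come from.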

Drop

Use dropout, but sparingly, because it does not play well with BN (reference)
For ConvNets, use dropout2d (spatial dropout) → dropout applied per filter, so whole feature maps are dropped
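
A framework-free sketch of what per-filter dropout means: each channel is either kept whole or zeroed whole. The function `spatial_dropout` is a hypothetical hand-written helper that uses inverted-dropout scaling (an assumption); in practice a library layer such as dropout2d would be used instead.

```python
import random

def spatial_dropout(feature_maps, p, rng):
    """Zero whole channels with probability p and rescale the survivors.

    feature_maps: list of channels, each a 2-D list (H x W) of floats.
    """
    keep_scale = 1.0 / (1.0 - p)  # inverted-dropout scaling
    out = []
    for channel in feature_maps:
        if rng.random() < p:
            out.append([[0.0 for _ in row] for row in channel])
        else:
            out.append([[v * keep_scale for v in row] for row in channel])
    return out

rng = random.Random(0)  # fixed seed for reproducibility
maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]  # 8 channels of 2x2 ones
dropped = spatial_dropout(maps, p=0.5, rng=rng)

# Each channel contains a single distinct value: all zeros or all rescaled.
for channel in dropped:
    print({v for row in channel for v in row})
```

Dropping whole feature maps matters for ConvNets because neighbouring pixels in one map are strongly correlated, so dropping individual activations removes little information.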

Weight decay

Early stopping

Watch the validation loss to judge whether the model is over-fitting
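
A minimal sketch of an early-stopping rule driven by validation loss. The `patience` parameter, the helper name, and the loss curve are hypothetical; real training loops would also save a checkpoint at each new best epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the best checkpoint is the one from best_epoch.
    """
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: improves, then starts to over-fit.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses)
print(best_epoch, stop_epoch)
```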

Try larger model

Larger models do over-fit more easily, but
the early-stopped version of a larger model often outperforms a smaller model.

Once all of the above is done, visualize the first-layer weights
If the filters show sharp edges, that makes sense; if they look like noise, something has probably gone wrong

5. Tune

Random over grid-search

Random search works better than grid search for hyper-parameter search
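
A small sketch of random search, assuming two hypothetical hyper-parameters (`lr` and `dropout`) and hypothetical ranges. The learning rate is sampled log-uniformly because plausible values span several orders of magnitude.

```python
import random

rng = random.Random(0)  # fixed seed so the trial list is reproducible

def sample_config(rng):
    """Draw one hypothetical hyper-parameter configuration."""
    return {
        # Log-uniform over roughly [1e-5, 1e-1].
        "lr": 10 ** rng.uniform(-5, -1),
        "dropout": rng.uniform(0.0, 0.5),
    }

trials = [sample_config(rng) for _ in range(9)]

# A 3x3 grid tries only 3 distinct learning rates in 9 runs;
# random search can try a different value in every run.
distinct_lrs = len({t["lr"] for t in trials})
print(distinct_lrs)
```

When one hyper-parameter matters far more than the other, covering more distinct values of it for the same budget is the advantage random search has over a grid.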

Hyper-parameter optimization

Bayesian hyper-parameter optimization is said to work well, but handing the search to an intern turned out to work better ^^

6. Squeeze out the juice

For when everything works and you still want to push performance higher

Ensembles

Ensembles can raise accuracy by about 2%
If the cost is too high at inference time, try knowledge distillation
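
A minimal sketch of one common way to ensemble classifiers: average the predicted class probabilities and take the argmax. The three probability vectors are hypothetical, and averaging is only one of several possible combination rules.

```python
def ensemble_predict(prob_lists):
    """Average class probabilities from several models, then take the argmax."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return avg, max(range(n_classes), key=lambda c: avg[c])

# Hypothetical softmax outputs of three models for one example.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
avg, predicted_class = ensemble_predict(model_probs)
print(avg, predicted_class)
```

Here the first model on its own would pick class 0, while the averaged prediction follows the other two models.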

Leave it training

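Networks keep training for longer than you would expect
Karpathy once forgot about a model and left it training over the break, and it turned out to be SOTA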

