Batch normalization makes your hyperparameter search problem much easier. Let's see how batch normalization works.
When training a model, such as logistic regression,
you might remember that normalizing the input features can speed up learning:
compute the means, subtract off the means from your training set, compute the variances, and then normalize your data set according to the variances.
And we saw in an earlier video how this can turn the contours of your learning problem from something that might be very elongated into something that is more round,
and easier for an algorithm like gradient descent to optimize. So this works,
in terms of normalizing the input feature values, for a neural network or for logistic regression.
In the case of logistic regression,
we saw how normalizing can help you train more efficiently.
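The steps above (compute means, subtract them off, compute variances, scale) can be sketched in NumPy. This is a minimal illustration, not code from the lecture; the function name and the small epsilon added for numerical stability are my own additions:

```python
import numpy as np

def normalize_inputs(X):
    """Normalize input features to zero mean and unit variance.

    X: array of shape (n_features, m_examples), one column per example,
    matching the convention used in the course.
    """
    mu = np.mean(X, axis=1, keepdims=True)        # per-feature mean
    X_centered = X - mu                           # subtract off the means
    sigma2 = np.mean(X_centered ** 2, axis=1, keepdims=True)  # per-feature variance
    X_norm = X_centered / np.sqrt(sigma2 + 1e-8)  # scale to unit variance
    return X_norm, mu, sigma2
```

Note that the same mu and sigma2 computed on the training set should also be used to normalize test data, so that train and test go through the same transformation.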
How about a deeper model?
You have not just input features x, but activations a^[l] in each hidden layer.
So if you want to train the parameters, say w^[l+1] and b^[l+1], wouldn't it be nice
if you could normalize the mean and variance of a^[l] to make the training of w^[l+1] and b^[l+1] more efficient?
So here, the question is: for any hidden layer,
can we normalize the values of a^[l] so as to train w^[l+1] and b^[l+1] faster?
We will actually normalize the values of z^[l], not a^[l].
In practice, normalizing z^[l] is done much more often.
➡️ So this is what batch norm (= batch normalization) does.
In practice, Batch Norm is applied with mini-batches of your training set,
and you perform gradient descent using Batch Norm.
So why does batch norm work?
Here's one reason:
you've seen how normalizing the input features x to mean zero and variance one can speed up learning.
So rather than having some features that range from 0 to 1,
and some from 1 to 1,000, by normalizing all the features,
the input features x take on a similar range of values, and that can speed up learning.
So one intuition behind why batch norm works is that
it is doing a similar thing, but for values in the hidden units and not just for your input layer.
This is just a partial picture for what batch norm is doing.
There are a couple of further intuitions that will help you gain a deeper understanding of what batch norm is doing.
A second reason why batch norm works is that it makes weights later or deeper in your network more robust to changes in the weights of earlier layers.
Let's see what this means by training a network, say a shallow network like logistic regression, or a neural network.
Let's say that you've trained your network on a data set of all images of black cats.
If you now try to apply this network to data with colored cats, then your classifier might not do very well.
You might not expect a model trained on the data of black cats to do very well on the data of colored cats.
So this idea of your data distribution changing goes by the somewhat fancy name
"covariate shift".
And the idea is that if you've learned some X-to-Y mapping,
and the distribution of X changes, then you might need to retrain your learning algorithm.
And this is true even if the ground truth function mapping from X to Y
remains unchanged, which it does in this example,
because the ground truth function is whether a picture is of a cat or not.
And the need to retrain your function becomes even more acute, or
it becomes even worse, if the ground truth function shifts as well.
So how does this problem of covariate shift apply to a neural network?
Batch norm also has a second effect: a slight regularization effect.
Batch Norm processes data one mini-batch at a time, but
at test time you might need to process one example at a time.
Let's see how you can adapt your network to do that.
Here are the equations you'd use to implement batch norm.
Notice that μ and σ², which you need for this scaling calculation,
are computed on the entire mini-batch.
But at test time you might not have a mini-batch of 64, 128 or 256 examples to process at the same time.
So, you need some different way of coming up with μ and σ².
And if you have just one example, taking the mean and variance of that one example doesn't make sense.
So in order to apply your neural network at test time, you need to come up with some separate estimate of μ and σ².
So what's actually done?
➡️ What you do is estimate these using an exponentially weighted average,
where the average is across the mini-batches.
Let's pick some layer l and say you're going through mini-batches X^{1}, X^{2}, X^{3}, and so on.
So that exponentially weighted average becomes your estimate for what the mean of z is for that hidden layer,
and similarly, you use an exponentially weighted average to keep track of the variance σ².
So you keep a running average of the μ and σ² that you're seeing for each layer as you train the neural network across different mini-batches.
Then finally at test time, in place of the mini-batch equation,
you would just compute the normalized value using your exponentially weighted averages of μ and σ².
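A minimal sketch of this running-average bookkeeping and of the test-time computation, assuming a momentum-style weighting (the function names, the 0.9 momentum value, and eps are my own illustrative choices, not from the lecture):

```python
import numpy as np

def update_running_stats(running_mu, running_sigma2, mu_batch, sigma2_batch,
                         momentum=0.9):
    """Exponentially weighted update of the per-layer mean/variance estimates,
    called once per mini-batch during training."""
    running_mu = momentum * running_mu + (1 - momentum) * mu_batch
    running_sigma2 = momentum * running_sigma2 + (1 - momentum) * sigma2_batch
    return running_mu, running_sigma2

def batch_norm_test_time(z, running_mu, running_sigma2, gamma, beta, eps=1e-8):
    """Normalize a single test example with the running estimates
    instead of mini-batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_sigma2 + eps)
    return gamma * z_norm + beta
```

The key point the sketch illustrates: at test time no mini-batch statistics are computed at all; the stored running averages stand in for μ and σ².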
So the takeaway from this is that
during training time, μ and σ² are computed on an entire mini-batch of, say, 64, 128 or some number of examples.
But at test time, you might need to process a single example at a time.
So the way to do that is to estimate μ and σ² from your training set, and there are many ways to do that.
In practice, what people usually do is implement an exponentially weighted average, where you just keep track of μ and σ² during training,
also sometimes called the running average.
In practice, this process is pretty robust to the exact way you estimate μ and σ².
So I wouldn't worry too much about exactly how you do this, and if you're using a deep learning framework,
it will usually have some default way to estimate μ and σ² that should work reasonably well.