Loss Function and Optimization

민정 · June 1, 2022

How can we tell whether the weight matrix W of a linear classifier is good or bad?
To quantify how good a W is, we need a loss function.
We start from a random W and search for a W that minimizes the loss; this is how the weights of a linear classifier are optimized.

Types of loss functions

(1) Hinge Loss

✔ Binary hinge loss (=binary SVM loss)

There are only two classes: positive or negative.
(1) If the $i^{th}$ image is in the positive class, then its class label $y_i$ is $+1$.
(2) If the $i^{th}$ image is in the negative class, then its class label $y_i$ is $-1$.

$$\begin{aligned} L_i &= \max(0,\, 1 - y_i \cdot s)\\ s &= W^T x_i + b \end{aligned}$$

To make the loss zero ($L_i = 0$), $y_i \cdot s \geq 1$ must be satisfied.
There are two cases where $y_i \cdot s \geq 1$:
(1) The $i^{th}$ sample is in the positive class: $y_i = 1$, so $s \geq 1$.
(2) The $i^{th}$ sample is in the negative class: $y_i = -1$, so $s \leq -1$.
∴ There is a margin (from $-1$ to $+1$) between the positive class and the negative class.
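
A minimal NumPy sketch of the binary hinge loss for one sample (the names `W`, `b`, `x_i` and the toy values are assumptions for this example):

```python
import numpy as np

def binary_hinge_loss(W, b, x_i, y_i):
    """Binary hinge loss for one sample.
    y_i must be +1 (positive class) or -1 (negative class)."""
    s = W.T @ x_i + b               # raw score, a scalar
    return max(0.0, 1.0 - y_i * s)  # zero only when y_i * s >= 1

# toy usage: 3-dimensional input
W = np.array([0.2, -0.5, 0.1])
b = 0.3
x_i = np.array([1.0, 2.0, -1.0])
print(binary_hinge_loss(W, b, x_i, +1))  # loss if the sample is positive
print(binary_hinge_loss(W, b, x_i, -1))  # loss if the sample is negative
```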

✔ Hinge loss (=multiclass SVM loss)

For a number of classes $c\,(>2)$,

$$\begin{aligned} L_i &= \sum_{j=1,\, j\neq y_i}^{c} \max\bigl(0,\, 1-(s_{y_i}-s_j)\bigr)\\ s &= W x_i + b \end{aligned}$$

$$W=\begin{pmatrix} w_1^T\\ w_2^T\\ \vdots\\ w_c^T \end{pmatrix}, \qquad s=\begin{pmatrix} s_1\\ s_2\\ \vdots\\ s_c \end{pmatrix}$$

To make the loss zero ($L_i=0$), $s_{y_i}-s_j\geq1$ should be satisfied for every $j \neq y_i$.
In other words, if the $i^{th}$ image belongs to class $y_i$, the following conditions should be satisfied:

$$\begin{aligned} s_{y_i}-s_1 &\geq 1\\ s_{y_i}-s_2 &\geq 1\\ &\;\;\vdots\\ s_{y_i}-s_{y_i-1} &\geq 1\\ s_{y_i}-s_{y_i+1} &\geq 1\\ &\;\;\vdots\\ s_{y_i}-s_{c} &\geq 1 \end{aligned}$$

∴ There is a margin of 1 between class $y_i$ and every other class.
Class $y_i$ must have the largest score, and that score must be at least 1 greater than the score of every other class.
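
A minimal NumPy sketch of the multiclass SVM loss for one sample (the shapes and names are assumptions: `W` is $c\times d$, `x_i` is a $d$-vector, `y_i` is the correct class index):

```python
import numpy as np

def multiclass_svm_loss(W, b, x_i, y_i):
    """Multiclass hinge (SVM) loss for one sample.
    W: (c, d) weights, b: (c,) bias, x_i: (d,) input, y_i: correct class index."""
    s = W @ x_i + b                                # score vector of length c
    margins = np.maximum(0.0, 1.0 - (s[y_i] - s))  # hinge term for every class
    margins[y_i] = 0.0                             # the j == y_i term is excluded
    return margins.sum()

# toy usage: c = 3 classes, d = 4 features
W = np.random.randn(3, 4)
b = np.zeros(3)
x_i = np.random.randn(4)
print(multiclass_svm_loss(W, b, x_i, y_i=1))
```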

(2) Log Likelihood Loss

For the $j$ that satisfies $z_{ij}=1$,

$$\begin{aligned} L_i &= -\log p_j\\ P &= \begin{pmatrix} p_1\\ p_2\\ \vdots\\ p_c \end{pmatrix} \end{aligned}$$

$P$ is not a vector of scores but of probabilities.
$p_j$ is the probability that the $i^{th}$ image belongs to the $j^{th}$ class.
⁕ The class label $z_{ij}$ for the $i^{th}$ image is either $0$ or $1$.

As $p_j$ gets closer to $1$, the loss approaches its minimum. ( ∵ $-\log(1) = 0$ )

Suppose the $i^{th}$ image belongs to class 2 $(j=2)$ and $c=10$.
Then $z_i$ and $L_i$ are as below:

$$z_i=\begin{pmatrix}0\\1\\0\\\vdots\\0\end{pmatrix}, \qquad p=\begin{pmatrix}0.1\\0.7\\0\\\vdots\\0.2\end{pmatrix}, \qquad \therefore\; L_i=-\log 0.7$$
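
A minimal sketch of this computation in NumPy (the concrete probability values below are illustrative placeholders chosen to sum to 1):

```python
import numpy as np

# one-hot label: the image belongs to class 2 (index 1 when counting from 0)
z_i = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
# predicted class probabilities; the values are illustrative and sum to 1
p = np.array([0.1, 0.7, 0.0, 0.05, 0.05, 0.02, 0.03, 0.02, 0.02, 0.01])

j = np.argmax(z_i)       # the index j where z_ij = 1
L_i = -np.log(p[j])      # log-likelihood loss = -log 0.7
print(L_i)               # ~0.357
```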

(3) Cross-entropy Loss

$$L_i=-\sum_{j=1}^{c}\bigl(z_{ij}\log p_j+(1-z_{ij})\log (1-p_j)\bigr)$$

(1) If $z_{ij}=0$, the term $-\log(1-p_j)$ is added to $L_i$.
(2) If $z_{ij}=1$, the term $-\log p_j$ is added to $L_i$.
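
A minimal NumPy sketch of this cross-entropy loss, assuming `z` is a 0/1 label vector and `p` a probability vector of the same length:

```python
import numpy as np

def cross_entropy_loss(z, p, eps=1e-12):
    """Cross-entropy over all c classes.
    z: (c,) vector of 0/1 labels, p: (c,) vector of predicted probabilities."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

z = np.array([0, 1, 0])
p = np.array([0.2, 0.7, 0.1])
print(cross_entropy_loss(z, p))
```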

⁕ Softmax Activation Function

Probability is proportional to $e^{\text{score}}$, so we can compute probabilities from the scores. The function below is the softmax activation function.

$$P(Y=k \mid X=x_i)=p_k=\frac{e^{s_k}}{\sum_{j=1}^{c}{e^{s_j}}}$$
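
A minimal, numerically stable softmax sketch (subtracting the maximum score before exponentiating does not change the result but avoids overflow):

```python
import numpy as np

def softmax(s):
    """Convert a score vector s of length c into probabilities."""
    e = np.exp(s - np.max(s))  # shift for numerical stability
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])
print(softmax(s))              # probabilities summing to 1
```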

✔ Softmax + Log likelihood loss:

We can apply the log-likelihood loss to the probabilities produced by the softmax function.
This combination is often called a 'softmax classifier'.

✔ Softmax + Cross-entropy loss:

We can apply the cross-entropy loss to the probabilities produced by the softmax function.
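
A minimal sketch of the softmax classifier for one sample: scores are turned into probabilities with softmax, and the negative log probability of the correct class is the loss (`W`, `b`, `x_i`, `y_i` are assumed names, as before):

```python
import numpy as np

def softmax_classifier_loss(W, b, x_i, y_i):
    """Softmax + log-likelihood loss for one sample.
    W: (c, d), b: (c,), x_i: (d,), y_i: correct class index."""
    s = W @ x_i + b              # class scores
    e = np.exp(s - np.max(s))    # stable softmax
    p = e / e.sum()              # class probabilities
    return -np.log(p[y_i])       # log-likelihood loss on the correct class

W = np.random.randn(3, 4)
b = np.zeros(3)
x_i = np.random.randn(4)
print(softmax_classifier_loss(W, b, x_i, y_i=2))
```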

✔ Comparison with SVM (hinge loss)

The softmax classifier constantly tries to make the loss smaller, because its loss never becomes exactly zero.
The SVM, however, does not try to reduce the loss further once the margin reaches 1 or more, because the loss is already zero there.

(4) Regression Loss

Regression loss functions are widely used for pixel-level prediction. (Ex) image denoising
Using the L1 or L2 norm:

$$L_i=\vert y_i-s_i\vert \quad \text{or} \quad L_i=(y_i-s_i)^2$$
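
A minimal sketch of the two regression losses (here `y` is the target value and `s` the predicted value, both assumed scalars for one pixel/sample):

```python
import numpy as np

def l1_loss(y, s):
    """L1 (absolute error) regression loss."""
    return np.abs(y - s)

def l2_loss(y, s):
    """L2 (squared error) regression loss."""
    return (y - s) ** 2

y, s = 0.8, 0.5
print(l1_loss(y, s), l2_loss(y, s))  # ~0.3 and ~0.09
```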

Regularization

Suppose we found a W such that $L=0$.
Is this W unique?
No. $2\times W$ also gives $L=0$.

For this reason, a regularization loss is needed.
In the function below, $\lambda R(W)$ is the regularization term.

$$L=\frac{1}{N}\sum_{i=1}^{N}{L_i(f(x_i,W),y_i)}+\lambda R(W)$$

✔ 4 types of Regularization:
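
As an illustration (not the full list above), here is a sketch of two commonly used regularizers, L2 ($R(W)=\sum W^2$) and L1 ($R(W)=\sum\vert W\vert$), added to an assumed placeholder data loss with strength $\lambda$:

```python
import numpy as np

def l2_regularization(W):
    """R(W) = sum of squared weights."""
    return np.sum(W * W)

def l1_regularization(W):
    """R(W) = sum of absolute weights."""
    return np.sum(np.abs(W))

W = np.random.randn(3, 4)
lam = 1e-3            # regularization strength lambda
data_loss = 1.25      # placeholder for (1/N) * sum of L_i
total_loss = data_loss + lam * l2_regularization(W)
print(total_loss)
```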

Optimization

(1) Gradient Descent

Gradient descent is the simplest approach to minimizing a loss function.

$$W^{t+1}=W^{t}-\alpha\frac{\partial L}{\partial W^{t}}$$
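
A minimal sketch of this update rule; `loss_gradient` is a placeholder for whatever computes $\partial L/\partial W$ for the chosen loss:

```python
import numpy as np

def gradient_descent(W, loss_gradient, alpha=1e-2, num_steps=500):
    """Repeatedly apply W <- W - alpha * dL/dW.
    loss_gradient(W) must return an array with the same shape as W."""
    for _ in range(num_steps):
        dW = loss_gradient(W)   # gradient of the loss at the current W
        W = W - alpha * dW      # step in the direction of steepest descent
    return W

# toy usage: minimize L(W) = ||W||^2, whose gradient is 2W
W0 = np.random.randn(3, 4)
W_opt = gradient_descent(W0, loss_gradient=lambda W: 2 * W)
print(np.abs(W_opt).max())      # close to 0 after enough steps
```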

(2) Stochastic Gradient Descent (SGD)

The full sum over all N examples is too expensive when N is large.
Instead, we approximate the sum using a minibatch of examples; minibatch sizes of 32, 64, or 128 are common.

$$L=\frac{1}{N}\sum_{i=1}^{N}{L_i(f(x_i,W),y_i)}+\lambda R(W)$$

$$\frac{\partial L}{\partial W}=\frac{1}{N}\sum_{i=1}^{N}{\frac{\partial L_i(f(x_i,W),y_i)}{\partial W}}+\lambda\frac{\partial R(W)}{\partial W}$$
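
A minimal SGD sketch under the same assumptions; `loss_gradient(W, X_batch, y_batch)` is a placeholder for the minibatch estimate of $\partial L/\partial W$:

```python
import numpy as np

def sgd(W, X, y, loss_gradient, alpha=1e-2, batch_size=64, num_steps=1000):
    """Stochastic gradient descent with minibatches.
    loss_gradient(W, X_batch, y_batch) must return the approximate dL/dW."""
    N = X.shape[0]
    for _ in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        dW = loss_gradient(W, X[idx], y[idx])                 # approximate gradient
        W = W - alpha * dW                                    # same update rule as before
    return W

# toy usage: linear regression, L_i = (w.x_i - y_i)^2, minibatch gradient below
X = np.random.randn(1000, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
grad = lambda W, Xb, yb: 2 * Xb.T @ (Xb @ W - yb) / len(yb)
W_fit = sgd(np.zeros(4), X, y, grad)
print(W_fit)   # approaches [1, -2, 0.5, 3]
```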
