[ Day 18 ]

Dongbin Lee · February 17, 2021

2021 Boostcamp AI Tech

2021 Boostcamp, Day 18.

[Day 18] NLP

Sequence to Sequence with Attention

Seq2Seq Model

Seq2Seq Model with Attention

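Attention can be used as an additional module on top of the seq2seq model.
An RNN-based model reads the input sequentially from the beginning and accumulates information at every time step, but because the dimension of the RNN hidden state vector is fixed, all of the information has to be packed into the final hidden state vector.
Information that appeared much earlier tends to be distorted or lost as the time steps pass.
With attention, the decoder no longer depends on the single final hidden state vector: the encoder hidden state vectors from every time step are all made available to the decoder, and at each decoding time step the hidden state vectors that are most needed are given priority.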
Use the attention distribution to take a weighted sum of the encoder hidden states
The attention output mostly contains information from the hidden states that received high attention
Concatenate the attention output with the decoder hidden state, then use it to compute $y_1$ as before
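
The three steps above can be sketched in plain Python. This is only a minimal illustration that assumes dot-product scores; the helper names (`softmax`, `attention_step`) and the toy vectors are made up for this example and do not come from the lecture.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_step(decoder_state, encoder_states):
    # 1) One score per encoder hidden state (dot product is an assumption here).
    scores = [dot(decoder_state, h) for h in encoder_states]
    # 2) Attention distribution over the encoder time steps.
    weights = softmax(scores)
    # 3) Attention output: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    # 4) Concatenate the attention output with the decoder hidden state.
    combined = context + list(decoder_state)
    return weights, context, combined

# Toy values, chosen only for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context, combined = attention_step(decoder_state, encoder_states)
```

In a real model the concatenated vector would then go through an output layer to predict the next word; that part is omitted here.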
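Feeding the ground truth to the decoder during training, so that it receives only correct inputs at every time step, is called teacher forcing.
In contrast, training without teacher forcing feeds the decoder its own prediction from the previous time step, whether that prediction was right or wrong.
The latter is closer to how the model is actually used at inference time.
Teacher forcing makes training faster and easier, so it suits the training setting, but the latter approach matches real usage better.
One option is therefore to train with teacher forcing early on and switch to training without it once training has progressed to some degree, which helps the model train well.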
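
One way to realize the "teacher forcing first, less of it later" idea is a teacher forcing ratio that decays over epochs. The sketch below is an assumption, not the lecture's recipe: the linear schedule and the function names are hypothetical, and it treats the predictions as already given, whereas in real training each prediction depends on the inputs fed at earlier steps.

```python
import random

def teacher_forcing_ratio(epoch, num_epochs):
    """Linearly decay from 1.0 (always ground truth) to 0.0 (always predictions)."""
    return max(0.0, 1.0 - epoch / max(1, num_epochs - 1))

def choose_decoder_inputs(ground_truth, predictions, ratio, rng):
    """For each step, feed the ground-truth token with probability `ratio`,
    otherwise feed the model's own (possibly wrong) prediction."""
    inputs = []
    for gold, pred in zip(ground_truth, predictions):
        use_gold = rng.random() < ratio
        inputs.append(gold if use_gold else pred)
    return inputs
```

With a ratio of 1.0 this reduces to pure teacher forcing, and with a ratio of 0.0 the decoder sees only its own predictions, which is the inference-time situation.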

Different Attention Mechanisms

Derivation of the score computed with the concat method

Luong attention

Luong attention takes the decoder hidden state at time $t$,
then calculates the attention scores,
and from those gets the context vector, which is concatenated with the hidden state of the decoder and then used to predict the output.

Bahdanau attention

At time $t$,
we consider the hidden state of the decoder at time $t-1$.
Then we calculate the alignment and
context vectors as above.
But then we concatenate this context with the hidden state of the decoder at time $t-1$.
So before the softmax, this concatenated vector goes into an LSTM unit.

Result

Luong has different types of alignments.
Bahdanau has only a concat-score alignment model.
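
As a rough sketch of the alignment types mentioned above, the three Luong-style score functions (dot, general, concat) can be written in plain Python. The parameter names `W_a` and `v_a` and the toy values are assumptions for illustration; in a real model they are learned weights.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W v.
    return [dot(row, v) for row in W]

def score_dot(h_dec, h_enc):
    # dot: h_dec^T h_enc
    return dot(h_dec, h_enc)

def score_general(h_dec, h_enc, W_a):
    # general: h_dec^T W_a h_enc
    return dot(h_dec, matvec(W_a, h_enc))

def score_concat(h_dec, h_enc, W_a, v_a):
    # concat: v_a^T tanh(W_a [h_dec; h_enc])
    hidden = [math.tanh(x) for x in matvec(W_a, h_dec + h_enc)]
    return dot(v_a, hidden)

# Toy values, chosen only for illustration.
h_dec = [1.0, 2.0]
h_enc = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
```

With the identity matrix as `W_a`, the general score reduces to the dot score, which shows that "general" is the dot product with a learned linear map in between.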

Attention is Great!

Attention significantly improves NMT performance
- It is useful to allow the decoder to focus on particular parts of the source
Attention solves the bottleneck problem
- Attention allows the decoder to look directly at the source and bypass the bottleneck
Attention helps with the vanishing gradient problem
- Provides a shortcut to far-away states
Attention provides some interpretability
- By inspecting attention distribution, we can see what the decoder was focusing on
- The network just learned alignment by itself

Attention Examples in Machine Translation

- It properly learns the grammatical order of words
- It skips unnecessary words such as articles

Beam search

A technique for getting better generation results at test time from natural language generation models such as the seq2seq model covered earlier.

Greedy decoding

At every time step, the single most probable word is selected and generated.
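
A minimal sketch of greedy decoding follows. The "model" is a hypothetical lookup table (`next_token_probs`) with made-up tokens and probabilities, used only so the example runs on its own.

```python
END = "<END>"

def next_token_probs(prefix):
    """Hypothetical toy model: P(next token | prefix)."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    # Any prefix not in the table ends the sentence with probability 1.
    return table.get(tuple(prefix), {END: 1.0})

def greedy_decode(max_len=10):
    prefix, prob = [], 1.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        # Pick the single most probable next token.
        token = max(probs, key=probs.get)
        prob *= probs[token]
        prefix.append(token)
        if token == END:
            break
    return prefix, prob

tokens, prob = greedy_decode()
```

In this toy table, greedy decoding commits to "a" at the first step (0.6 > 0.4) and ends with a sequence probability of 0.3, even though the sequence starting with "b" would reach 0.4 × 0.9 = 0.36.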

Greedy decoding can end up predicting a wrong word, and it has no way to undo that decision.
How should we fix this?

Exhaustive search

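Ideally, we want to find a (length $T$) translation $y$ that maximizes

$$P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x) \cdots P(y_T \mid y_1, \ldots, y_{T-1}, x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)$$

This uses the formula for the probability distribution of joint events (the chain rule of probability).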
We could try computing all possible sequences $y$
- This means that on each step $t$ of the decoder, we are tracking $V^t$ possible partial translations, where $V$ is the vocabulary size
- This $O(V^t)$ complexity is far too expensive!
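
To get a feel for how quickly $V^t$ grows, here is a small arithmetic sketch. The vocabulary size of 50,000 is an assumed, illustrative number, not one given in the lecture.

```python
def num_partial_translations(vocab_size, step):
    """Number of partial translations tracked at decoder step `step`."""
    return vocab_size ** step

# Assumed vocabulary size, for illustration only.
VOCAB_SIZE = 50_000
for t in (1, 2, 5, 10):
    count = num_partial_translations(VOCAB_SIZE, t)
    print(f"step {t}: about 10^{len(str(count)) - 1} partial translations")
```

Even at step 2 the count is already in the billions, which is why exhaustive search is not practical.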

Beam search

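An algorithm that sits between greedy decoding, which considers only one word at every time step, and exhaustive search, which considers every possible combination at every time step.

Core idea: at every decoder time step, keep track of a pre-defined number $k$ of candidate hypotheses, keep only $k$ of them as the time steps advance, and finally choose the one with the highest probability.
The beam size $k$ is typically set to a value between 5 and 10.
Scores are all negative (they are log probabilities), and a higher score is better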
We search for high-scoring hypotheses, tracking the top $k$ ones on each step
Beam search is not guaranteed to find a globally optimal solution.
But it is much more efficient than exhaustive search!

Example

Beam size: $k = 2$

Stopping criterion

In greedy decoding, usually we decode until the model produces a <END> token.
In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
- When a hypothesis produces <END>, that hypothesis is complete
- Place it aside and continue exploring other hypotheses via beam search
Usually we continue beam search until:
- We reach timestep $T$ (where $T$ is some pre-defined cutoff)
- We have at least $n$ completed hypotheses (where $n$ is some pre-defined cutoff)
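
The procedure and both stopping rules can be sketched as follows. This is a simplified version under stated assumptions: the "model" is a hypothetical lookup table with made-up probabilities, scores are sums of log probabilities, and a completed hypothesis uses up one of the top-$k$ slots at the step where it finishes (implementations differ on that detail).

```python
import math

END = "<END>"

def next_token_probs(prefix):
    """Hypothetical toy model: P(next token | prefix)."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table.get(tuple(prefix), {END: 1.0})

def beam_search(k=2, max_len=10, n_complete=2):
    beams = [((), 0.0)]   # (tokens so far, sum of log probabilities)
    completed = []
    for _ in range(max_len):              # stop at timestep T = max_len
        candidates = []
        for tokens, score in beams:
            for token, p in next_token_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(p)))
        # Keep only the k highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == END:
                completed.append((tokens, score))   # set aside
            else:
                beams.append((tokens, score))
        # Stop once n hypotheses are complete or nothing is left to expand.
        if len(completed) >= n_complete or not beams:
            break
    return completed

best_tokens, best_score = max(beam_search(k=2), key=lambda c: c[1])
```

With $k = 2$ the search keeps the "b" branch alive and returns the higher-probability sequence, while `beam_search(k=1)` behaves like greedy decoding on this toy table.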

Finishing up

We have our list of completed hypotheses
How to select the top one with the highest score?
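Each hypothesis $y_1, \ldots, y_t$ on our list has a score
Problem with this: hypotheses differ in their number of words, that is, in sequence length. A longer hypothesis multiplies together more probabilities (adds more negative log-probability terms), so longer hypotheses tend to get lower scores and shorter ones higher scores.
Fix: Normalize by length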
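
A small sketch of the effect of length normalization, assuming the score is the sum of per-token log probabilities and the normalized score is that sum divided by the number of tokens. The per-token probabilities are made up for illustration.

```python
import math

def raw_score(log_probs):
    """Sum of per-token log probabilities (always <= 0)."""
    return sum(log_probs)

def normalized_score(log_probs):
    """Average log probability per token."""
    return sum(log_probs) / len(log_probs)

# Hypothetical hypotheses: a short one with mediocre tokens,
# and a longer one whose individual tokens are more confident.
short_hyp = [math.log(0.5)] * 2
long_hyp = [math.log(0.7)] * 5

prefers_short_raw = raw_score(short_hyp) > raw_score(long_hyp)
prefers_long_normalized = normalized_score(long_hyp) > normalized_score(short_hyp)
```

Here the raw score favors the short hypothesis only because it has fewer terms, while the normalized score favors the longer hypothesis whose tokens are individually more probable.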

BLEU score

A method for evaluating the quality and accuracy of the output of natural language generation models.

Precision and Recall

There are three ways to take an average: the arithmetic mean, the geometric mean, and the harmonic mean.
The F-measure uses the harmonic mean.
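
The three means can be compared on a precision/recall pair as in the sketch below. It assumes the usual definitions (precision = matched words / words in the prediction, recall = matched words / words in the reference); the counts are illustrative only.

```python
import math

def arithmetic_mean(p, r):
    return (p + r) / 2

def geometric_mean(p, r):
    return math.sqrt(p * r)

def harmonic_mean(p, r):
    # The F-measure; defined as 0 when both inputs are 0.
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Illustrative counts: 7 matched words, 9 predicted words, 10 reference words.
matched, predicted_len, reference_len = 7, 9, 10
precision = matched / predicted_len
recall = matched / reference_len

f_measure = harmonic_mean(precision, recall)
```

The harmonic mean is never larger than the other two, so the F-measure is pulled toward the weaker of precision and recall.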

BLEU score

BiLingual Evaluation Understudy(BLEU)
- N-gram overlap between machine translation output and reference sentence
- Compute precision for n-grams of size one to four
- Add brevity penalty (for too short translations)
- Typically computed over the entire corpus, not on single sentences
- In the BLEU formula, the factor multiplied by the min term (the brevity penalty) is a geometric mean of the n-gram precisions.
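
A minimal single-sentence sketch of the computation described above, assuming the form BLEU = min(1, prediction length / reference length) × (product of the 1- to 4-gram precisions)^(1/4) with clipped n-gram counts. The function names are made up for this example, and standard toolkits differ in details such as the exact brevity penalty, smoothing, and corpus-level aggregation.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def bleu(candidate, reference, max_n=4):
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    # Brevity penalty: penalize predictions shorter than the reference.
    brevity = min(1.0, len(candidate) / len(reference))
    product = 1.0
    for p in precisions:
        product *= p
    # Geometric mean of the n-gram precisions.
    return brevity * product ** (1.0 / max_n)

reference = "the cat sat on the mat".split()
repeated = "the the the the the the".split()
```

Because of the geometric mean, a single n-gram precision of zero drives the whole score to zero, which is one reason BLEU is usually reported over a corpus rather than per sentence.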