DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Eunbin Park · September 23, 2022

Reference

Pengcheng He et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention (2020)

Introduction

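A Transformer uses self-attention to compute an attention weight for every input word in parallel.
Each attention weight measures how much one word influences another. This allows far more parallelization than RNNs when training large-scale models.
The years since 2018 have seen the rise of large-scale Transformer-based pre-trained language models (PLMs).
PLMs are fine-tuned with task-specific labels and have achieved SOTA on many NLP tasks.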
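To make "attention weights computed in parallel" concrete, here is a minimal single-head sketch of standard scaled dot-product self-attention in NumPy. The sizes and the random matrices `W_q`, `W_k`, `W_v` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over n token vectors H (n, d)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d = Q.shape[-1]
    # One matrix product yields all n x n pairwise scores at once (no recurrence).
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

# Toy inputs (assumed sizes, random values).
rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(H, W_q, W_k, W_v)
```

Row `i` of `weights` holds the influence of every word on word `i`, and each row sums to 1.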

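This paper proposes two novel techniques that improve on existing PLMs: a disentangled attention mechanism and an enhanced mask decoder.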

Disentangled Attention

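In BERT, each word in the input layer is represented by a single vector, the sum of its word embedding and its position embedding. This paper instead uses disentangled matrices based on content vectors and relative positions.
This rests on the observation that the attention weight of a word pair depends on both their contents and their relative positions.

e.g.) The dependency between the words "deep" and "learning" is much stronger when they occur next to each other than when they occur in different sentences.

Disentangled attention encodes content and position separately.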
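For reference, this is how I recall the paper writing the attention score between tokens $i$ and $j$. Here $H_i$ is the content vector of token $i$ and $P_{i|j}$ is its relative position vector with respect to token $j$; the notation follows the paper, not this post.

```latex
\begin{aligned}
A_{i,j} &= \{H_i,\, P_{i|j}\} \times \{H_j,\, P_{j|i}\}^{\top} \\
        &= \underbrace{H_i H_j^{\top}}_{\text{content-to-content}}
         + \underbrace{H_i P_{j|i}^{\top}}_{\text{content-to-position}}
         + \underbrace{P_{i|j} H_j^{\top}}_{\text{position-to-content}}
         + \underbrace{P_{i|j} P_{j|i}^{\top}}_{\text{position-to-position}}
\end{aligned}
```

As I understand it, DeBERTa drops the position-to-position term, since with relative position embeddings it adds little extra information.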

Enhanced Mask Decoder

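Like BERT, DeBERTa is pre-trained with masked language modeling (MLM).
DeBERTa uses the content and position information of the context words for MLM.
The disentangled attention mechanism already accounts for the contents and relative positions of the context words, but not for their absolute positions, which in many cases are crucial for the prediction.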

e.g.) In "a new store opened beside the new mall", suppose "store" and "mall" are masked. The two words have similar local contexts but play different syntactic roles in the sentence.

Such syntactic nuances depend largely on the words' absolute positions in the sentence, so it is important to account for them in the language modeling process.

DeBERTa incorporates absolute word position embeddings right before the softmax layer where the model decodes the masked words based on the aggregated contextual embeddings of word contents and positions.

The Enhanced Mask Decoder incorporates absolute positions in the decoding layer.

💡 Together, the two techniques improve training efficiency during pre-training and also improve performance on NLU and NLG downstream tasks.

The paper also proposes a new virtual adversarial training method for downstream fine-tuning, which is effective at improving the model's generalization.
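The idea of injecting absolute positions only at decoding time can be sketched as follows. This is a deliberately simplified illustration, not the paper's decoder: adding the position embedding to the final hidden state and reusing the word embedding matrix as the output projection are both assumptions, and `emd_predict` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def emd_predict(hidden, abs_pos_emb, word_emb):
    """hidden: (n, d) final contextual states built from content and relative position.
    abs_pos_emb: (max_len, d) absolute position table; word_emb: (vocab, d)."""
    n = hidden.shape[0]
    # Assumption: absolute positions are injected by simple addition right before
    # the softmax layer; the encoder layers below never see them.
    decoder_in = hidden + abs_pos_emb[:n]
    logits = decoder_in @ word_emb.T  # assumption: tied output embedding
    return softmax(logits)

rng = np.random.default_rng(0)
n, d, vocab = 8, 16, 50            # "a new store opened beside the new mall" has 8 tokens
hidden = rng.normal(size=(n, d))
hidden[7] = hidden[2]              # pretend "store" (2) and "mall" (7) got identical states
abs_pos_emb = rng.normal(size=(32, d))
word_emb = rng.normal(size=(vocab, d))
probs = emd_predict(hidden, abs_pos_emb, word_emb)
```

Even though positions 2 and 7 share the same contextual state in this toy setup, their predicted distributions differ because the absolute position enters before the softmax.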

Background

Transformer

A Transformer-based LM consists of a stack of Transformer blocks. Each block contains a multi-head self-attention layer followed by a fully connected positional feed-forward network.
The standard self-attention mechanism lacks a natural way to encode word position information. Existing approaches therefore add a positional bias to each input word embedding, so that each input word is represented by a vector that depends on both its content and its position.
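A minimal sketch of that BERT-style input, where a learned absolute position vector is added to each word embedding. The table sizes, the random values, and the `embed` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 100, 16, 8           # assumed toy sizes
word_emb = rng.normal(size=(vocab_size, d))   # one row per vocabulary item
pos_emb = rng.normal(size=(max_len, d))       # one row per absolute position

def embed(token_ids):
    """Return one vector per token: word embedding + absolute position embedding."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + pos_emb[positions]

# The same token id (5) at positions 0 and 2 ends up with different input vectors.
x = embed([5, 7, 5])
```

Content and position are entangled in a single vector here, which is the representation the disentangled attention mechanism moves away from.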

Positional Bias

The positional bias can be implemented with absolute position embeddings or relative position embeddings. Relative position representations have been shown to be more effective for NLU and NLG tasks.
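For relative positions, the paper (as I recall it) maps the signed distance between token i and token j into a bounded index in [0, 2k), where k is the maximum relative distance. The function name `relative_index` is my own; the sketch below follows that definition.

```python
def relative_index(i: int, j: int, k: int) -> int:
    """Bucketed relative distance from token i to token j, in the range [0, 2k)."""
    diff = i - j
    if diff <= -k:       # j is far to the right of i: clamp to the first bucket
        return 0
    if diff >= k:        # j is far to the left of i: clamp to the last bucket
        return 2 * k - 1
    return diff + k      # nearby tokens keep their exact offset

# With k = 4: same position -> 4, adjacent -> 5 or 3, far apart -> 0 or 7.
examples = [relative_index(i, j, 4) for i, j in [(0, 0), (3, 2), (2, 3), (0, 10), (10, 0)]]
```

The index selects a row of a shared relative position embedding table with 2k rows, so the table size does not grow with sequence length.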

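The proposed disentangled attention mechanism represents each word with two separate vectors that encode its content and its position, respectively, and computes the attention weights between words using disentangled matrices on their contents and relative positions.
This differs from all existing approaches.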
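The mechanism can be sketched for a single head in NumPy. This follows my reading of the paper's formulation (content-to-content, content-to-position, and position-to-content terms, scaled by sqrt(3d)); the function names, the weight dictionary `W`, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_scores(H, P, W, k):
    """Unscaled scores. H: (n, d) content vectors; P: (2k, d) relative position table."""
    n = H.shape[0]
    Qc, Kc = H @ W["qc"], H @ W["kc"]   # content projections
    Qr, Kr = P @ W["qr"], P @ W["kr"]   # relative position projections
    idx = np.arange(n)
    # delta[i, j] is the bucketed distance from i to j, clamped to [0, 2k - 1].
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)
    c2c = Qc @ Kc.T                                          # Qc[i] . Kc[j]
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)       # Qc[i] . Kr[delta(i, j)]
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T     # Kc[j] . Qr[delta(j, i)]
    return c2c + c2p + p2c

def disentangled_attention(H, P, W, k):
    d = H.shape[1]
    # Three score terms are summed, hence the sqrt(3d) scaling.
    weights = softmax(disentangled_scores(H, P, W, k) / np.sqrt(3 * d))
    return weights @ (H @ W["vc"]), weights

rng = np.random.default_rng(0)
n, d, k = 5, 8, 2
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
W = {name: rng.normal(size=(d, d)) for name in ("qc", "kc", "vc", "qr", "kr")}
out, weights = disentangled_attention(H, P, W, k)
```

Note that only the content vectors `H` flow into the values; position information influences the output solely through the attention weights.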
