[NLP] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (ACL, 2020)

누렁이 · March 22, 2023

2019 ACL BART Generation NLP

Facebook

Paper: https://aclanthology.org/2020.acl-main.703/
Code: https://github.com/pytorch/fairseq

Hugging Face: https://huggingface.co/docs/transformers/model_doc/bart

KoBART: https://github.com/SKT-AI/KoBART

Introduction

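Background: Denoising autoencoders
- Corrupt the text, for example by masking tokens, and train the model to recover it (BERT is the best-known example, right?)
Limitation of existing denoising autoencoders: each one is tailored to a particular kind of end task (e.g. span prediction, generation), which limits where it can be applied.
Goal: a pre-trained seq2seq LM that can be applied to a wide range of tasks
Method
- BART (Bidirectional and Auto-Regressive Transformers)
- A seq2seq denoising autoencoder that can be applied to a wide range of tasks.
  - The encoder is a bidirectional encoder (as in BERT)
  - The decoder is a left-to-right autoregressive decoder (as in GPT)
- Pre-training procedure
  - 1) Corrupt the text with a noising function and feed the corrupted text to the encoder.
  - 2) Train the seq2seq model so that the decoder reconstructs the original text from the corrupted input.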
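The two pre-training steps above amount to a reconstruction loss: the cross-entropy between the decoder's output and the original text. A minimal way to write it down, assuming notation that is not in the post itself ($x$ for the original token sequence of length $T$, $g$ for the noising function, $\tilde{x}$ for the corrupted text, $\theta$ for the model parameters):

```latex
\tilde{x} = g(x), \qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t},\, \tilde{x}\right)
```

The encoder reads all of $\tilde{x}$ bidirectionally, while the decoder predicts each $x_t$ from left to right given the tokens before it.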
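Advantages
- 1) Noising flexibility: the original text can be transformed arbitrarily, including changes to its length. Of the variants the authors tried, randomly shuffling the order of the original sentences combined with a novel in-filling scheme worked best. In-filling replaces a span of text of arbitrary length with a single mask token, which forces the model to reason more about overall sentence length and to make longer-range transformations of the input.
- 2) Strong on generation, and also good on comprehension: particularly effective when fine-tuned for generation tasks such as abstractive dialogue, question answering, and summarization, while also working well on comprehension tasks.
- 3) A new machine translation scheme: BART is stacked on top of a few additional transformer layers, which are trained to translate the foreign language into noised English that BART then denoises. This improved over a strong back-translation MT baseline.
- 4) Achieved results on par with or better than other state-of-the-art models.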

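Related work
- ELMo: the left-to-right and right-to-left representations do not interact with each other.
UniLM
- Difference from BART: in a seq2seq model each prediction conditions on the tokens predicted before it, whereas UniLM's predictions are conditionally independent.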
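The contrast in the UniLM note can be made concrete with two factorizations. This is only a sketch in assumed notation ($x$ for the input, $y$ for the target, $M$ for the set of masked positions, $x_{\setminus M}$ for the unmasked context), not a formula quoted from either paper:

```latex
% Autoregressive seq2seq decoding: each token conditions on the ones before it
p(y \mid x) = \prod_{t=1}^{T} p\left(y_t \mid y_{<t},\, x\right)

% Masked prediction with conditionally independent outputs
p(y_M \mid x_{\setminus M}) = \prod_{t \in M} p\left(y_t \mid x_{\setminus M}\right)
```

In the first form an earlier prediction changes the distribution of every later one. In the second, the masked tokens are predicted without seeing each other.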
XLNet
Summary:
- Work up to this point focused on the encoder, so the decoder side was comparatively weak.

Model

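Difference from BERT: each decoder layer additionally performs cross-attention over the final hidden layer of the encoder.
Any type of document corruption can be applied for denoising.
Isn't this similar to XLNet?
Noising methods
1) Token masking
- Random tokens are replaced with a mask token, as in BERT.
2) Token deletion
- Random tokens are deleted from the encoder input.
- The decoder has to work out which positions are missing.
3) Text infilling => reported to give the best performance.
- A whole span is replaced with a single mask token, so the model has to predict how many tokens are missing from the span.
4) Sentence permutation
- The sentences are shuffled into a random order.
5) Document rotation
- The document is rotated to start at a random token, and the model has to identify where the document originally starts.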
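The five noising methods above can be sketched in plain Python. This is a rough illustration, not the fairseq implementation: the function names, the `<mask>` string, the default probabilities, and the way span starts are chosen in `text_infilling` are all assumptions made for the example. The Poisson-distributed span length (with rate 3) follows the paper's description of text infilling.

```python
import math
import random

MASK = "<mask>"  # placeholder mask symbol (assumed for this sketch)


def token_masking(tokens, p=0.15, rng=random):
    """Replace randomly chosen tokens with MASK; the length is preserved."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p=0.15, rng=random):
    """Drop randomly chosen tokens; the model must infer where the gaps are."""
    return [t for t in tokens if rng.random() >= p]


def sample_poisson(lam, rng=random):
    """Draw a Poisson-distributed integer with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def text_infilling(tokens, mask_ratio=0.3, lam=3.0, rng=random):
    """Replace spans of Poisson(lam) length with a single MASK each.

    A span of length 0 inserts a MASK without removing any token.
    The span-start probability below is a simplification (assumption).
    """
    out, i = [], 0
    start_prob = mask_ratio / max(lam, 1.0)
    while i < len(tokens):
        if rng.random() < start_prob:
            out.append(MASK)
            i += sample_poisson(lam, rng)  # skip the whole span
        else:
            out.append(tokens[i])
            i += 1
    return out


def sentence_permutation(sentences, rng=random):
    """Shuffle the sentences of a document into a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, rng=random):
    """Rotate the document so that it begins at a randomly chosen token."""
    if not tokens:
        return []
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(text_infilling(tokens, rng=rng))
    print(document_rotation(tokens, rng=rng))
```

Because text infilling collapses a whole span into one mask, the corrupted input does not reveal how many tokens were removed, which is the point of the "how many are missing" note above.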

Experiment

1) Sequence classification
2) Token classification
3) Sequence generation
4) Machine translation
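What is different about BART's decoder
Earlier seq2seq models decode from a single encoded vector, which is a limitation. To get around it, an extra layer is added that maps each word separately, and the s token and each individual token are additionally passed in so that decoding works better.
=> It is nice that they pointed out a limitation in the decoding part of prior work!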

Result
