[Paper Review] A Contrastive Framework for Neural Text Generation

Gunsoo Han · February 22, 2022

A Contrastive Framework for Neural Text Generation


Motivation


  • The conventional approach trains a language model with maximum likelihood estimation (MLE) and decodes the most likely sequence
  • Despite its simplicity, this approach leads to the problem of degeneration;
    • Generated texts from the language model tend to be dull and contain undesirable repetitions at different levels (e.g., token-, phrase-, and sentence-level)
  • To alleviate this problem, previous solutions modify the decoding strategy by sampling from less likely tokens in the vocabulary.
    • Nucleus Sampling (a.k.a. top-p sampling)
      • Sort the vocabulary by probability, compute the cumulative distribution, and cut off as soon as the CDF exceeds p; then sample from the remaining tokens (a minimal sketch follows this list)
      • While nucleus sampling reduces repetition in the generated text, it introduces another critical problem of semantic inconsistency;

        The sampled text tends to diverge from or even contradict the original semantics defined by the human-written prefix

  • Another approach addresses the degeneration problem by modifying the model’s output vocabulary distribution with unlikelihood training
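As a reference, the cutoff step of nucleus sampling can be sketched as follows. This is a minimal PyTorch illustration (not the paper's code), where `logits` is assumed to be the model's next-token distribution at a single decoding step:

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.95) -> int:
    """Sample one token id from the smallest set of tokens whose cumulative probability exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cdf = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to (and including) the first position where the CDF exceeds p.
    cutoff = int(torch.searchsorted(cdf, torch.tensor(p)).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # renormalize the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice].item())
```

A lower p concentrates the sampling on high-probability tokens (less diverse, more repetitive), while p close to 1.0 approaches pure sampling, which connects to the coherence issue discussed above.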

Observation


  • Humans often choose words that surprise language models (Holtzman et al., 2019)

    Probability assigned to tokens generated by Beam Search and humans, given the same context

  • Furthermore, it is observed that the cosine similarities between token representations within a sentence are over 0.95, meaning that these representations are close to each other, as shown in Figure 1(a) (a sketch of how to measure this is given after this list)
    • Such high similarity is undesirable, as it can naturally cause the model to generate repetitive tokens at different steps, leading to degeneration
    • In an ideal setting,
      • the token representations of the model should follow an isotropic distribution, i.e., the token similarity matrix should be sparse and the representations of distinct tokens should be discriminative, as shown in Figure 1(b)
      • moreover, during decoding, the sparseness of the token similarity matrix should be preserved to avoid model degeneration.
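To make this observation concrete, the token similarity matrix can be computed directly from a pretrained GPT-2's last-layer hidden states. Below is a minimal sketch with Hugging Face transformers (my own illustration, not the paper's code); the example sentence is arbitrary:

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

text = "DeepMind Company is a London-based artificial intelligence company."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]        # [seq_len, hidden_dim]

hidden = hidden / hidden.norm(dim=-1, keepdim=True)      # L2-normalize each token vector
sim_matrix = hidden @ hidden.T                           # [seq_len, seq_len] cosine similarities

# Average off-diagonal similarity corresponds to the "self-similarity" examined in the ablation.
n = sim_matrix.size(0)
self_similarity = (sim_matrix.sum() - sim_matrix.diag().sum()) / (n * (n - 1))
print(self_similarity.item())
```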

Approach


Contrastive Training

  • Based on the above motivations, SimCTG (a simple contrastive framework for neural text generation) is proposed to encourage the model to learn discriminative and isotropic token representations.
  • Introduce a contrastive objective $\mathcal{L}_{CL}$ during training of the language model (a minimal PyTorch sketch of this loss is given after this list)
    • Mathematical Expression
      • $\mathcal{L}_{MLE} = -\frac{1}{|x|}\sum_{i=1}^{|x|} \log p_\theta(x_i \mid x_{<i})$
      • $\mathcal{L}_{CL} = \frac{1}{|x|(|x|-1)}\sum_{i=1}^{|x|}\sum_{j=1, j \neq i}^{|x|} \max\big\{0,\ \rho - s(h_{x_i}, h_{x_i}) + s(h_{x_i}, h_{x_j})\big\}$, where $\rho \in [-1, 1]$ is a pre-defined margin and $s(h_{x_i}, h_{x_j})$ is the cosine similarity between the token representations $h_{x_i}$ and $h_{x_j}$
      • The overall training objective is $\mathcal{L}_{SimCTG} = \mathcal{L}_{MLE} + \mathcal{L}_{CL}$
  • Also present a novel decoding strategy, contrastive search, which follows two principles (a one-step decoding sketch is also given after this list):
    1. At each decoding step, the output should be selected from the set of most probable candidates predicted by the model, to better maintain semantic coherence
    2. The sparseness of the token similarity matrix of the generated text should be preserved to avoid degeneration
  • Mathematical Expression
    • $x_t = \underset{v \in V^{(k)}}{\arg\max} \big\{ (1-\alpha)\, p_\theta(v \mid x_{<t}) - \alpha \cdot \max\{ s(h_v, h_{x_j}) : 1 \le j \le t-1 \} \big\}$
    • $V^{(k)}$ is the set of the $k$ most probable candidates predicted by the language model ($k$ is usually set to 3~10)
    • This implies that
      1. the model selects one of the most probable tokens $v$ (model confidence),
      2. while $v$ needs to be as dissimilar as possible from the tokens in $x_{<t}$ (degeneration penalty)
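A minimal PyTorch sketch of the contrastive term, assuming `hidden` holds the token representations of a single sequence taken from the model's last layer (this is my own illustration of the formula above, not the official SimCTG implementation):

```python
import torch

def contrastive_loss(hidden: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """hidden: [seq_len, dim] token representations of one sequence.
    Penalizes pairs of distinct tokens whose cosine similarity is too high."""
    seq_len = hidden.size(0)
    normed = hidden / hidden.norm(dim=-1, keepdim=True)
    sim = normed @ normed.T                                   # cosine similarity matrix
    # s(h_i, h_i) = 1 on the diagonal, so the hinge is max(0, margin - 1 + s(h_i, h_j)).
    hinge = torch.clamp(margin - sim.diag().unsqueeze(1) + sim, min=0.0)
    off_diag = ~torch.eye(seq_len, dtype=torch.bool)          # exclude the j == i terms
    return hinge[off_diag].sum() / (seq_len * (seq_len - 1))

# The full SimCTG objective simply adds this term to the usual MLE cross-entropy loss.
```

Note that with ρ = 0 the term vanishes (cosine similarity is at most 1), so training reduces to vanilla MLE.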
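And a simplified one-step sketch of the contrastive search rule, assuming a Hugging Face GPT-2 LM; each candidate's representation is obtained by re-running the model with that candidate appended (again my own illustration, not the official implementation, and not optimized for speed):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def contrastive_search_step(input_ids: torch.Tensor, k: int = 5, alpha: float = 0.6) -> int:
    """Pick the next token id for a single sequence `input_ids` of shape [1, t]."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
        probs = torch.softmax(out.logits[0, -1], dim=-1)
        context_h = out.hidden_states[-1][0]                   # [t, dim] context representations
        top_probs, top_ids = probs.topk(k)

        scores = []
        for prob, cand in zip(top_probs, top_ids):
            # Representation of the candidate token, conditioned on the context.
            cand_ids = torch.cat([input_ids, cand.view(1, 1)], dim=-1)
            cand_h = model(cand_ids, output_hidden_states=True).hidden_states[-1][0, -1]
            sim = torch.cosine_similarity(cand_h.unsqueeze(0), context_h, dim=-1).max()
            # (1 - alpha) * model confidence - alpha * degeneration penalty
            scores.append((1 - alpha) * prob - alpha * sim)

    return int(top_ids[torch.stack(scores).argmax()].item())
```

With α = 0 this reduces to greedy decoding; the ablation below reports that α between 0.6 and 0.7 gives human-level behavior.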

Experiment


Evaluate the GPT-2 base model (117M) trained with different objectives - MLE, unlikelihood training, and SimCTG - on two tasks with a wide range of metrics

Task 1. Document Generation

Evaluated on the WikiText-103 dataset

  • Language Modeling Quality
    • SimCTG achieves the best scores on both perplexity (ppl) and next-token accuracy (acc)
      • SimCTG learns discriminative representations for the generated text, so the model is less confused when predicting the next token
    • Unlikelihood training yields the best results on the degeneration metrics (rep, wrep), but at the expense of unfavorable drops in ppl and acc
  • Generation Quality
    • “SimCTG + contrastive search” is the optimal combination
    • “{MLE, Unlikelihood} + contrastive search” also boosts results compared with greedy/beam search (the n-gram repetition metric used here is sketched below)
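For reference, the n-gram repetition metric (rep-n) reported for generation quality is commonly computed as the portion of duplicate n-grams in the generated text; a minimal sketch of that standard definition (my own illustration):

```python
def rep_n(tokens, n=2):
    """rep-n = 100 * (1 - |unique n-grams| / |n-grams|), i.e. the portion of duplicate n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

# Heavy phrase-level repetition yields a high rep-2 (about 45.5 here).
print(rep_n("the cat sat on the mat the cat sat on the mat".split()))
```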

Task 2. Open-ended Dialogue Generation

Evaluated on the LCCC (Chinese) and DailyDialog (English) datasets

  • “SimCTG + contrastive search” is once again the best combination
  • “MLE + contrastive search” can even be a better option than SimCTG-based combinations

Ablation


  • Self-Similarity (Figure 2)

    • In the intermediate layers (0~11), the self-similarity scores of the different models are roughly the same.
    • In contrast, at the output layer (layer 12), SimCTG’s self-similarity becomes notably lower than that of the other baselines
  • Effect of Margin $\rho$ (Figure 3)
    • A margin of $\rho = 0.5$ shows the best result

  • Contrastive Search vs Nucleus Sampling (Figure 4)
    • Human performance is marked in purple
    • Nucleus Sampling
      • To reach human-level diversity, it requires a high $p \sim 1.0$, but this leads to high ppl, and vice versa
    • Contrastive Search
      • $0.6 < \alpha < 0.7$ leads to human-level performance
  • Latency (Figure 5)
    • Contrastive Search is more or less as efficient as beam search

Visualization

a. Very dense similarity matrix → the token representations are not discriminative
b. Much sparser, but still shows some dense regions along the diagonal
c. The entire matrix is sparse and isotropic, “successfully solving degeneration”
