[Paper Review] Multilingual machine translation with large language models: Empirical results and analysis

AFL · March 6, 2024

Multilingual machine translation with large language models: Empirical results and analysis

Wenhao Zhu, 29 Oct 2023

The paper analyzes the advantages and challenges of using LLMs for multilingual machine translation (MMT). The study is organized around the following two questions.

RQ1) How do LLMs perform MMT over massive languages?
- The multilingual translation capability of LLMs keeps improving, and GPT-4 reaches the highest scores.
- LLMs still fall short of commercial translation systems, especially on low-resource languages.

RQ2) Which factors affect the performance of LLMs?
- Even when given an unreasonable instruction, an LLM can still translate if in-context learning exemplars are provided. (This shows the importance of ICL in MT.)
- Cross-lingual translation pairs can be good exemplars for low-resource translation.
- LLMs can acquire translation ability in a resource-efficient way and generate moderate translations even for zero-resource languages.

scaling law of LLMs:
- LLMs become stronger as the number of neural parameters and the amount of training data increase.
emergent abilities:
- In-context learning enables LLMs to learn target tasks from the prompt without updating any parameters.

dataset
- Flores-101
LLMs
- Eight models: XGLM-7.5B (Lin et al., 2022), OPT-175B (Zhang et al., 2022), BLOOMZ-7.1B (Scao et al., 2022), Falcon-7B (Almazrouei et al., 2023), LLaMA2-7B (Touvron et al., 2023), LLaMA2-7B-chat (Touvron et al., 2023), ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023)
ICL strategy
- in-context exemplars: 8 randomly picked translation pairs from the corresponding development set
- in-context template: <X>=<Y>
supervised baselines
- M2M-100-12B
- NLLB-1.3B (distillation version)
- Google Translate
metrics
- spBLEU
- COMET
- SEScore
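
The ICL strategy in the setup above (8 random exemplars from the development set, joined with the <X>=<Y> template) can be sketched roughly as follows. This is only an illustrative sketch: `build_prompt`, the toy `dev_pairs` list, and the fixed seed are hypothetical names and choices, not the authors' code.

```python
import random


def build_prompt(dev_pairs, source_sentence, k=8, seed=0):
    """Build a few-shot translation prompt with the <X>=<Y> template.

    dev_pairs is a list of (source, target) tuples; k exemplars are
    drawn at random without replacement. The seed is an assumption
    made here only to keep the example reproducible.
    """
    rng = random.Random(seed)
    exemplars = rng.sample(dev_pairs, k)
    lines = [f"{src}={tgt}" for src, tgt in exemplars]
    # The last line leaves <Y> empty so the model continues after "=".
    lines.append(f"{source_sentence}=")
    return "\n".join(lines)


# Toy stand-in for a development set (placeholder sentences).
dev_pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(20)]
prompt = build_prompt(dev_pairs, "Guten Morgen.")
print(prompt)
```

The prompt has k exemplar lines followed by one open line for the sentence to translate.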

the multilingual translation capabilities of LLMs are continually improving
- Among the LLMs, GPT-4 scores the highest.

LLM capability is unbalanced across languages.
- Performance drops when translating into non-English languages.
LLMs still lag behind the strong supervised baseline, especially on low-resource languages
- GPT-4 does well, but it falls behind NLLB, and the gap is larger on low-resource languages.
Data leakage should be considered before evaluating LLMs on public datasets.
- BLOOMZ was instruction-tuned on the xP3 dataset, which includes Flores-200 data. Because it was exposed to Flores-101 during training, it scores higher than the other LLMs.
- When evaluated on NEWS2023 data, which does not overlap with the training data, BLOOMZ's performance drops sharply.

This part analyzes which factors affect the translation performance of LLMs. The analysis uses XGLM-7.5B as the reference model.

LLMs can translate in a resource-efficient way.
- The paper analyzes the relationship between performance and corpus size.
- For low-resource languages, an LLM can build a bilingual mapping between a non-English language and English from only a small amount of non-English monolingual data.
- Even unseen languages can be translated through ICL.
- => Translation is possible in a resource-efficient way.

Good LLM performance depends on a carefully designed template.
- BLEU differs by up to 16 points depending on the template.
- <X>=<Y> gives the highest BLEU; [SRC]: <X>\n [TGT]: <Y> gives the lowest.
An unreasonable template can also instruct the LLM.
- Translation works with <X> can be translated to <Y>.
- It also works with the unreasonable template <X> can be summarized as <Y>.
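
To make the template comparison above concrete, the sketch below renders the same exemplars under the four templates mentioned in these notes. The dictionary keys and the helper names (`fill`, `query_line`, `render_prompt`) are hypothetical, and joining exemplars with a newline is an assumption rather than something the notes specify.

```python
import re

# Templates quoted in the notes above; the dictionary keys are made up here.
TEMPLATES = {
    "equals": "<X>=<Y>",
    "src_tgt": "[SRC]: <X>\n [TGT]: <Y>",
    "translated": "<X> can be translated to <Y>",
    "summarized": "<X> can be summarized as <Y>",
}


def fill(template, x, y):
    """Substitute <X> and <Y> in one pass so the inserted text is never re-scanned."""
    values = {"X": x, "Y": y}
    return re.sub(r"<([XY])>", lambda m: values[m.group(1)], template)


def query_line(template, x):
    """Keep the template only up to <Y>, leaving the target for the model to write."""
    prefix = template.partition("<Y>")[0]
    return prefix.replace("<X>", x)


def render_prompt(template, exemplars, x):
    parts = [fill(template, src, tgt) for src, tgt in exemplars]
    parts.append(query_line(template, x))
    return "\n".join(parts)


exemplars = [("Guten Morgen.", "Good morning."), ("Danke.", "Thank you.")]
for name, template in TEMPLATES.items():
    print(f"--- {name} ---")
    print(render_prompt(template, exemplars, "Wie geht es dir?"))
```

Only the template string changes between runs, so any difference in BLEU can be attributed to the template rather than to the exemplars.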

cross lingual exemplars help for certain translation directions
- For De-En translation, using cross-lingual exemplars lowers translation performance.
- For Zh-En translation, using cross-lingual exemplars raises translation performance.
- => This suggests that cross-lingual exemplars may be useful in a broader range of tasks.
Semantically related exemplars are not better than randomly picked exemplars.
- Exemplars were selected from the development set, which serves as a high-quality candidate pool.
- 4 ways of selecting in-context exemplars
  - Random
  - BM25
  - TopK
  - Oracle
- The paper plots the BLEU trend of the four selection strategies above as the number of exemplars goes from 1 to 8. BLEU rises as the number of exemplars increases, and translation performance reaches a plateau regardless of the selection strategy.
- With too many exemplars, for example around 32, BLEU drops.
  Randomly picked exemplars perform better.
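
Two of the selection strategies listed above, Random and BM25, can be sketched in plain Python as below. This is a minimal illustration, not the paper's implementation: the whitespace tokenizer, the Okapi BM25 variant with `k1=1.5` and `b=0.75`, and all function names are assumptions. TopK and Oracle are left out because these notes do not say how they score candidates.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, just for illustration.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_random(pool, k, seed=0):
    """Random strategy: k pairs drawn uniformly from the candidate pool."""
    return random.Random(seed).sample(pool, k)


def select_bm25(pool, source_sentence, k):
    """BM25 strategy: the k pairs whose source side best matches the input."""
    scores = bm25_scores(source_sentence, [src for src, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


# Toy candidate pool standing in for a development set.
pool = [
    ("the cat sat on the mat", "die Katze saß auf der Matte"),
    ("stock markets fell sharply today", "die Aktienmärkte fielen heute stark"),
    ("a cat chased the mouse", "eine Katze jagte die Maus"),
    ("rain expected tomorrow morning", "morgen früh wird Regen erwartet"),
]
print(select_random(pool, 2))
print(select_bm25(pool, "the cat is sleeping", 2))
```

Both functions return (source, target) pairs, so either can feed the same prompt-building step when the strategies are compared.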

exemplars teach the LLM the core features of the translation task
- The translation granularity of the exemplars matters for translation performance.
- Keeping in-context exemplars diverse is important.
- The LLM learns the core features of the translation task through in-context learning.

exemplars in the tail of the prompt have more impact on the LLM's behavior
- Reversing the translation direction of the exemplars makes the LLM fail.
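
One way to probe the position effect noted above is to reverse the translation direction of only some exemplars, at the head or at the tail of the prompt, and compare the outputs. The helper below is a hypothetical sketch of that perturbation; `reverse_direction` and its arguments are assumptions, not the paper's code.

```python
def reverse_direction(exemplars, where, n):
    """Swap source and target for n exemplars at the head or tail of the list.

    exemplars is an ordered list of (source, target) pairs, in the same
    order in which they would appear in the prompt.
    """
    if where not in ("head", "tail"):
        raise ValueError("where must be 'head' or 'tail'")
    if not 0 <= n <= len(exemplars):
        raise ValueError("n must be between 0 and len(exemplars)")
    start = 0 if where == "head" else len(exemplars) - n
    flipped = set(range(start, start + n))
    return [
        (tgt, src) if i in flipped else (src, tgt)
        for i, (src, tgt) in enumerate(exemplars)
    ]


exemplars = [("de1", "en1"), ("de2", "en2"), ("de3", "en3"), ("de4", "en4")]
print(reverse_direction(exemplars, "head", 2))
print(reverse_direction(exemplars, "tail", 2))
```

Keeping the exemplar set fixed and changing only where the reversed pairs sit isolates the effect of position from the effect of exemplar content.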
