G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment (Summary)

Plato · January 4, 2024


Task description
- e.g. "You will be given one summary written for a news article. Your task is to rate the summary on one metric. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed."
Evaluation criteria
- e.g. "Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby “the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.”"
"Auto Chain-of-Thoughts" (Auto CoT) generated by the evaluator LLM
- A prompt that feeds the task description, the evaluation criteria, and the line "Evaluation Steps:" to the LLM so that the LLM writes out the evaluation steps on its own.
- When I tried Auto CoT with ChatGPT 3.5, it did not produce any evaluation steps. Once I added "Generate evaluation steps" before "Evaluation Steps:", it printed detailed evaluation steps.
A prompt that restricts the output to the score only
- "{{placeholder}} (score only):"
  - The placeholder is replaced with the evaluation criterion.
  - This is not in the paper, but I found it in the authors' repo.
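The four prompt components above can be put together roughly as follows. This is only a minimal sketch, not the authors' actual template: the function names, the section labels such as "Evaluation Criteria:", "Source Text:", and "Summary:", and the way the pieces are joined are all assumptions made for illustration. Only the "Generate evaluation steps" nudge, the "Evaluation Steps:" line, and the "(score only):" suffix come from the notes above.

```python
# Hypothetical helpers for illustration; not the authors' code.

TASK_DESCRIPTION = (
    "You will be given one summary written for a news article. "
    "Your task is to rate the summary on one metric."
)
CRITERIA = "Coherence (1-5) - the collective quality of all sentences."


def build_auto_cot_prompt(task_description, criteria,
                          nudge="Generate evaluation steps"):
    """Build the prompt that asks the LLM to write its own evaluation steps.

    `nudge` is the extra instruction placed before "Evaluation Steps:".
    Pass None to leave it out.
    """
    parts = [task_description, "Evaluation Criteria:", criteria]
    if nudge:
        parts.append(nudge)
    parts.append("Evaluation Steps:")
    return "\n\n".join(parts)


def build_scoring_prompt(task_description, criteria, evaluation_steps,
                         source, summary, metric_name):
    """Build the final prompt, ending with the score-only line."""
    parts = [
        task_description,
        "Evaluation Criteria:", criteria,
        "Evaluation Steps:", evaluation_steps,  # text generated via Auto CoT
        "Source Text:", source,                 # assumed label
        "Summary:", summary,                    # assumed label
        f"{metric_name} (score only):",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_auto_cot_prompt(TASK_DESCRIPTION, CRITERIA))
```

The evaluation steps are generated once per criterion and then reused in every scoring prompt, so the two builders are kept separate.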

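Both weighting each score by its probability and summing, and Auto CoT, improve evaluation performance
- They increase the correlation coefficient between LLM and human ratings.
- CoT appears to have been included in the prompt only when GPT-4 was the evaluator.
  - Chiang and Lee (2023) point out that performance sometimes got worse when Auto CoT was tried with ChatGPT instead of GPT-4. Given this, the benefit of CoT is likely to be marginal unless a very large model such as GPT-4 is used.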
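The probability-weighted score mentioned above amounts to an expected value over the candidate scores: each score is multiplied by the probability the model assigns to it, and the products are summed. The sketch below shows the arithmetic under some assumptions: the function names are hypothetical, the input is assumed to be a mapping from score tokens to log-probabilities, and the renormalization over the score tokens is my own addition for the case where some probability mass falls on other tokens. The second helper covers the case where token probabilities are not available and the probabilities are estimated from repeated sampling instead.

```python
import math
from collections import Counter


def probability_weighted_score(token_logprobs, scale=(1, 2, 3, 4, 5)):
    """Expected score from per-token log-probabilities.

    token_logprobs: dict such as {"4": -0.69, "5": -1.20} (assumed format).
    Scores missing from the dict are treated as having zero probability.
    """
    probs = {}
    for s in scale:
        lp = token_logprobs.get(str(s))
        if lp is not None:
            probs[s] = math.exp(lp)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score token")
    # Renormalize over the score tokens (an assumption of this sketch).
    return sum(s * p for s, p in probs.items()) / total


def score_from_samples(sampled_scores):
    """Estimate the same quantity from repeatedly sampled integer scores."""
    if not sampled_scores:
        raise ValueError("need at least one sampled score")
    counts = Counter(sampled_scores)
    n = len(sampled_scores)
    return sum(s * c / n for s, c in counts.items())


if __name__ == "__main__":
    logprobs = {"3": math.log(0.2), "4": math.log(0.5), "5": math.log(0.3)}
    print(probability_weighted_score(logprobs))  # roughly 4.1
    print(score_from_samples([4, 4, 5, 3]))
```

Compared with taking only the single most likely score, the weighted version yields continuous values, which reduces ties between outputs when computing correlation with human ratings.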
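Performance is fairly close to UniEval
- UniEval is a T5 model fine-tuned for evaluation tasks.
- The largest T5 has about 11 billion parameters, while GPT-4 is speculated to have somewhere between 600 billion and 1 trillion. Given that gap in size, the similar performance suggests that fine-tuning can still yield larger gains than in-context learning.
- This is a little disappointing, considering that Med-Prompt opened a fairly large gap over Med-PaLM 2 with in-context learning alone. That said, Med-Prompt combined a wider variety of in-context learning methods than G-Eval does, so there may be ways to push G-Eval's performance higher without fine-tuning.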