NLP Subtask ③: Extractive Summarization

제목없음 · February 23, 2022

This post takes a brief look at extractive summarization, one of the subfields of NLP.

What is NLG?

Natural language generation (NLG) is a software process that produces natural language output. Common applications of NLG methods include the production of various reports, for example weather and patient reports; image captions; and chatbots.
-Wikipedia-

A machine generating the language that humans use
Divided into abridgement, augmentation, reconstruction, and so on
Used in chatbots and report generation

This post looks at extractive summarization, which is one kind of abridgement.

📌Extractive Summarization

Given a document, selecting a subset of the words or sentences which best represents a summary of the document.
-Papers with Code-

There are two types of summarization.

Abstractive Summarization: generates sentences using new words
Extractive Summarization: summarizes by extracting from the existing document

🧐Problem Definition

As mentioned above, extractive summarization takes the sentences that best represent a given document and copies them verbatim from that document.
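To make the definition concrete, here is a minimal sketch of an extractive summarizer. It is not one of the models covered in this post: the word-frequency scoring, the naive sentence splitter, and the names `split_sentences` and `extractive_summary` are all assumptions chosen for illustration. The point is only that the output consists of sentences copied unchanged from the input.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, copied verbatim, in document order."""
    sentences = split_sentences(text)
    # Document-level word frequencies act as a crude importance signal (assumption).
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return [sentences[i] for i in chosen]


doc = (
    "The storm hit the coast on Monday. "
    "The storm damaged hundreds of homes along the coast. "
    "Officials will meet next week."
)
summary = extractive_summary(doc, k=1)
print(summary)
```

Every returned sentence is a substring of `doc`, which is exactly what separates extractive from abstractive summarization.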

Possible Applications

When news articles or papers in a particular field pour out every day, a service that uses them to detect hot issues on a regular basis

💾Dataset: CNN/Daily Mail

CNN/Daily Mail is a dataset made up of news articles
Consists of 286,817 training examples, 13,368 validation examples, and 11,487 test examples
The <eos> tag marks the boundary between two highlights
Entities are anonymized during preprocessing
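The highlight format described above can be handled with a few lines of code. The following is a small sketch, assuming a summary arrives as one string with `<eos>` between highlights. The helper name `split_highlights` and the sample string, including its `@entityN` placeholders, are made up for illustration and are not taken from the dataset.

```python
def split_highlights(summary, tag="<eos>"):
    """Split a summary string into individual highlights at each boundary tag."""
    return [part.strip() for part in summary.split(tag) if part.strip()]


# Made-up example; the placeholder style for anonymized entities is an assumption.
raw = "@entity1 wins the final <eos> @entity2 finishes second <eos>"
highlights = split_highlights(raw)
print(highlights)
```

Empty pieces are dropped, so a trailing `<eos>` does not produce an empty highlight.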

🏆SOTA Model ①: HAHSum

Main idea

Encode the source document
Identify possible compressions based on sentence parsing
Assign a compression score to each and generate the final summary

🏆SOTA Model ②: MatchSum

Main idea

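Vectorize the document, the candidate summaries, and the gold summary, and project them into a shared space
Among the candidate summaries, look for the one that lies about as close to the document as the gold summary does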
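The matching idea above can be sketched in a few lines. This is a toy illustration, not the MatchSum implementation: a bag-of-words count vector stands in for the learned neural encoder, candidates are simply all combinations of `size` sentences, and the names `embed`, `cosine`, and `best_candidate` are hypothetical. The gold summary is left out because it is only available at training time; the sketch shows the selection step, where the candidate closest to the document is picked.

```python
import math
import re
from collections import Counter
from itertools import combinations


def embed(text):
    """Stand-in embedding: bag-of-words counts (assumption; not a neural encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_candidate(sentences, size=2):
    """Pick the candidate summary whose vector is closest to the document's."""
    document = embed(" ".join(sentences))
    # Each candidate is a whole summary (a set of sentences), not a single sentence.
    candidates = [" ".join(c) for c in combinations(sentences, size)]
    return max(candidates, key=lambda c: cosine(embed(c), document))


sentences = [
    "The storm hit the coast on Monday.",
    "The storm damaged hundreds of homes along the coast.",
    "Officials will meet next week.",
]
best = best_candidate(sentences, size=2)
print(best)
```

Scoring whole candidate summaries rather than individual sentences is the design choice worth noticing here: two sentences that each score well alone may be redundant together, and matching at the summary level takes that into account.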

[Image sources]
Nallapati, Ramesh, et al. "Abstractive text summarization using sequence-to-sequence rnns and beyond." arXiv preprint arXiv:1602.06023 (2016).

Zhong, Ming, et al. "Extractive summarization as text matching." arXiv preprint arXiv:2004.08795 (2020).
