English Text Preprocessing in Python

이주현 · November 30, 2023


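Text Preprocessing

Text preprocessing standardizes raw text ahead of time so that it suits the natural language processing task at hand.
The goal is to preserve the information in the text while removing redundancy, which makes the analysis more efficient.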

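1. Tokenization (Tokenizing)

Splits text into units that natural language processing can work with.
Tokenization is divided into "word tokenization," which splits text into words, and "sentence tokenization," which splits it into sentences.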
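
To make the two kinds of tokenization concrete, here is a minimal sketch that uses only the standard `re` module. The function names and the regular expressions are my own simplifications for illustration, not what NLTK does internally; for example, the sentence splitter will break on abbreviations such as "Dr.".

```python
import re

def sentence_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize_simple(text):
    # A word (optionally joined by hyphens or apostrophes) or a single punctuation mark.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "It is the present-day darling of the tech world. Is it overhyped?"
sentences = sentence_tokenize(sample)
words = word_tokenize_simple(sentences[0])

print(sentences)
print(words)
```

Sentence tokenization runs first here, and word tokenization is then applied to one sentence at a time.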

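2. Part-of-Speech Tagging (POS Tagging)

Adds part-of-speech information to each token.
Used to drop parts of speech that are not needed for analysis (e.g., particles, conjunctions) or to filter for the ones that are.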
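
Once tokens carry tags, filtering is just a matter of comparing tags. The sketch below assumes hand-written (token, tag) pairs in Penn Treebank style; the helper names `keep_by_pos` and `drop_by_pos` are hypothetical and exist only for this illustration.

```python
# Hand-written (token, tag) pairs; in practice a POS tagger produces these.
tagged_tokens = [
    ("James", "NNP"), ("is", "VBZ"), ("working", "VBG"),
    ("at", "IN"), ("Disney", "NNP"), ("in", "IN"), ("London", "NNP"),
]

def keep_by_pos(pairs, prefixes):
    """Keep tokens whose tag starts with any of the given prefixes."""
    prefixes = tuple(prefixes)
    return [token for token, tag in pairs if tag.startswith(prefixes)]

def drop_by_pos(pairs, tags):
    """Drop tokens whose tag is exactly one of the given tags."""
    tags = set(tags)
    return [token for token, tag in pairs if tag not in tags]

nouns = keep_by_pos(tagged_tokens, ["NN"])            # NN, NNS, NNP, NNPS
without_prepositions = drop_by_pos(tagged_tokens, ["IN"])

print(nouns)
print(without_prepositions)
```

Matching on a tag prefix is convenient because related tags share one (all noun tags start with NN, all verb tags with VB).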

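3. Named Entity Recognition (NER)

Attaches an entity-type tag (organization, person, location, date, etc.) to each token.
Used to determine what the text is referring to.
For example, telling apple the fruit apart from Apple the company is a named entity recognition task.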

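4. Stemming & Lemmatization

Restores each token to its base form, which standardizes tokens and prevents unnecessary duplication in the data (fewer distinct words means cheaper computation).
Stemming: ignores part of speech and strips a word down to its stem using rules.
Lemmatization: uses part-of-speech information to return the dictionary form (lemma).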
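
The difference between the two approaches can be sketched with a toy example. Everything below is an assumption made for illustration: the suffix rules are far cruder than the Porter algorithm, and the lemma table is a hand-written stand-in for a real dictionary such as WordNet.

```python
# Toy stemmer: apply the first matching suffix rule, ignoring part of speech.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 characters to remain so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Toy lemmatizer: look up (word, part of speech) in a small hand-written table.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("believes", "v"): "believe",
    ("studies", "n"): "study",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return LEMMA_TABLE.get((word, pos), word)

for w in ["running", "believes", "studies"]:
    print(w, "| stem:", toy_stem(w), "| lemma as verb:", toy_lemmatize(w, "v"))
```

The stemmer can return strings that are not words (such as "studi"), while the lemmatizer returns real dictionary forms but only when it is told the right part of speech.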

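5. Stopword Removal (Stopword)

Removes elements that are not needed for natural language processing.
It consists of two tasks: removing unneeded parts of speech and removing unneeded words.
Removing unneeded tokens makes computation more efficient.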
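
As a rough sketch of word-based stopword removal, the snippet below counts token frequencies and then filters against a stopword set. The token list and the `STOP_WORDS` set are made up for this example; a real project would choose them from its own data.

```python
from collections import Counter

tokens = ["It", "is", "the", "present-day", "darling",
          "of", "the", "tech", "world", "."]

# Hypothetical stopword set, compared case-insensitively.
STOP_WORDS = {"it", "is", "the", "of", "."}

# Frequent tokens are the usual candidates for the stopword list.
freq = Counter(t.lower() for t in tokens)
print(freq.most_common(1))

filtered = [t for t in tokens if t.lower() not in STOP_WORDS]
print(filtered)
```

Using a set for the stopwords keeps each membership check cheap even when the list grows.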

Hands-on Practice with English Text

1.1 Collecting an English News Article

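import requests
from bs4 import BeautifulSoup

url = 'URL of the article to fetch'  # placeholder: replace with a real article URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Select every <p> element; the paragraph index (3) depends on the page layout
eng_news = soup.select('p')
eng_text = eng_news[3].get_text()

eng_text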
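
Picking a single paragraph by index is brittle, so it can help to look at all paragraph text first. The sketch below is an offline illustration that swaps in the standard library's `html.parser` for requests and BeautifulSoup; the `ParagraphCollector` class and the sample HTML string are hypothetical.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text content of every non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

sample_html = (
    "<html><body><p>  </p>"
    "<p>It is the <b>present-day</b> darling of the tech world.</p>"
    "<p>Second paragraph.</p></body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
article_text = " ".join(collector.paragraphs)
print(collector.paragraphs)
```

Printing the whole list makes it easy to see which index, if any, holds the article body before committing to one.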

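1.2 English Tokenization

!pip install nltk
import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('omw-1.4')

word_tokenize() = splits text into words and punctuation marks (periods, commas, question marks, semicolons, exclamation marks, and so on become separate tokens)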

import nltk
from nltk.tokenize import word_tokenize

token1 = word_tokenize(eng_text)
print(token1)
=> ['It', 'is', 'the', 'present-day', 'darling',
'of', 'the', 'tech']

WordPunctTokenizer() = tokenizes by separating runs of word characters (letters, digits, underscore) from runs of other non-whitespace characters

import nltk
from nltk.tokenize import WordPunctTokenizer

wordpuncttoken = WordPunctTokenizer().tokenize(eng_text)
print(wordpuncttoken)
=>['It', 'is', 'the', 'present','-', 'day', 'darling', 
'of', 'the', 'tech']
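
The split of "present-day" into three tokens follows from the pattern this tokenizer is documented to use, `\w+|[^\w\s]+`. The sketch below reproduces that behavior with plain `re.findall`; the function name `wordpunct_like` is my own and this is not NLTK's actual implementation.

```python
import re

# Either a run of word characters, or a run of non-word, non-space characters.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_like(text):
    return re.findall(WORDPUNCT_PATTERN, text)

hyphenated = wordpunct_like("It is the present-day darling of the tech world.")
contraction = wordpunct_like("Don't!!")

print(hyphenated)
print(contraction)
```

Because hyphens and apostrophes are not word characters, hyphenated words and contractions are always broken apart, which is worth keeping in mind when choosing a tokenizer.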

TreebankWordTokenizer() = regular-expression-based tokenization that follows Penn Treebank conventions

import nltk
from nltk.tokenize import TreebankWordTokenizer

treebankwordtoken=TreebankWordTokenizer().tokenize(eng_text)
print(treebankwordtoken)
=> ['It', 'is', 'the', 'present-day', 'darling',
'of', 'the', 'tech']

1.3 English POS Tagging

Attach a part-of-speech tag to each token: https://www.nltk.org/api/nltk.tag.html
Tag list: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

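from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')

# Example sentence, inferred from the output shown below
word_tokens = word_tokenize("James is working at Disney in London")
taggedToken = pos_tag(word_tokens)
print(taggedToken)
=> [('James', 'NNP'), ('is', 'VBZ'), ('working', 'VBG'),
('at', 'IN'), ('Disney', 'NNP'),
('in', 'IN'), ('London', 'NNP')]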

1.4 Named Entity Recognition (NER)

http://www.nltk.org/api/nltk.chunk.html

Named entity recognition means, literally, recognizing entities that have names.
Given a word that names something, it identifies what type of thing that word refers to.

nltk.download('words')
nltk.download('maxent_ne_chunker')

from nltk import ne_chunk
neToken = ne_chunk(taggedToken)
print(neToken)

=> (S
  (PERSON James/NNP)
  is/VBZ
  working/VBG
  at/IN
  (ORGANIZATION Disney/NNP)
  in/IN
  (GPE London/NNP))

James is tagged PERSON, Disney is tagged ORGANIZATION, and London is tagged GPE (geopolitical entity, i.e., a location).
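
To turn a result like this into a flat list of (entity, type) pairs, one option is to read the bracketed text directly. The sketch below does that with a regular expression; the pattern and the `extract_entities` helper are hypothetical and assume the exact printed format shown above. With the actual tree object returned by ne_chunk, you would more likely walk its subtrees and read each one's label() and leaves().

```python
import re

chunk_text = """(S
  (PERSON James/NNP)
  is/VBZ
  working/VBG
  at/IN
  (ORGANIZATION Disney/NNP)
  in/IN
  (GPE London/NNP))"""

# "(LABEL word/TAG word/TAG ...)" where no word or tag contains spaces or parentheses.
ENTITY_PATTERN = re.compile(r"\((\w+) ((?:[^\s()]+/[^\s()]+ ?)+)\)")

def extract_entities(text):
    entities = []
    for label, body in ENTITY_PATTERN.findall(text):
        # Drop the POS tag after the last slash of each word/TAG pair.
        words = [pair.rsplit("/", 1)[0] for pair in body.split()]
        entities.append((" ".join(words), label))
    return entities

entities = extract_entities(chunk_text)
print(entities)
```

The outer (S ...) node is skipped because it contains nested chunks rather than word/TAG pairs directly after its label.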

1.5 Stemming and Lemmatization

Restore each token to its base form to standardize it.
Stemming
Standardizes tokens using rules, such as stripping -ing or -ful.
Rule details: https://www.nltk.org/api/nltk.stem.html

from nltk.stem import PorterStemmer
ps = PorterStemmer()

# Stem each sample word and print it next to the original
for w in ["running", "believes", "using", "conversation", "organization", "studies"]:
  print(w + " -> " + ps.stem(w))

=>
running -> run
believes -> believ
using -> use
conversation -> convers
organization -> organ
studies -> studi

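Lemmatization

Standardizes tokens while taking part-of-speech information into account.
WordNetLemmatizer.lemmatize() treats a word as a noun unless a pos argument is given, which is why "running" and "using" come back unchanged in this example; passing pos="v" returns "run" and "use".
https://www.nltk.org/api/nltk.stem.html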

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()

# Lemmatize each sample word with the default part of speech (noun)
for w in ["running", "believes", "using", "conversation", "organization", "studies"]:
  print(w + " -> " + wl.lemmatize(w))

=>
running -> running
believes -> belief
using -> using
conversation -> conversation
organization -> organization
studies -> study

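Stopword Removal (Stopword)

Most frequent tokens: look up the most frequent tokens to choose candidates for stopword removal.

# POS tags to treat as stopwords
stopPos = ["IN", "CC", "UH", "TO", "MD", "DT", "VBZ", "VBP"]

# Tag the article tokens from section 1.2 before counting
taggedToken = pos_tag(token1)

from collections import Counter
Counter(taggedToken).most_common()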
=>

[(('its', 'PRP$'), 5),
 (('of', 'IN'), 3),
 (('the', 'DT'), 2),
 (('.', '.'), 2),
 (('(', '('), 2),
 (('AI', 'NNP'), 2),
 ((')', ')'), 2),
 (('It', 'PRP'), 1),
 (('is', 'VBZ'), 1),
 (('present-day', 'JJ'), 1),
 (('darling', 'NN'), 1),
 (('tech', 'JJ'), 1)]

stopWord = [",", "be", "able"]

# Keep a token only if neither its POS tag nor the token itself is a stopword
word = []
for token, pos in taggedToken:
  if pos not in stopPos and token not in stopWord:
    word.append(token)

print(word)
=>
['It', 'present-day', 'darling', 'tech',
'world', '.', 'current', ...]

1.6 English Text Preprocessing: Putting It All Together

import nltk
nltk.download('averaged_perceptron_tagger') # POS tagging
nltk.download('words') # NER
nltk.download('maxent_ne_chunker') # NER
nltk.download('wordnet') # Lemmatization

from nltk.tokenize import TreebankWordTokenizer
token = TreebankWordTokenizer().tokenize("Obama loves fried chicken of KFC")
print('token:', token)

from nltk import pos_tag
TaggedToken = pos_tag(token)
print('tagged token:', TaggedToken)

from nltk.stem import PorterStemmer
ps = PorterStemmer()
print("loves -> " + ps.stem("loves"))
print("fried => " + ps.stem('fried'))

from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()
print("loves -> " + wl.lemmatize("loves"))
print("fried => " + wl.lemmatize('fried'))

# Stopword removal
StopPos = ['IN']
StopWord = ["fried"]

word = []
for tag in TaggedToken:
  if tag[1] not in StopPos:
    if tag[0] not in StopWord:
      word.append(wl.lemmatize(tag[0]))

print(word)
=>

token: ['Obama', 'loves', 'fried', 'chicken', 'of', 'KFC']
tagged token: [('Obama', 'NNP'), ('loves', 'VBZ'), 
('fried', 'VBN'), ('chicken', 'NN'), 
('of', 'IN'), ('KFC', 'NNP')]
loves -> love
fried => fri
loves -> love
fried => fried
['Obama', 'love', 'chicken', 'KFC']
