Text Preprocessing and Word Embeddings

이재경 · January 9, 2023

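Counting Word Frequencies in the Data

Before any data analysis, it is important to examine the characteristics of the data so that the model can be built more accurately. For text, looking at how often each word appears in the data is one way to understand those characteristics.

In this exercise, we look at per-word frequencies in the IMDB dataset, a collection of movie reviews.

Instructions
Open text.txt, which contains the IMDB dataset. Each line in the file corresponds to one review.

While reading the text data, build a dictionary named word_counter whose keys are words and whose values are their frequencies.

Each line in the file ends with a newline character (\n). Use rstrip() to remove it from the end of each line.
Use split() to break each line on whitespace and extract the words.
Using word_counter, store the total frequency of all words in text.txt in the variable total.

Using word_counter, store the words that occur 100 or more times in text.txt in the list up_five.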

word_counter = dict()

# Build a dictionary whose keys are words and whose values are frequencies.
with open('text.txt', 'r') as f:
    for line in f:
        # Strip the trailing newline, then split the line on whitespace.
        for word in line.rstrip().split():
            if word not in word_counter:
                word_counter[word] = 1
            else:
                word_counter[word] += 1

print(word_counter)

# Compute the total frequency of all words in the text file.
total = sum(word_counter.values())

# Store the words that occur 100 or more times as a list.
up_five = [w for w, c in word_counter.items() if c >= 100]

print(total)
print(up_five)
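
As an alternative to updating the dictionary by hand, the same counts can be sketched with collections.Counter from the standard library. The review lines below are made-up stand-ins for text.txt, and the threshold of 2 replaces 100 only to suit the tiny sample.

```python
from collections import Counter

# Hypothetical sample lines standing in for text.txt (one review per line).
sample_lines = [
    "the movie was great\n",
    "the plot was thin but the cast was great\n",
]

word_counter = Counter()
for line in sample_lines:
    # rstrip() drops the trailing newline; split() breaks on whitespace.
    word_counter.update(line.rstrip().split())

# Total number of word occurrences across all lines.
total = sum(word_counter.values())

# Words at or above a frequency threshold (2 here; the exercise uses 100).
threshold = 2
frequent = [w for w, c in word_counter.items() if c >= threshold]

print(total)
print(frequent)
```

Counter is a dict subclass, so word_counter.items() and word_counter["the"] behave the same way as with the plain dictionary used in the exercise.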

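Exploring Data Through Text Preprocessing

Before any text analysis, it is important to check which words the text contains.
However, simply splitting the text on whitespace to extract words causes several problems.

The reason is that the same word can be written in several ways. For example, the word computer may be capitalized as Computer depending on its position in a sentence, or written with punctuation attached, as in computer. with a trailing period.

In this exercise, we explore the data after preprocessing the text by lowercasing it and removing special characters.

Instructions
While loading the movie reviews, convert every review to lowercase and remove all digits and special characters from each word, keeping only alphabetic characters.

string.lower(): converts the string to all lowercase.
regex.sub('', string): removes every character in the string that matches the regular expression held in the regex variable (replaces it with '').
Store the preprocessed words and their frequencies in the word_counter dictionary.
Store the frequency of the word the in text.txt in the variable count.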

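import re

word_counter = dict()
# Matches any character that is not an ASCII letter.
regex = re.compile('[^a-zA-Z]')

# Build the dictionary from lowercased text with digits and special characters removed.
with open('text.txt', 'r') as f:  # Load the IMDB dataset the same way as in the first exercise.
    for line in f:
        words = line.rstrip().lower().split()
        for w in words:
            ww = regex.sub('', w)
            if not ww:
                # Skip tokens that contained no letters (e.g. "123" or "-").
                continue
            if ww not in word_counter:
                word_counter[ww] = 1
            else:
                word_counter[ww] += 1

# Check the frequency of the word "the".
count = word_counter["the"]

print(count)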
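
To make the effect of this preprocessing concrete, here is a minimal sketch on a made-up sample string rather than text.txt. The helper name normalize is an assumption introduced only for this illustration.

```python
import re

# Matches any character that is not an ASCII letter.
non_letters = re.compile('[^a-zA-Z]')

def normalize(token):
    """Lowercase a token and strip digits and special characters."""
    return non_letters.sub('', token.lower())

# Hypothetical sample text; the exercise reads text.txt instead.
sample = "Computer, computer. COMPUTER! 2023 -"

raw_tokens = sample.split()
clean_tokens = [normalize(t) for t in raw_tokens]
# Tokens with no letters become empty strings and are dropped.
clean_tokens = [t for t in clean_tokens if t]

print(len(set(raw_tokens)))  # number of distinct surface forms before cleaning
print(clean_tokens)
```

The five distinct raw tokens collapse into a single word form, which is why frequencies counted after preprocessing are more meaningful than those counted on raw whitespace-separated tokens.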

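Stopwords and Stemming with NLTK

NLTK (Natural Language Toolkit) is a Python library that makes it faster and easier to write text preprocessing and exploration code.

In this exercise, we use NLTK to tokenize sentences, remove stopwords, and apply stemming, storing the preprocessed result at each step.

Instructions
Store the default English stopwords provided by NLTK in the variable stopwords.

Add the new stopwords in the new_keywords list to the stopwords variable defined in the first step, and store the result in updated_stopwords.

Tokenize each sentence in test_sentences into words. While tokenizing, remove every word that is a stopword, and append the result for each sentence to the tokenized_word list. (This exercise uses NLTK's word_tokenize function to tokenize the input string.)

Using PorterStemmer, apply stemming to the first sentence in tokenized_word, which holds the tokenized test_sentences, and append the results to the stemmed_sent list.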

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Requires the NLTK 'stopwords' corpus and the 'punkt' tokenizer data
# ('punkt_tab' on newer NLTK releases); install them once with nltk.download(...).

test_sentences = [
    "i have looked forward to seeing this since i first saw it amoungst her work",
    "this is a superb movie suitable for all but the very youngest",
    "i first saw this movie when I was a little kid and fell in love with it at once",
    "i am sooo tired but the show must go on",
]

# Store the English stopwords.
# Note: this rebinds the name stopwords from the imported corpus reader to a plain list.
stopwords = stopwords.words('english')

print(stopwords)

# Add the new stopwords and store the updated list.
new_keywords = ['noone', 'sooo', 'thereafter', 'beyond', 'amoungst', 'among']
updated_stopwords = stopwords + new_keywords

print(updated_stopwords)

# Tokenize test_sentences, drop the updated stopwords, and store the result in tokenized_word.
tokenized_word = []

for sentence in test_sentences:
    tokens = word_tokenize(sentence)
    # The comparison is case-sensitive, so the uppercase "I" in the third sentence is kept.
    new = []
    for t in tokens:
        if t not in updated_stopwords:
            new.append(t)
    tokenized_word.append(new)

print(tokenized_word)

# Apply stemming to the first tokenized sentence.
stemmed_sent = []
stemmer = PorterStemmer()
for w in tokenized_word[0]:
    stemmed_sent.append(stemmer.stem(w))

print(stemmed_sent)
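
One detail of stopword filtering is easy to miss: membership tests are case-sensitive. The sketch below uses only the standard library, with a tiny made-up stopword set and a plain whitespace split in place of NLTK's stopword list and word_tokenize, to show the difference between filtering on the raw token and on its lowercased form.

```python
# Hypothetical mini stopword set; NLTK's English list is much longer.
# A set is used so that each membership test is a constant-time lookup.
stop_set = {"i", "a", "was", "this", "when"}

sentence = "i first saw this movie when I was a little kid"
tokens = sentence.split()  # plain whitespace split instead of word_tokenize

# Case-sensitive filter: the uppercase "I" slips through.
case_sensitive = [t for t in tokens if t not in stop_set]

# Case-insensitive filter: compare the lowercased token instead.
case_insensitive = [t for t in tokens if t.lower() not in stop_set]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering depends on the task; for the exercise data, lowercasing each review first (as in the previous exercise) would make the two filters agree.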

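Measuring Word Similarity with word2vec
word2vec learns word embedding vectors with a neural network. In this exercise, we train word2vec using the Python library gensim.

The training data is the Emotions dataset for NLP, which consists of sentences expressing personal emotions.

Instructions
The load_data function that loads the Emotions dataset for NLP is already written.

Using the text data stored in input_data, train a word2vec model with window, the context length around each word, set to 2 and a vector dimension of 300. (Set epochs to 10.)

Store the 10 words most similar to happy in the variable similar_happy.

Store the 10 words most similar to sad in the variable similar_sad.

Store the similarity between the embedding vectors of good and bad in the variable similar_good_bad.

Store the similarity between the embedding vectors of sad and lonely in the variable similar_sad_lonely.

Store the embedding vector of happy in the variable wv_happy.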

import pandas as pd
from gensim.models import Word2Vec

# Loads the Emotions dataset for NLP and returns one list of tokens per sentence.
def load_data(filepath):
    data = pd.read_csv(filepath, delimiter=';', header=None, names=['sentence','emotion'])
    data = data['sentence']

    gensim_input = []
    for text in data:
        gensim_input.append(text.rstrip().split())
    return gensim_input

input_data = load_data("emotions_train.txt")

# Train the word2vec model.
m = Word2Vec(window=2, vector_size=300)
m.build_vocab(input_data)
m.train(input_data, total_examples=m.corpus_count, epochs=10)

# Check the 10 words most similar to happy.
similar_happy = m.wv.most_similar("happy", topn=10)

print(similar_happy)

# Check the 10 words most similar to sad.
similar_sad = m.wv.most_similar("sad", topn=10)

print(similar_sad)

# Check the similarity between the embedding vectors of good and bad.
similar_good_bad = m.wv.similarity("good", "bad")

print(similar_good_bad)

# Check the similarity between the embedding vectors of sad and lonely.
similar_sad_lonely = m.wv.similarity("sad", "lonely")

print(similar_sad_lonely)

# Check the embedding vector of happy.
wv_happy = m.wv["happy"]

print(wv_happy)
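
The similarity score returned by gensim for two words is the cosine similarity of their embedding vectors. As a rough sketch of what that number means, the snippet below computes it by hand with the standard library. The three-dimensional vectors are made up for illustration and are not trained embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (real word2vec vectors here would have 300 dimensions).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, different length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no shared direction
```

Because the measure depends only on direction, words that appear in similar contexts can score high even when they are opposites in meaning, which is worth keeping in mind when reading the good and bad result.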

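Generating Word Embedding Vectors with fastText
fastText addresses a weakness of word2vec, the out-of-vocabulary (OOV) word problem, by building word vectors from character n-grams. In this exercise, we train fastText using the Python library gensim.

The training data is the Emotions dataset for NLP, which consists of sentences expressing personal emotions.

Instructions
Using the text data stored in input_data, train a fastText model with window, the context length around each word, set to 3, a vector dimension of 100, and min_count, the minimum word frequency, set to 10.

Set epochs to 10.
Store the 10 words most similar to day in the variable similar_day.

Store the 10 words most similar to night in the variable similar_night.

Store the embedding vector of elllllllice in the variable wv_elice.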

from gensim.models import FastText
import pandas as pd

# Loads the Emotions dataset for NLP and returns one list of tokens per sentence.
def load_data(filepath):
    data = pd.read_csv(filepath, delimiter=';', header=None, names=['sentence','emotion'])
    data = data['sentence']

    gensim_input = []
    for text in data:
        gensim_input.append(text.rstrip().split())

    return gensim_input

input_data = load_data("emotions_train.txt")

# Train the fastText model.
ft_model = FastText(min_count=10, window=3, vector_size=100)
ft_model.build_vocab(input_data)
ft_model.train(input_data, total_examples=ft_model.corpus_count, epochs=10)

# Check the 10 words most similar to day.
similar_day = ft_model.wv.most_similar("day", topn=10)

print(similar_day)

# Check the 10 words most similar to night.
similar_night = ft_model.wv.most_similar("night", topn=10)

print(similar_night)

# Check the embedding vector of elllllllice.
# fastText can return a vector even for a word missing from the vocabulary,
# because the vector is composed from the word's character n-grams.
wv_elice = ft_model.wv['elllllllice']

print(wv_elice)
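
To see why fastText can embed a word it never saw, it helps to look at the character n-grams it works with. The sketch below is a simplified illustration, not gensim's implementation: it wraps the word in < and > boundary markers and lists n-grams of length 3 to 6 (gensim's default min_n and max_n), but skips the hashing of n-grams into buckets that the real model performs. The function name char_ngrams and the comparison word elice are assumptions made for this example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# The 3-grams of "where", including the boundary markers.
print(char_ngrams("where", 3, 3))

# N-grams shared between a hypothetical known word and the misspelled one.
shared = set(char_ngrams("elice")) & set(char_ngrams("elllllllice"))
print(sorted(shared))
```

Since the vector of an unseen word is built from the vectors of its n-grams, a misspelling that shares n-grams with known words still lands somewhere meaningful in the embedding space instead of raising a missing-key error.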
