Distributed Representation

seongyong · June 16, 2021

fasttext word2vec

NLP

What I learned

Embedding

Representing a word as a fixed-length vector, that is, a vector with a fixed number of dimensions.

Embedding methods include count-based representation and distributed representation.

Distributed Representation

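A method in which the vector for a target word is determined by the words that surround it. Word2Vec and fastText belong to this family.

Word vectors are defined this way because of the distributional hypothesis: 'words that appear in similar positions have similar meanings.'

Under this hypothesis, a word's vector is learned from the distribution of its surrounding words. The resulting dense vector, in which meaning is spread across many dimensions rather than tied to a single one-hot position, is called a distributed representation.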
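
To make the distributional hypothesis a bit more concrete, here is a minimal sketch that collects the context words around each word in a toy corpus. The corpus, the window size, and the helper name `context_counts` are all made up for illustration; the point is only that words used in similar positions (coffee, tea) share most of their contexts, while an unrelated word (car) shares few.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration only).
corpus = [
    "i drink hot coffee every morning".split(),
    "i drink hot tea every morning".split(),
    "i drive a car every day".split(),
]

def context_counts(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

counts = context_counts(corpus)
shared_tea = set(counts["coffee"]) & set(counts["tea"])  # contexts shared with "tea"
shared_car = set(counts["coffee"]) & set(counts["car"])  # contexts shared with "car"
print(sorted(shared_tea))
print(sorted(shared_car))
```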

1. Word2vec

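CBOW (Continuous Bag-of-Words): predicts the center word from information about the surrounding words
Skip-gram: predicts the surrounding words from information about the center word

Skip-gram is generally reported to perform better than CBOW. One plausible explanation is that it goes through many more updates during backpropagation, since each center word produces one training example per context word. The computational cost of skip-gram is correspondingly higher.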
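
The difference in the amount of training can be seen by generating the training examples for both variants. This is only a sketch: the function names `skipgram_pairs` and `cbow_pairs`, the sentence, and the window size are my own assumptions, not something taken from gensim.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: one example per context word of each center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_list, center) pairs: one example per center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
print(len(sg), len(cb))  # skip-gram yields more examples than CBOW for the same text
```

For the same five-word sentence, skip-gram produces one example for every (center, context) combination, while CBOW produces only one example per center word.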

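The input is the one-hot encoded vector of a given word, and the label is the one-hot encoding of a word near it. Each input and each label is a single word, and this is repeated for every word. If a word that was already trained on appears again, the new label is learned on top of the weights trained so far. The resulting weights of the hidden layer (the input-to-hidden weight matrix) are the word embeddings, so the number of nodes in the hidden layer is the dimension of the embedding vector.

The hidden layer has no activation function, and the output layer uses softmax.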
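
The forward pass described above can be sketched in plain Python. Everything here is a toy assumption (a 4-word vocabulary, 3 dimensions, random weights, the names `W_in` and `W_out`); it is meant to show that multiplying a one-hot vector by the input weights simply selects one row, which is why that row is the word's embedding.

```python
import math
import random

random.seed(0)
vocab = ["king", "queen", "man", "woman"]
V, N = len(vocab), 3  # vocabulary size and embedding dimension (toy values)

# W_in: V x N input-to-hidden weights, W_out: N x V hidden-to-output weights.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = one_hot(vocab.index("queen"), V)
hidden = matvec(x, W_in)                # no activation: equals row 1 of W_in
probs = softmax(matvec(hidden, W_out))  # predicted distribution over context words
print(hidden)
print(probs)
```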

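Improving training efficiency
- Sub-sampling: randomly drops occurrences of very frequent words
- Negative sampling: updates only the true context word and a few sampled "negative" words instead of the full softmax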
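
A rough sketch of both tricks, using the formulas from the original Word2Vec paper: the discard probability 1 - sqrt(t / f(w)) for sub-sampling, and a noise distribution proportional to count^0.75 for negative sampling. The toy corpus, the function names, and the threshold `t=0.1` are assumptions for illustration (real corpora use a much smaller threshold such as 1e-5), and actual implementations may use a slightly different variant of the sub-sampling formula.

```python
import math
from collections import Counter

tokens = "the cat sat on the mat the dog sat on the log".split()
freq = Counter(tokens)
total = sum(freq.values())

def discard_prob(word, t):
    """Sub-sampling: probability of dropping one occurrence of `word`.
    P = 1 - sqrt(t / f(word)), clipped at 0, where f is the relative frequency."""
    f = freq[word] / total
    return max(0.0, 1.0 - math.sqrt(t / f))

def negative_sampling_dist(power=0.75):
    """Negative sampling: noise distribution proportional to count ** power."""
    weights = {w: c ** power for w, c in freq.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}

noise = negative_sampling_dist()
print(discard_prob("the", t=0.1), discard_prob("cat", t=0.1))
print(noise["the"], freq["the"] / total)
```

The frequent word "the" is dropped often while the rare word "cat" is never dropped, and the 0.75 power shifts some sampling probability from frequent words to rare ones.
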

Practice

Using the gensim package

#! pip install gensim --upgrade

import gensim.downloader as api
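wv = api.load('word2vec-google-news-300')  # download word2vec vectors trained on the Google News corpus

wv.index_to_key  # list of words, ordered by index

wv['king']  # embedding vector of the word 'king'

wv['cameron']  # a word that is not in the Google News vocabulary raises KeyError

wv.similarity(w1, w2)  # cosine similarity between two words; w1 and w2 stand for any in-vocabulary strings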
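
The similarity here is cosine similarity. A minimal sketch of the computation, with made-up 2-dimensional vectors standing in for real 300-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Only the direction of the vectors matters, not their length, so a vector and a scaled copy of it are maximally similar.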

Find the 5 words most similar to the sum of the car and minivan vectors; vectors can also be subtracted with the negative argument

for i, (word, similarity) in enumerate(wv.most_similar(positive=['car', 'minivan'], topn=5)):
    print(f"Top {i+1} : {word}, {similarity}")
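
Roughly speaking, the positive/negative lookup adds and subtracts unit-length word vectors and then ranks the remaining words by cosine similarity to the result. The sketch below reimplements that idea on made-up 2-dimensional vectors; the `toy` dictionary and this standalone `most_similar` function are my own illustration, not gensim's code, and gensim's internals may differ in detail.

```python
import math

toy = {  # made-up 2-d vectors, for illustration only
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.2],
    "apple": [-0.8, -0.5],
}

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def most_similar(vectors, positive, negative=(), topn=5):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:   # add the positive words
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:   # subtract the negative words
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are not candidates
    scored = [(w, sum(q * x for q, x in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:topn]

result = most_similar(toy, positive=["king", "woman"], negative=["man"], topn=1)
print(result)
```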

Use the .doesnt_match method to pick the word least related to the others

wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car'])

2. fastText

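A newer embedding method that adds character-based (subword) embeddings to the Word2Vec approach

Why was fastText devised?
- The OOV (out-of-vocabulary) problem: Word2Vec cannot embed words that never appeared in the corpus. Another weakness is that rare words receive little training, so their embedding vectors tend to be poor. fastText addresses both with character-level embeddings.

Character-level embedding
- Taking the Korean word 맞벌이 (dual-income) as an example, the model learns the spelling of words such as
  - 맞선, 맞절, 맞대다
  - 벌다, 벌어, 벌고
  - 먹이, 깊이
  and can use them to infer the meaning of 맞벌이. In this way fastText also learns the information carried by spelling. Each word is split into character n-grams of length 3 to 6, and embeddings are learned for them. Because the algorithm is implemented very efficiently, training time does not differ drastically from Word2Vec.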
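
The splitting into character n-grams can be sketched as follows. The function name `char_ngrams` is hypothetical; the `<` and `>` boundary markers follow the fastText paper, and the default lengths 3 to 6 match the range mentioned above.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("night", min_n=3, max_n=3)
shared = set(char_ngrams("night")) & set(char_ngrams("nights"))
print(trigrams)
print(sorted(shared))
```

Because night and nights share many n-grams, a word missing from the vocabulary can still borrow information from words with similar spelling.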

Practice

Using gensim

from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Set file names for train and test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)

Check whether night and nights are each in the vocabulary: night is present, nights is not

ft = model.wv
print(ft)

#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print(f"night => {'night' in ft.key_to_index}")
print(f"nights => {'nights' in ft.key_to_index}")

fastText can embed even nights, which is not in the vocabulary

print(ft['nights'])
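
How a vector for an unseen word can be assembled from its n-grams is sketched below. This is a simplified toy, not gensim's implementation: the table size, the dimension, the random vectors, and the use of `zlib.crc32` as the hash are all assumptions (the real library uses its own hash function and a far larger bucket table), and the n-gram vectors are averaged here.

```python
import random
import zlib

random.seed(0)
DIM, BUCKETS = 4, 50  # toy sizes (assumptions)
# Stand-in for the trained n-gram vectors: one random vector per hash bucket.
bucket_vectors = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def char_ngrams(word, min_n=3, max_n=6):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def oov_vector(word):
    """Average the bucket vectors of the word's character n-grams."""
    grams = char_ngrams(word)
    rows = [bucket_vectors[zlib.crc32(g.encode("utf-8")) % BUCKETS] for g in grams]
    return [sum(col) / len(rows) for col in zip(*rows)]

vec = oov_vector("nights")  # works even though "nights" was never stored as a word
print(vec)
```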

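Drawback of fastText: the embedding vectors put somewhat more weight on a word's form (spelling) than on its meaning

print(ft.doesnt_match("night noon fight morning".split()))

Result: noon

Semantically the odd one out is fight, but noon is selected instead.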

Sentence classification with embedding vectors

Baseline: sum the vectors of all the words in a sentence and take their average
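
The baseline itself is simple enough to sketch without any library. The `toy_vectors` dictionary and the function name `sentence_vector` are assumptions for illustration; words without a vector are skipped in this sketch.

```python
def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(col) / len(known) for col in zip(*known)]

toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}  # made-up 2-d vectors
sent_vec = sentence_vector(["good", "movie", "zzz"], toy_vectors, dim=2)
print(sent_vec)
```

The averaged vector is then fed to a classifier, which is what the Keras model later in this post does with GlobalAveragePooling1D followed by a Dense layer.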

Import modules and prepare data

import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, Flatten
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.datasets import imdb

tf.random.set_seed(42)  # fix the random seed

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=20000)  # train/test split; each review is stored as a list of word indices

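Look up words by index and implement a function that rebuilds the sentence

word_index = imdb.get_word_index()  # mapping of word -> index
reverse_word_index = {value: key for key, value in word_index.items()}  # inverted to index -> word

def decode_review(text):
    # load_data() shifts every index by 3 by default (0: padding, 1: start, 2: unknown),
    # so subtract 3 before the lookup; unmapped indices become '?'
    return ' '.join(reverse_word_index.get(i - 3, '?') for i in text)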

Build the documents and tokenize

sentences = [decode_review(idx) for idx in X_train]  # store the decoded documents

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

vocab_size = len(tokenizer.word_index) + 1  # +1 because Tokenizer indices start at 1 and index 0 is reserved for padding

Convert the documents into sequences of tokens

X_encoded = tokenizer.texts_to_sequences(sentences)  # sentences expressed as the tokenizer's token indices

max_len=max(len(sent) for sent in X_encoded)
print(max_len)

print(f'Mean length of train set: {np.mean([len(sent) for sent in X_train], dtype=int)}')

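Padding

X_train = pad_sequences(X_encoded, maxlen=400, padding='post')
# The sequences have different lengths, so padding makes them all the same length.
# Pass padding='post' to fill zeros at the end (the default fills the front).
# Pass maxlen=<number> to fix the padded length; longer sequences are truncated.
# Pass value=<number> to pad with something other than 0.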

y_train=np.array(y_train)
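
What the padding call does to each sequence can be mimicked in plain Python. The function `pad_post` is a hypothetical stand-in, assuming pad_sequences' default truncating='pre' (sequences longer than maxlen lose tokens from the front).

```python
def pad_post(seqs, maxlen, value=0):
    """Pad at the end with `value`; cut sequences longer than maxlen from the front."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

example = pad_post([[1, 2], [1, 2, 3, 4, 5]], maxlen=4)
print(example)
```

Note that with these settings a review longer than 400 tokens keeps its last 400 tokens, not its first 400.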

Create an embedding matrix with one row for each word in the vocab

embedding_matrix = np.zeros((vocab_size, 300))

print(np.shape(embedding_matrix))

Store the embedding vector for each word found in wv

def get_vector(word):
    """
    Return the embedding vector if the word is in the word2vec vocabulary, otherwise None.
    """
    if word in wv:
        return wv[word]
    else:
        return None
 
for word, i in tokenizer.word_index.items():
    temp = get_vector(word)
    if temp is not None:
        embedding_matrix[i] = temp
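
The same loop on a toy vocabulary shows which rows end up filled. The dictionaries below are made up for illustration, and plain lists stand in for the NumPy matrix: row 0 (reserved for padding) and the rows of words without a pretrained vector stay all zeros.

```python
toy_word_index = {"good": 1, "movie": 2, "zzzz": 3}          # Tokenizer-style, starts at 1
toy_pretrained = {"good": [0.1, 0.2], "movie": [0.3, 0.4]}   # made-up 2-d vectors
dim = 2
toy_vocab_size = len(toy_word_index) + 1                     # +1 for the padding index 0

toy_matrix = [[0.0] * dim for _ in range(toy_vocab_size)]
for word, i in toy_word_index.items():
    vec = toy_pretrained.get(word)  # None when the word has no pretrained vector
    if vec is not None:
        toy_matrix[i] = list(vec)

covered = sum(1 for w in toy_word_index if w in toy_pretrained)
print(toy_matrix)
print(f"{covered}/{len(toy_word_index)} words have a pretrained vector")
```

Checking the coverage like this is a cheap way to see how many words the frozen embedding layer will treat as all-zero vectors.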

Train the model

model = Sequential()
model.add(Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=400, trainable=False))  # input_length matches maxlen=400 used for padding
model.add(GlobalAveragePooling1D())  # average the word vectors of the input
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(X_train, y_train, batch_size=64, epochs=20, validation_split=0.2)

Match the test set to the same format

test_sentences = [decode_review(idx) for idx in X_test]

X_test_encoded = tokenizer.texts_to_sequences(test_sentences)

X_test=pad_sequences(X_test_encoded, maxlen=400, padding='post')
y_test=np.array(y_test)