Word Embeddings

Jane's study note. · November 30, 2022

NLP (Natural Language Processing)

[18-1] Applying GloVe Vectors to a CNN Sentence Classification Model

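The goal of this exercise is to build a model that uses a Convolutional Neural Network (CNN) to classify a sentence into one of several categories. We will also learn how to plug pretrained word vectors into the model.
For training data, we use the politeness dataset compiled at Stanford University (the Stanford Politeness Corpus).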

import os
import pandas as pd
import numpy as np
from collections import Counter
import nltk
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence  # public API path instead of the private tensorflow.python module
from tensorflow import keras

nltk.download('punkt')

if not os.path.exists("Stanford_politeness_corpus.zip"):
  !wget http://www.cs.cornell.edu/~cristian/Politeness_files/Stanford_politeness_corpus.zip

if not os.path.exists("Stanford_politeness_corpus/wikipedia.annotated.csv"):
  !unzip Stanford_politeness_corpus.zip
  
def load_data(data_file):
  data = pd.read_csv(data_file)

  # Only use the top quartile as polite, and bottom quartile as impolite. Discard the rest.
  quantiles = data["Normalized Score"].quantile([0.25, 0.5, 0.75])
  print(quantiles)

  for i in range(len(data)):
    score = data.loc[i, "Normalized Score"]
    if score <= quantiles[0.25]:
      # Bottom quartile (impolite).
      data.loc[i, "Normalized Score"] = 0
    elif score >= quantiles[0.75]:
      # Top quartile (polite).
      data.loc[i, "Normalized Score"] = 1
    else:
      # Neutral.
      data.loc[i, "Normalized Score"] = 2

  data["Normalized Score"] = data["Normalized Score"].astype(int)

  # Discard neutral examples.
  data = data[data["Normalized Score"] < 2]
  data = data.sample(frac=1).reset_index(drop=True)

  return data
  
data = load_data("Stanford_politeness_corpus/wikipedia.annotated.csv")
pd.set_option('display.max_columns', None)

print(data.head())
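
The row-by-row loop in load_data is easy to follow, but the same quartile labeling can also be written in vectorized form. The sketch below is only an illustration on a toy DataFrame: the scores are made up, and the `label` column name is an assumption that does not appear in the original code.

```python
import numpy as np
import pandas as pd

# Made-up scores standing in for the "Normalized Score" column.
toy = pd.DataFrame(
    {"Normalized Score": [-1.2, -0.4, -0.1, 0.0, 0.2, 0.5, 0.9, 1.5]}
)

scores = toy["Normalized Score"].to_numpy()
low = toy["Normalized Score"].quantile(0.25)
high = toy["Normalized Score"].quantile(0.75)

# 0 = impolite (bottom quartile), 1 = polite (top quartile), 2 = neutral.
toy["label"] = np.select([scores <= low, scores >= high], [0, 1], default=2)

# Discard the neutral examples, as load_data does.
filtered = toy[toy["label"] < 2].reset_index(drop=True)
print(filtered)
```

On the real corpus this avoids one `data.loc` assignment per row, which matters mostly for larger files.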

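The next step is to build a vocabulary.

To feed sentences into a neural network we have to turn them into numbers, and the vocabulary is what maps words to numbers and numbers back to words.

For fast lookups, it is common to use a dictionary data structure here.

1. Lowercase the sentences and tokenize them (using the word_tokenize function from the nltk.tokenize package)
2. Count how often each token occurs in the whole dataset (using the Counter class from the collections package)
3. Select the vocab_size most frequent words (using Counter's most_common function)
4. Assign a unique number to each word. Index 0 is reserved for the "<PAD>" token and index 1 for the "<OOV>" token
5. Build a dictionary for token -> number conversion (assigned to the word_index variable) and a dictionary for number -> token conversion (assigned to the word_inverted_index variable)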

vocab_size = 5000
# we assign the first indices in the vocabulary to special tokens that we use
# for padding, and for indicating unknown words
pad_id = 0
oov_id = 1
index_offset = 1

def make_vocab(sentences):
  word_counter = Counter()

  for sent in sentences:
    tokens = word_tokenize(sent.lower())
    word_counter.update(tokens)
  
  most_common = word_counter.most_common()
  print("Most frequent words:")
  for k, v in most_common[:10]:
    print(k, ": ", v)
  
  vocab = {
      '<PAD>': pad_id,
      '<OOV>': oov_id
  }
  for i, (word, cnt) in enumerate(most_common, start=index_offset+1):
    vocab[word] = i
    if len(vocab) >= vocab_size:
      break
  
  return vocab
  
sentences = data["Request"].tolist()
word_index = make_vocab(sentences)
word_inverted_index = {v:k for k, v in word_index.items()}

print("\nVocabulary:")
for i in range(0, 10):
  print(i, word_inverted_index[i])
  
print("\nVocabulary size: ", len(word_index))

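Let's check that the vocabulary was built correctly.

If the two dictionaries were built properly and assigned to the word_index and word_inverted_index variables, you should see a sentence converted to numbers and then converted back to text. Words that fall outside the vocabulary come back as "<OOV>".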

def index_to_text(indexes):
  return ' '.join([word_inverted_index[i] for i in indexes])
  
def text_to_index(text):
  tokens = word_tokenize(text.lower())
  indexes = []
  for tok in tokens:
    if tok in word_index:
      indexes.append(word_index[tok])
    else:
      indexes.append(oov_id)
      
  return indexes

print("Original: ", sentences[0])
ids = text_to_index(sentences[0])
print("Text -> numbers: ", ids)
print("Numbers -> text: ", index_to_text(ids))
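
Note that the round trip is only lossless for words that made it into the vocabulary. A minimal sketch of this, assuming a tiny made-up corpus, a whitespace split in place of word_tokenize, and hypothetical `toy_*` names that are not part of the original code:

```python
from collections import Counter

# Hypothetical toy corpus; str.split() stands in for word_tokenize.
toy_sentences = ["please help", "could you help", "you help me"]
toy_vocab_size = 4  # counts <PAD> and <OOV>, so only two real words fit

counter = Counter()
for sent in toy_sentences:
    counter.update(sent.lower().split())

toy_word_index = {"<PAD>": 0, "<OOV>": 1}
for i, (word, _) in enumerate(counter.most_common(), start=2):
    toy_word_index[word] = i
    if len(toy_word_index) >= toy_vocab_size:
        break
toy_inverted = {v: k for k, v in toy_word_index.items()}

def toy_text_to_index(text):
    # Unknown tokens map to index 1 (<OOV>).
    return [toy_word_index.get(tok, 1) for tok in text.lower().split()]

def toy_index_to_text(indexes):
    return " ".join(toy_inverted[i] for i in indexes)

ids = toy_text_to_index("Could you please help me")
print(ids)                     # [1, 3, 1, 2, 1]
print(toy_index_to_text(ids))  # <OOV> you <OOV> help <OOV>
```

With vocab_size = 5000 on the real data far fewer words are lost, but rare words and typos still collapse into the same "<OOV>" index.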

Next, we reshape the numeric sentences so they can be used as training data.

1. Pad or truncate every sentence to the same length (using the pad_sequences function from the Keras preprocessing.sequence module)
2. Set aside part of the data (10%) as test data

x_variable = [text_to_index(sent) for sent in sentences]

sentence_size = 200
x_padded = sequence.pad_sequences(x_variable,
                                 maxlen=sentence_size,
                                 truncating='post',
                                 padding='post',
                                 value=pad_id)

n_test = len(data) // 10
test_inputs = x_padded[:n_test]
train_inputs = x_padded[n_test:]

ys = np.array(data["Normalized Score"].tolist())
test_labels = ys[:n_test]
train_labels = ys[n_test:]

print("test_inputs shape: ", test_inputs.shape)
print("train_inputs shape: ", train_inputs.shape)
print("test_labels shape: ", test_labels.shape)
print("train_labels shape: ", train_labels.shape)
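
To make the effect of padding='post' and truncating='post' concrete, here is a NumPy-only sketch of the same idea. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation, and it only covers the 'post' case used above.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Cut each sequence at maxlen and fill the tail with `value`."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        kept = seq[:maxlen]           # truncating='post': drop tokens from the end
        out[row, :len(kept)] = kept   # padding='post': leftover slots keep `value`
    return out

toy = [[5, 8, 2], [7, 3, 9, 4, 6, 1]]
padded = pad_post(toy, maxlen=4)
print(padded)
# [[5 8 2 0]
#  [7 3 9 4]]
```

The short sentence gets pad_id appended, and the long one loses its last tokens, so every row ends up with exactly maxlen entries.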

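Now it is time to design the model.

Let's build a CNN with keras.Sequential. Because a Sequential model stacks layers in a single linear chain, every filter in the convolution layer has to share the same size.

Reference functions: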

keras.layers.Embedding

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

keras.layers.Conv1D

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D

keras.layers.GlobalMaxPool1D

https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalMaxPool1D

keras.layers.Dense

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 50),
    keras.layers.Conv1D(32, 3, padding="same", activation=tf.nn.relu),
    keras.layers.GlobalMaxPool1D(),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
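
For comparison, filters of several sizes can be combined with the Keras functional API instead of Sequential. This is only a sketch under assumptions: the function name `build_multi_kernel_cnn`, the kernel sizes (2, 3, 4), and the default arguments are illustrative choices, not part of the original exercise.

```python
import numpy as np
from tensorflow import keras

def build_multi_kernel_cnn(vocab_size=5000, embed_dim=50, sentence_size=200,
                           kernel_sizes=(2, 3, 4), filters=32):
    """Hypothetical CNN with one Conv1D branch per kernel size."""
    inputs = keras.Input(shape=(sentence_size,), dtype="int32")
    x = keras.layers.Embedding(vocab_size, embed_dim)(inputs)

    pooled = []
    for k in kernel_sizes:
        # Each branch sees the same embeddings but a different n-gram width.
        conv = keras.layers.Conv1D(filters, k, padding="same",
                                   activation="relu")(x)
        pooled.append(keras.layers.GlobalMaxPool1D()(conv))

    merged = keras.layers.Concatenate()(pooled)
    outputs = keras.layers.Dense(2, activation="softmax")(merged)
    return keras.Model(inputs, outputs)

multi_model = build_multi_kernel_cnn()
# Three all-padding dummy sentences, just to check the output shape.
probs = multi_model(np.zeros((3, 200), dtype="int32"))
print(probs.shape)
```

The model is kept in a separate variable so that the Sequential `model` used in the rest of this post is left untouched.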
  
# The functions below visualize the training results and measure performance.
def plot_loss(history):
  plt.figure(figsize=(6,5))
  val = plt.plot(history.epoch, history.history['val_loss'],
                 '--', label='Test')
  plt.plot(history.epoch, history.history['loss'], color=val[0].get_color(),
           label='Train')

  plt.xlabel('Epochs')
  plt.ylabel("Loss")
  plt.legend()

  plt.xlim([0,max(history.epoch)])
  
def eval_model(model):
  test_loss, test_acc = model.evaluate(test_inputs, test_labels)
  print('Test accuracy:', test_acc)  
  
# Let's train the model we just built.
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_inputs,
          train_labels,
          epochs=10,
          validation_data=(test_inputs, test_labels)
         )

plot_loss(history)
eval_model(model)

Pretrained word vectors

This time, we apply pretrained word vectors to the model we built.

We will use GloVe vectors as the word vectors.

First we download the vector file and unzip it.

Then let's look at how the file is laid out.

if not os.path.exists('glove.6B.zip'):
    ! wget http://nlp.stanford.edu/data/glove.6B.zip
if not os.path.exists('glove.6B.50d.txt'):
    ! unzip glove.6B.zip
    
! head glove.6B.50d.txt

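Let's load the GloVe vectors and use them to initialize the embedding matrix.

1. Read the GloVe file and split each line into the word (the first token) and the numbers that make up its vector (all remaining tokens)
2. Convert the numbers into a NumPy array (using numpy's asarray function)
3. Build a dictionary that links each word to its vector (word -> vector)
4. Randomly initialize an embedding matrix for all words (a NumPy matrix of shape vocab_size x 50)
5. In the embedding matrix, overwrite only the rows of words that have a GloVe vector with that GloVe vector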

def load_glove_embeddings(path):
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            w = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embeddings[w] = vectors

    embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, 50))
    num_loaded = 0
    for w, i in word_index.items():
        v = embeddings.get(w)
        if v is not None and i < vocab_size:
            embedding_matrix[i] = v
            num_loaded += 1
    print('Successfully loaded pretrained embeddings for '
          f'{num_loaded}/{vocab_size} words.')
    embedding_matrix = embedding_matrix.astype(np.float32)
    return embedding_matrix

embedding_matrix = load_glove_embeddings('glove.6B.50d.txt')
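
Since glove.6B.zip is a large download, it can help to check the parsing and fill-in logic on a tiny in-memory file first. The sketch below assumes made-up 3-dimensional vectors and a hypothetical five-word vocabulary. It mirrors the steps of load_glove_embeddings but is not a replacement for it.

```python
import io
import numpy as np

# Made-up vectors in the GloVe text format: "<word> <v1> <v2> ...".
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "please -0.5 0.0 0.5\n"
)
toy_word_index = {"<PAD>": 0, "<OOV>": 1, "the": 2, "thx": 3, "please": 4}
dim = 3

# Steps 1-3: parse each line into a word -> vector dictionary.
pretrained = {}
for line in fake_glove:
    values = line.strip().split()
    pretrained[values[0]] = np.asarray(values[1:], dtype="float32")

# Step 4: random initialization for every row.
rng = np.random.default_rng(0)
matrix = rng.uniform(-1, 1, size=(len(toy_word_index), dim)).astype(np.float32)

# Step 5: overwrite only the rows that have a pretrained vector.
found = []
for word, idx in toy_word_index.items():
    vec = pretrained.get(word)
    if vec is not None:
        matrix[idx] = vec
        found.append(word)

print(found)      # ['the', 'please']
print(matrix[4])  # [-0.5  0.   0.5]
```

Rows for "<PAD>", "<OOV>", and "thx" keep their random values, which is also what happens in the full-size matrix to words missing from GloVe.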

glove_init = keras.initializers.Constant(embedding_matrix)

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 50, embeddings_initializer=glove_init),
    keras.layers.Conv1D(32, 3, padding="same", activation=tf.nn.relu),
    keras.layers.GlobalMaxPool1D(),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_inputs,
          train_labels,
          epochs=10,
          validation_data=(test_inputs, test_labels)
         )

plot_loss(history)
eval_model(model)
