[NLP] torchtext, spaCy를 이용하여 Vocab 만들기

nkw011·2022년 7월 30일

NLP spaCy torchtext

Natural Language Processing(NLP) 기초 내용, 실습, 논문 정리

목록 보기

7/15

torchtext, spaCy를 활용하여 Vocab을 만드는 실습을 해볼 것이다.

spaCy의 Tokenizer를 활용해서 vocab을 직접 구현해본다.
torchtext의 메소드를 활용해서 vocab을 만들어본다.

1. WikiText-2 데이터 불러오기

torchtext의 데이터셋인 WikiText-2를 사용하기 위해 데이터를 불러온다.

1.1. torchdata 설치

torchtext에서 데이터셋을 불러오려면 먼저 torchdata를 설치해야한다.
PyTorch의 version에 주의하여 적합한 version을 설치하면 된다.

torchdata Version Compatibility

torchdata를 설치할 때 아래와 같은 ERROR가 발생할 수 있다.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

해결 방법

folium==0.2.1을 먼저 설치한 다음 torchdata를 재설치한다.
- torchdata가 잘못된 방법으로 했을 때 설치된 경우 torchdata를 삭제한 후 folium부터 다시 설치한다.
colab을 사용하고 있는 경우 런타임을 재실행한 다음 순서대로 설치하면된다.

!pip install folium==0.2.1

!pip install torchdata==0.4.0

!pip show torchdata

1.2. WikiText-2 데이터셋 불러오기

torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
- split을 활용하여 필요한 데이터셋만 가져올 수 있다.
  - 여기서는 train 데이터만 활용한다.
  - split='train'으로 train 데이터만 불러올 수 있다.

from torchtext.datasets import WikiText2

train = WikiText2(split='train')

데이터를 보면 \<unk>를 확인할 수 있는데 unknown token을 가리킨다.

for i, text in enumerate(train):
    if i == 5: break
    print(text)

 = Valkyria Chronicles III = 

 

 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 

 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .

2. spaCy Tokenizer를 이용하여 vocab 직접 구현하기

2.1. spaCy 데이터 설치

!python -m spacy download en_core_web_sm

2.2. Tokenizer 불러오기

import spacy
from spacy.symbols import ORTH

\<unk>라는 speical token이 있기 때문에 spacy tokenizer가 \<unk>를 하나의 token으로 인식할 수 있도록 special case를 추가해주어야한다.

Adding special case tokenization rules, Pattern format
ORTH를 이용하여 special_case라는 list에 special token을 추가한다.
tokenizer에 add_special_case를 이용하여 '\<unk>'를 만날 때 \<unk>로 토큰화를 진행할 수 있도록 한다.

spacy_en = spacy.load('en_core_web_sm')
special_case = [{ORTH:'<unk>'}]
spacy_en.tokenizer.add_special_case('<unk>', special_case)

# TEST
text = 'I use <unk> things.'

for token in spacy_en.tokenizer(text):
    print(token)

I
use
<unk>
things
.

2.3. Vocab 클래스 구현하기

Vocab 클래스는 다음과 같은 역할을 한다.

token2id를 이용하여 token이 어떤 id와 매핑되는지 저장한다.
id2token을 이용하여 id가 어떤 token과 매핑되는지 저장한다.
encode를 이용하여 문장을 토큰화하고 id값으로 바꾼다.
decode를 이용하여 id들이 있을 때 적합한 토큰들로 바꾼이후 원래 문장으로 복원한다.
special token과 매핑되는 id를 클래스 변수로 저장해서 사용한다.
- special token으로 unknown token을 사용한다.
- task에 따라 \<sos>, \<eos> 같은 special token을 추가로 사용할 수 있다.

※ spacy tokenizer의 반환값인 token을 사용하는 것보다 token.text가 더 정확한 결과를 반환한다.

from collections import Counter
from tqdm.notebook import tqdm

class Vocab:
    UNK_TOKEN = '<unk>'
    UNK_TOKEN_ID = 0

    def __init__(self, data, tokenizer, min_freq):
        self.data = [text for text in data]
        self.en = tokenizer
        self.id2token = list()
        self.token2id = dict()

        self.build_vocab(min_freq)

    def build_vocab(self, min_freq):

        counter = Counter()
        for tokens in tqdm(map(self.en.tokenizer, self.data), total=len(self.data), desc='Building Vocab'):
            counter.update(map(lambda x: x.text, tokens))

        self.id2token = [Vocab.UNK_TOKEN] + [ token for token, freq in counter.items() if freq >= min_freq and token != Vocab.UNK_TOKEN]
        self.token2id = { token:i for i, token in enumerate(self.id2token)}
    
    def encode(self, text):
        encoded = [self.token2id.get(token.text, UNK_TOKEN_ID) for token in self.en.tokenizer(text)]
        return encoded

    def decode(self, sequence):
        decoded = " ".join([self.id2token[token_id] for token_id in sequence])
        return decoded

corpus = Vocab(train, spacy_en, 3)

Building Vocab:   0%|          | 0/36718 [00:00<?, ?it/s]

len(corpus.token2id), len(corpus.id2token)

(33242, 33242)

corpus.token2id['<unk>'], corpus.id2token[0]

(0, '<unk>')

train_text = [text for text in train]

encoded = corpus.encode(train_text[4])

encoded

print(f"decode  : {corpus.decode(encoded)}")
print(f"original: {train_text[4]}")

decode  :   The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May ' n . 

original:  The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .

3. torchtext를 이용해서 vocab 만들기

torchtext의 get_tokenizer와 build_vocab_from_iterator를 사용하여 비교적 쉽게 vocab을 구성할 수 있다.

build_vocab_from_iterator()
- iterator를 이용하여 vocab을 만든다.
- parameter
  - iterator: vocab을 만들때 사용되는 iterator
  - min_freq: vocab에 포함되기 위한 최소 빈도수
  - specials: special token의 list
- torchtext.vocab.Vocab 클래스를 반환한다.

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

torch_tokenizer = get_tokenizer('basic_english')

torch_tokenizer('I use <unk> thing.')

['i', 'use', '<unk>', 'thing', '.']

torch_vocab = build_vocab_from_iterator(map(torch_tokenizer, train), min_freq=3, specials=['<unk>'])

build_vocab_from_iterator()은 torchtext.vocab.Vocab 클래스의 object를 반환한다. 반환된 Vocab object를 이용하여 아래와 같은 일들을 할 수 있다.

get_stoi(): token2id를 반환한다.
get_itos(): id2token을 반환한다.
__getitem__(): token에 매핑되는 id값을 반환한다.
lookup_token(): id에 매핑되는 token을 반환한다.
forward(): encode 한다.(문장을 토큰화하고 id값으로 바꾼다.)
- nn.Module의 forward()처럼 작동한다.
lookup_indices(): encode 한다.(문장을 토큰화하고 id값으로 바꾸어 바꾼다.)
lookup_tokens(): decode 한다.(id들을 적합한 토큰들로 바꾼다.)

# get_stoi(), get_itos()
p_token2id = torch_vocab.get_stoi()
p_id2token = torch_vocab.get_itos()

print(len(p_token2id.keys()), len(p_id2token))
print(p_token2id['<unk>'], p_id2token[0])

28782 28782
0 <unk>

# __getitem__, lookup_token()
torch_vocab['<unk>'], torch_vocab.lookup_token(0)

(0, '<unk>')

# 토큰화 테스트 문장
train_text = [text for text in train]

# forward(), lookup_indices()
encoded1 = torch_vocab(torch_tokenizer(train_text[4]))
encoded2 = torch_vocab.lookup_indices(torch_tokenizer(train_text[4]))

print(f"encoded1: {encoded1}")
print(f"encoded2: {encoded2}")

encoded1: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
encoded2: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]

# lookup_tokens()
decoded = torch_vocab.lookup_tokens(encoded1)
decoded_sentence = " ".join(decoded)
print(f"decoded: {decoded_sentence}")
print(f"original: {train_text[4]}")

decoded: the game began development in 2010 , carrying over a large portion of the work done on valkyria chronicles ii . while it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . character designer <unk> honjou and composer hitoshi sakimoto both returned from previous entries , along with valkyria chronicles ii director takeshi ozawa . a large team of writers handled the script . the game ' s opening theme was sung by may ' n .
original:  The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .

nkw011

Deep Dive into Development (GitHub Blog: https://nkw011.github.io/)

이전 포스트

[NLP] NLTK, spaCy, torchtext를 이용하여 영어 토큰화(English Tokenization)작업 수행하기

다음 포스트