By segmenting words into meaningful subwords, we can mitigate the problems of out-of-vocabulary (OOV) words, rare words, and neologisms.
BPE was originally devised as a data compression algorithm.
It finds the most frequent pair of adjacent bytes (or characters) in the data and replaces it with a single new byte. This process repeats until there is no frequent byte pair left to merge.
In NLP, BPE is a subword segmentation algorithm.
It takes a bottom-up approach: it starts from a vocabulary of character units and repeatedly merges the most frequent adjacent pair into a new subword.
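The merge loop can be sketched in plain Python. This is a minimal illustration in the style of the classic BPE-for-NLP pseudocode; the toy corpus and the `</w>` end-of-word marker are assumptions made for the example:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break                      # nothing left to merge
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
```

Each iteration grows the subword vocabulary by one merged symbol; frequent endings such as "est" get merged early because they occur across several words.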
While BPE merges the pair with the highest frequency, the WordPiece tokenizer merges the pair that most increases the likelihood of the training corpus.
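One common way to express this criterion is to score each pair as freq(pair) / (freq(first) × freq(second)), so a merge is preferred when the pair co-occurs more often than its parts alone would suggest. A minimal sketch with a made-up toy corpus:

```python
from collections import Counter

def wordpiece_scores(vocab):
    """Score each adjacent pair by freq(pair) / (freq(a) * freq(b)).
    Merging the highest-scoring pair approximates the pair whose merge
    most increases the corpus likelihood under a unigram model."""
    pair_freq, sym_freq = Counter(), Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            sym_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {p: f / (sym_freq[p[0]] * sym_freq[p[1]])
            for p, f in pair_freq.items()}

# Hypothetical word frequencies, words pre-split into characters
vocab = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}
scores = wordpiece_scores(vocab)
best = max(scores, key=scores.get)   # ("g", "s"): rare parts, frequent together
```

Note that ("u", "g") occurs more often in absolute terms, but "u" and "g" are themselves very frequent, so the rarer pair ("g", "s") wins under this score.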
The unigram language model tokenizer computes a loss for each subword.
The loss of a subword is the amount by which the likelihood of the corpus decreases when that subword is removed from the vocabulary.
It then sorts the subwords by this loss and removes the 10–20% of tokens whose removal hurts the corpus likelihood the least.
This process repeats until the vocabulary shrinks to the target size.
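A rough sketch of one pruning step, assuming a toy vocabulary with made-up log-probabilities and Viterbi search for the best segmentation (single characters are kept so every word stays segmentable):

```python
import math

def best_segmentation(word, logp):
    """Viterbi: log-prob of the most probable segmentation of `word`."""
    n = len(word)
    best = [0.0] + [-math.inf] * n      # best[i] = best log-prob of word[:i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
    return best[n]

def corpus_loss(corpus, logp):
    """Negative log-likelihood of the corpus under the best segmentations."""
    return -sum(freq * best_segmentation(w, logp) for w, freq in corpus.items())

# Toy vocabulary with made-up log-probabilities
logp = {"h": -4.0, "u": -4.0, "g": -4.0, "s": -4.0,
        "hu": -3.0, "ug": -2.5, "hug": -1.5, "gs": -3.5}
corpus = {"hug": 10, "hugs": 5}

base = corpus_loss(corpus, logp)
losses = {}
for sub in [s for s in logp if len(s) > 1]:   # never prune single characters
    pruned = {k: v for k, v in logp.items() if k != sub}
    losses[sub] = corpus_loss(corpus, pruned) - base   # increase in loss

# Prune the subwords whose removal costs the least (here: the bottom ~20%)
to_remove = sorted(losses, key=losses.get)[: max(1, len(losses) // 5)]
```

Removing "hug" would hurt badly (both corpus words use it in their best segmentation), so it survives; subwords that never appear on a best path have zero loss and are pruned first.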
Google built SentencePiece on top of the BPE algorithm and the unigram language model tokenizer.
If the data must first be pre-tokenized into words before internal subword segmentation, the tokenizer is hard to adopt for every language, because languages like Korean (unlike English) are difficult to split into words.
A subword tokenizer that works without pretokenization could therefore be used for any language.
SentencePiece indeed tokenizes sentences without pretokenization.
SubwordTextEncoder is a subword tokenizer provided with TensorFlow.
It uses the WordPiece model.
Tokenizers is a package developed by the NLP startup Hugging Face.
It also treats frequent subword sequences as single tokens.
seq2seq is composed of two models: an encoder and a decoder.
The encoder reads every word of the input sentence in order and compresses all of the word information into a single vector, called the context vector.
Once the input sentence is compressed into the context vector, the encoder passes it to the decoder.
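As an illustration only, here is a toy encoder with random, untrained weights: it reads the words one by one through a vanilla RNN step and returns the final hidden state as the context vector. The vocabulary, dimensions, and weights are all made up for this sketch:

```python
import math, random

random.seed(0)
DIM = 4  # toy embedding/hidden size

def rnn_step(x, h, W, U):
    """One vanilla-RNN step: h' = tanh(W x + U h)."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(DIM)) +
                      sum(U[i][j] * h[j] for j in range(DIM)))
            for i in range(DIM)]

def rand_matrix():
    return [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]

# Hypothetical tiny vocabulary with random embeddings (untrained)
embed = {w: [random.uniform(-1, 1) for _ in range(DIM)]
         for w in ["i", "love", "nlp"]}
W_enc, U_enc = rand_matrix(), rand_matrix()

def encode(sentence):
    """Read the input words in order; the final hidden state is the context vector."""
    h = [0.0] * DIM
    for word in sentence:
        h = rnn_step(embed[word], h, W_enc, U_enc)
    return h

context = encode(["i", "love", "nlp"])
# The decoder would start from `context` (plus a start token) to generate output.
```

In a real model the embeddings and weight matrices are learned, and the decoder is a second RNN initialized with this context vector.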
BLEU evaluates machine translation performance by comparing the machine output against human reference translations based on n-gram overlap.
It is language-independent and fast to compute. Unlike perplexity (PPL), a higher BLEU score means better performance.
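A simplified sentence-level BLEU (one reference, uniform n-gram weights, clipped counts, brevity penalty, no smoothing) might look like this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU sketch: single reference, no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
         math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat is on the mat".split()
score = bleu("the cat sat on the mat".split(), ref)
```

The clipping step is what makes the precision "modified": a candidate cannot score by repeating one reference word over and over, and the brevity penalty stops trivially short candidates from winning on precision alone.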