By segmenting words into meaningful subwords, we can mitigate the problems of out-of-vocabulary (OOV) words, rare words, and neologisms.
BPE was originally devised as a data compression algorithm.
It finds the most frequent pair of adjacent bytes (or characters) in the data and replaces it with a single new byte. This process repeats until there is no frequent byte pair left to merge.
In NLP, BPE is a subword segmentation algorithm.
It takes a bottom-up approach: it starts from a vocabulary of character units and repeatedly merges the most frequent adjacent pair into a new subword.
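The merge loop can be sketched in plain Python. This is a minimal illustration in the style of the classic BPE-for-NLP pseudocode; the toy corpus and the `</w>` end-of-word marker are assumptions made for the example:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break                      # nothing left to merge
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
```

Each iteration grows the subword vocabulary by one merged symbol; frequent endings such as "est" get merged early because they occur across several words.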
While BPE merges the pair with the highest frequency, the WordPiece tokenizer merges the pair that most increases the likelihood of the training corpus.
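One common way to express this criterion is to score each pair as freq(pair) / (freq(first) × freq(second)), so a merge is preferred when the pair co-occurs more often than its parts alone would suggest. A minimal sketch with a made-up toy corpus:

```python
from collections import Counter

def wordpiece_scores(vocab):
    """Score each adjacent pair by freq(pair) / (freq(a) * freq(b)).
    Merging the highest-scoring pair approximates the pair whose merge
    most increases the corpus likelihood under a unigram model."""
    pair_freq, sym_freq = Counter(), Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            sym_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {p: f / (sym_freq[p[0]] * sym_freq[p[1]])
            for p, f in pair_freq.items()}

# Hypothetical word frequencies, words pre-split into characters
vocab = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}
scores = wordpiece_scores(vocab)
best = max(scores, key=scores.get)   # ("g", "s"): rare parts, frequent together
```

Note that ("u", "g") occurs more often in absolute terms, but "u" and "g" are themselves very frequent, so the rarer pair ("g", "s") wins under this score.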
The unigram language model tokenizer computes a loss for each subword.
The loss of a subword is the amount by which the likelihood of the corpus decreases when that subword is removed from the vocabulary.
It then sorts the subwords by this loss and removes the 10–20% of tokens whose removal hurts the corpus likelihood the least.
This process repeats until the vocabulary shrinks to the target size.
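A rough sketch of one pruning step, assuming a toy vocabulary with made-up log-probabilities and Viterbi search for the best segmentation (single characters are kept so every word stays segmentable):

```python
import math

def best_segmentation(word, logp):
    """Viterbi: log-prob of the most probable segmentation of `word`."""
    n = len(word)
    best = [0.0] + [-math.inf] * n      # best[i] = best log-prob of word[:i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
    return best[n]

def corpus_loss(corpus, logp):
    """Negative log-likelihood of the corpus under the best segmentations."""
    return -sum(freq * best_segmentation(w, logp) for w, freq in corpus.items())

# Toy vocabulary with made-up log-probabilities
logp = {"h": -4.0, "u": -4.0, "g": -4.0, "s": -4.0,
        "hu": -3.0, "ug": -2.5, "hug": -1.5, "gs": -3.5}
corpus = {"hug": 10, "hugs": 5}

base = corpus_loss(corpus, logp)
losses = {}
for sub in [s for s in logp if len(s) > 1]:   # never prune single characters
    pruned = {k: v for k, v in logp.items() if k != sub}
    losses[sub] = corpus_loss(corpus, pruned) - base   # increase in loss

# Prune the subwords whose removal costs the least (here: the bottom ~20%)
to_remove = sorted(losses, key=losses.get)[: max(1, len(losses) // 5)]
```

Removing "hug" would hurt badly (both corpus words use it in their best segmentation), so it survives; subwords that never appear on a best path have zero loss and are pruned first.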
Google built SentencePiece on top of the BPE algorithm and the unigram language model tokenizer.
If the data must first be pre-tokenized into words before internal subword segmentation, the tokenizer is hard to adopt for every language, because languages like Korean (unlike English) are difficult to split into words.
A subword tokenizer that works without pretokenization could therefore be used for any language.
SentencePiece indeed tokenizes sentences without pretokenization.
SubwordTextEncoder is a subword tokenizer provided with TensorFlow.
It uses the WordPiece model.
Tokenizers is a package developed by the NLP startup Hugging Face.
It also treats frequent subword sequences as single tokens.
seq2seq is composed of two models: an encoder and a decoder.
The encoder reads every word of the input sentence in order and compresses all of the word information into a single vector, called the context vector.
Once the input sentence is compressed into the context vector, the encoder passes it to the decoder.
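As an illustration only, here is a toy encoder with random, untrained weights: it reads the words one by one through a vanilla RNN step and returns the final hidden state as the context vector. The vocabulary, dimensions, and weights are all made up for this sketch:

```python
import math, random

random.seed(0)
DIM = 4  # toy embedding/hidden size

def rnn_step(x, h, W, U):
    """One vanilla-RNN step: h' = tanh(W x + U h)."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(DIM)) +
                      sum(U[i][j] * h[j] for j in range(DIM)))
            for i in range(DIM)]

def rand_matrix():
    return [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]

# Hypothetical tiny vocabulary with random embeddings (untrained)
embed = {w: [random.uniform(-1, 1) for _ in range(DIM)]
         for w in ["i", "love", "nlp"]}
W_enc, U_enc = rand_matrix(), rand_matrix()

def encode(sentence):
    """Read the input words in order; the final hidden state is the context vector."""
    h = [0.0] * DIM
    for word in sentence:
        h = rnn_step(embed[word], h, W_enc, U_enc)
    return h

context = encode(["i", "love", "nlp"])
# The decoder would start from `context` (plus a start token) to generate output.
```

In a real model the embeddings and weight matrices are learned, and the decoder is a second RNN initialized with this context vector.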
BLEU evaluates machine translation performance by comparing the machine output against human reference translations based on n-gram overlap.
It is language-independent and fast to compute. Unlike perplexity (PPL), a higher BLEU score means better performance.
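A simplified sentence-level BLEU (one reference, uniform n-gram weights, clipped counts, brevity penalty, no smoothing) might look like this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU sketch: single reference, no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
         math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat is on the mat".split()
score = bleu("the cat sat on the mat".split(), ref)
```

The clipping step is what makes the precision "modified": a candidate cannot score by repeating one reference word over and over, and the brevity penalty stops trivially short candidates from winning on precision alone.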