Introduction to NLP (Wk.9)

송종빈 · January 7, 2022

Ch. 9 Word Embedding

9-1) Word Embedding

Sparse Representation

One-hot vectors produced by one-hot encoding are called sparse vectors.
In one-hot encoding, the index of the word we want to express is set to 1 and every other index is 0; this is called sparse representation.

The problem with this approach is that the dimension of the vector grows without bound as the number of words grows.
The larger the vocabulary, the higher the dimension of each vector.
This wastes space.
Not only one-hot vectors but also other sparse representation methods such as DTM share this problem.
Sparse representation also cannot capture the meaning of a word.
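
As a minimal sketch of the point above (the toy vocabulary and sentence are made up for illustration), one-hot encoding in plain Python looks like this:

```python
# Minimal one-hot encoding sketch; the tiny vocabulary is made up for illustration.
vocab = ["dog", "cat", "apple", "banana"]          # vocabulary of size 4
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vector = [0] * len(vocab)                      # dimension == vocabulary size
    vector[word_to_index[word]] = 1                # only one position is 1
    return vector

print(one_hot("cat"))    # [0, 1, 0, 0]
# Adding more words to the vocabulary makes every vector longer,
# which is the space waste described above.
```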

Dense Representation

Dense representation, unlike sparse representation, does not set the dimension of the vector to the size of the vocabulary. Instead, every word's vector representation is given a dimension chosen by the user, and the values are real numbers rather than only 0 or 1.

Word Embedding

The method of expressing words as dense vectors is called word embedding.
A dense vector produced by word embedding is called an embedding vector.
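
A rough sketch of the difference, assuming a user-chosen embedding dimension of 4 and randomly initialized vectors (in practice these values are learned during training):

```python
import numpy as np

vocab = ["dog", "cat", "apple", "banana"]
embedding_dim = 4                                   # chosen by the user, not by vocabulary size
embedding_matrix = np.random.randn(len(vocab), embedding_dim)

# Each word maps to a dense vector of real values, regardless of vocabulary size.
word_to_index = {word: i for i, word in enumerate(vocab)}
print(embedding_matrix[word_to_index["cat"]])       # 4 real numbers, e.g. [ 0.12 -1.3 ... ]
```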

9-2) Word2Vec

Sparse Representation

See the Sparse Representation section above.

Distributed Representation

Sparse representation cannot capture the similarity between words.
Distributed representation is based on the distributional hypothesis.
It assumes that words appearing in similar contexts have similar meanings.
Distributed representation learns from text under this hypothesis and distributes the meaning of a word across multiple dimensions.

As a result, word vectors are reduced to a much lower dimension than in sparse representation.

Continuous Bag of Words (CBOW)

The Word2Vec method has two ways of learning: CBOW and Skip-gram.
CBOW predicts the center word from the context words.
Skip-gram predicts the context words from the center word.

The window is the range of words that the model uses for prediction.
The training dataset is built by sliding the window over the text so that the center word and its context words change; this is called a 'sliding window'.
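
A minimal sketch of how a sliding window generates (context, center) training pairs for CBOW; the sentence and window size here are just an example:

```python
# Generate (context words, center word) pairs with a sliding window of size 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for center_pos, center in enumerate(sentence):
    context = [sentence[i]
               for i in range(center_pos - window, center_pos + window + 1)
               if 0 <= i < len(sentence) and i != center_pos]
    pairs.append((context, center))                 # CBOW: context -> center

print(pairs[2])   # (['the', 'quick', 'fox', 'jumps'], 'brown')
# For Skip-gram the same window yields (center, context word) pairs instead.
```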

Word2Vec is not a deep learning model but a shallow neural network with only one hidden layer.
Its hidden layer has no activation function and simply performs a lookup-table operation, so it is also called the projection layer.
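
A minimal numpy sketch of the projection layer, with toy dimensions; it only illustrates the lookup-and-average step of CBOW, not the full training loop:

```python
import numpy as np

vocab_size, embedding_dim = 7, 5
W_in = np.random.randn(vocab_size, embedding_dim)   # input embeddings (the lookup table)
W_out = np.random.randn(embedding_dim, vocab_size)  # output weights

context_indices = [0, 1, 3, 4]                      # indices of the context words
projection = W_in[context_indices].mean(axis=0)     # lookup + average, no activation

scores = projection @ W_out                         # logits over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()       # softmax over the center word
print(probs.argmax())                               # predicted center word index
```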

Skip-Gram

Unlike CBOW, Skip-gram does not have the step of averaging the context word vectors.
Skip-gram is known to perform better than CBOW in general.
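
A hedged example using gensim (assuming gensim 4.x, where the size parameter is called vector_size); the toy sentences are made up:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=0 -> CBOW, sg=1 -> Skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)          # (50,)
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
```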

NNLM

NNLM adopted the concept of word embedding to capture similarity between word vectors.
Word2Vec improves on NNLM in both training speed and accuracy.

9-4) Skip-Gram with Negative Sampling (SGNS)

Negative Sampling

If the vocabulary contains more than ten thousand words, Word2Vec becomes too heavy a model to train.
This is because Word2Vec updates the embedding vector of every word during backpropagation.
Negative sampling labels the actual context words as positive and a few randomly sampled words as negative, turning the task into binary classification.
This is far more efficient than the original Word2Vec with a full softmax.
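
A minimal numpy sketch of the SGNS objective for one (center, context) pair with a few negative samples; the dimensions and word indices are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, dim = 100, 16
W_center = np.random.randn(vocab_size, dim) * 0.01   # center-word embeddings
W_context = np.random.randn(vocab_size, dim) * 0.01  # context-word embeddings

center, positive = 3, 7                # observed (center, context) pair -> label 1
negatives = [15, 42, 88]               # randomly sampled words -> label 0

pos_score = sigmoid(W_center[center] @ W_context[positive])
neg_scores = sigmoid(-(W_center[center] @ W_context[negatives].T))

# Binary cross-entropy over only the sampled rows,
# instead of a softmax over the whole vocabulary.
loss = -np.log(pos_score) - np.log(neg_scores).sum()
print(loss)
```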

9-5) Global Vectors for Word Representation, GloVe

Introduction to GloVe

It uses both count-based and prediction-based methods.

Window-Based Co-occurrence Matrix

Co-occurrence Probability

Loss Function

The loss function is designed so that the dot product of the embedded center word vector and the context word vector corresponds to the (log) co-occurrence probability of the two words over the entire corpus.
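
For reference, the GloVe objective from the original paper, where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the center and context word vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that limits the influence of very frequent pairs:

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2
$$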

9-6) FastText

Introduction to FastText

FastText was developed by Facebook.
It is an extension of Word2Vec.
The biggest difference between Word2Vec and FastText is the following:
Word2Vec treats a word as an unsplittable unit.
FastText considers a word to be made up of several subwords.
Therefore, FastText learns at the subword level.
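
A minimal sketch of how a word is broken into character n-grams (here n = 3, with '<' and '>' marking the word boundaries); the word itself is just an example:

```python
def char_ngrams(word, n=3):
    marked = "<" + word + ">"                        # boundary markers
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]                          # the whole word is also kept as a subword

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<apple>']
# The word vector is the sum of the vectors of these subwords.
```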

Out of Vocabulary, OOV

FastText can compute the similarity between an out-of-vocabulary word and other words as long as they share subwords.
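
A hedged gensim example (assuming gensim 4.x); because vectors are built from subwords, a word absent from the training data can still get a vector:

```python
from gensim.models import FastText

sentences = [["eating", "apples", "is", "healthy"],
             ["she", "was", "eating", "an", "apple"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1)

# "eatings" never appears in the corpus, but it shares subwords with "eating",
# so FastText can still produce a vector and compute similarities for it.
print(model.wv["eatings"].shape)                     # (50,)
print(model.wv.similarity("eating", "eatings"))
```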

Rare Word

FastText also works well for rare words and is robust on corpora that contain a lot of noise.

9-8) Pre-Trained Word Embedding

Keras Embedding Layer

Embedding Layer is Lookup Table
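
A minimal Keras sketch of the point above: the Embedding layer is simply a trainable lookup table that maps integer word indices to dense vectors (the dimensions here are toy values):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, embedding_dim = 100, 8
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# A batch of one padded sequence of 5 word indices.
sample = np.array([[4, 20, 7, 0, 0]])
output = embedding(sample)
print(output.shape)                        # (1, 5, 8): each index is looked up in the table
print(embedding.get_weights()[0].shape)    # (100, 8): the lookup table itself
```

For pre-trained word embeddings, this table can be initialized with vectors trained elsewhere (for example Word2Vec or GloVe) and frozen by setting trainable=False, instead of being learned from scratch.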

9-9) Embeddings from Language Model, ELMo

Introduction to ELMo

ELMo uses a pre-trained language model.
Even when a word is spelled the same, embedding it differently depending on its context can improve the performance of natural language processing.
This idea is called contextualized word embedding.

Bidirectional Language Model, biLM

9-10) Embedding Visualization
