There are two types of word representation:
Local Representation (=Discrete Representation)
: Looks only at the word itself and maps it to a specific value
-> It cannot capture the nuance of the word
Distributed Representation (=Continuous Representation)
: Looks at the surroundings (context) of the word to represent it
-> It can capture the nuance of the word
Bag of Words (=BoW) is a Local Representation method.
It counts the frequency of each word and uses those counts to quantify the text.
BoW does not consider the order of words; it focuses only on their frequency.
Below is the simplified mechanism of BoW.
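The counting described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a naive lowercase-and-split tokenizer; the function name `bag_of_words` is hypothetical and not part of any library.

```python
from collections import Counter

def bag_of_words(document):
    """Return (word -> index vocabulary, frequency vector) for one document."""
    # Assumption: a naive tokenizer that lowercases and splits on whitespace.
    tokens = document.lower().split()

    # Step 1: give every distinct word a unique integer index.
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)

    # Step 2: record how often each word occurs, at the position of its index.
    counts = Counter(tokens)
    bow = [counts[word] for word in vocab]  # dicts keep insertion order
    return vocab, bow

vocab, bow = bag_of_words("the cat sat on the mat")
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(bow)    # [2, 1, 1, 1, 1]
```

Word order is lost: "the mat sat on the cat" would produce the same counts.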
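A Document-Term Matrix (DTM) is a combination of BoWs from different documents into one matrix, with one row per document and one column per word.
If we switch its rows and columns, we call it a Term-Document Matrix (TDM).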
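As a rough sketch of how several BoWs can be stacked into one matrix, the example below builds a DTM over a shared vocabulary and then transposes it into a TDM. The helper names `build_dtm` and `transpose`, the whitespace tokenizer, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter

def build_dtm(documents):
    """Build a document-term matrix: one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer (assumption)
    vocab = sorted({token for doc in tokenized for token in doc})
    dtm = []
    for doc in tokenized:
        counts = Counter(doc)
        dtm.append([counts[word] for word in vocab])
    return vocab, dtm

def transpose(matrix):
    """Swap rows and columns, turning a DTM into a TDM."""
    return [list(column) for column in zip(*matrix)]

docs = ["i like apples", "i like bananas", "bananas bananas are yellow"]
vocab, dtm = build_dtm(docs)
tdm = transpose(dtm)

print(vocab)  # ['apples', 'are', 'bananas', 'i', 'like', 'yellow']
for row in dtm:
    print(row)
# [1, 0, 0, 1, 1, 0]
# [0, 0, 1, 1, 1, 0]
# [0, 1, 2, 0, 0, 1]
```

Each row of `tdm` then describes one word across all documents, for example `tdm[2]` is the row for 'bananas'.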
DTM is simple, but it has the limitations below.
Sparse Representation
In a one-hot vector, the dimension of the vector equals the size of the vocabulary. Therefore, most of the values are 0, which increases computation cost and wastes memory.
DTM shares the same problem. The dimension of each document vector in a DTM is the size of the entire vocabulary.
We call a vector or matrix in which most of the values are 0 a 'sparse vector' or 'sparse matrix'.
In order to reduce the size of the vocabulary, it is important to preprocess and normalize words.
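To make the sparsity concrete, the sketch below builds a one-hot vector and measures the fraction of zeros in it. The helpers `one_hot` and `sparsity` and the six-word vocabulary are hypothetical, chosen only for illustration.

```python
def one_hot(word, vocab):
    """Return a one-hot vector whose length equals the vocabulary size."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

def sparsity(vector):
    """Fraction of entries that are exactly 0."""
    return sum(1 for value in vector if value == 0) / len(vector)

vocab = ["apples", "are", "bananas", "i", "like", "yellow"]  # toy vocabulary (assumption)
vector = one_hot("bananas", vocab)

print(vector)            # [0, 0, 1, 0, 0, 0]
print(sparsity(vector))  # 5 of the 6 entries are 0
```

With a realistic vocabulary of tens of thousands of words, all but one entry of each one-hot vector would be 0.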
Limitation of Frequency Based Method
A document contains both meaningful words and meaningless words. In order to prevent the confusion caused by meaningless words, we will a) remove stopwords, and b) give a different weight to each word.
We will use TF-IDF to do so.
If we can calculate the importance of each word, we will be able to take more information into account than with a plain DTM.
Note. TF-IDF is not always better than DTM.
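Step a), removing stopwords, can be sketched as a simple filter over the tokens. The stopword set below is a tiny hypothetical list written for this example; real projects usually rely on a larger, curated list.

```python
# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "an", "is", "on", "of"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

tokens = "the cat sat on the mat".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['cat', 'sat', 'mat']
```

Filtering before building the DTM also shrinks the vocabulary, which reduces the dimension of each document vector.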
TF-IDF uses word frequency and inverse document frequency to weigh words differently by their importance. We create the DTM first, then apply the TF-IDF weights to it.
It can be used for
d = document
t = word (term)
n = number of documents
tf = word frequency in DTM
tf(d,t) = frequency of word 't' in the document 'd'
df(t) = number of documents with the word 't'
idf(t) = log(n / (1 + df(t))), the log-scaled inverse of df(t)
tf-idf(d,t) = tf(d,t) × idf(t)
The reason why we use log:
a) If we don't use log, the value of IDF keeps growing in proportion to 'n', so it becomes very large when there are many documents
b) If we don't use log, rare words can receive excessive weight compared to common words
The reason why we add 1:
a) Without it, the denominator becomes 0 when a word does not appear in any document (df(t) = 0)
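The weighting can be sketched on a small DTM as follows. This example assumes idf(t) = log(n / (1 + df(t))) with the natural logarithm, which matches the log and the added 1 discussed above. The function name `tf_idf` and the sample matrix are hypothetical.

```python
import math

def tf_idf(dtm):
    """Turn a document-term matrix of raw counts into TF-IDF weights."""
    n = len(dtm)              # number of documents
    num_terms = len(dtm[0])   # vocabulary size

    # df(t): number of documents in which term t appears at least once.
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(num_terms)]

    # idf(t) = log(n / (1 + df(t))); natural log is an assumption of this sketch.
    idf = [math.log(n / (1 + df_t)) for df_t in df]

    # tf-idf(d, t) = tf(d, t) * idf(t)
    return [[row[j] * idf[j] for j in range(num_terms)] for row in dtm]

# Columns: apples, are, bananas, i, like, yellow (toy data, assumption)
dtm = [
    [1, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 2, 0, 0, 1],
]
weights = tf_idf(dtm)
for row in weights:
    print([round(value, 3) for value in row])
```

In this small example, words that appear in two of the three documents get idf = log(3 / 3) = 0, so only the words unique to one document keep a non-zero weight. With this exact formula a word that appears in every document would get a negative idf, which is one reason libraries often use slightly different smoothing.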
We can calculate document similarity using cosine similarity if words are quantified by BoW, DTM, TF-IDF, Word2Vec etc.
It signifies how similar the directions of two vectors are.
Using the cosine of the angle between two vectors,
a) if the directions of the two vectors are the same, cosine similarity is 1
b) if the angle between the two vectors is 90°, cosine similarity is 0
c) if the two vectors point in opposite directions (180°), cosine similarity is -1
Its range is -1 to 1.
Below is the formula for calculating cosine similarity for vectors A and B.
similarity = cos(θ) = (A · B) / (||A|| × ||B||) = Σ(Ai × Bi) / (sqrt(Σ Ai^2) × sqrt(Σ Bi^2))
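Because cosine similarity depends only on the direction of the vectors and not on their length, it is not biased by document length. With measures that depend on raw frequencies, a longer document has a higher chance of getting a higher similarity with other documents.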
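A minimal sketch of cosine similarity on plain Python lists is shown below. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions of this example, not something the notes prescribe.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch (assumption)
    return dot / (norm_a * norm_b)

doc1 = [1, 1, 0]   # toy frequency vectors (assumption)
doc2 = [2, 2, 0]   # same word proportions as doc1, but twice as long
doc3 = [0, 0, 3]   # shares no words with doc1

print(cosine_similarity(doc1, doc2))  # very close to 1: same direction
print(cosine_similarity(doc1, doc3))  # 0.0: orthogonal vectors
```

`doc2` is simply `doc1` repeated twice, and its similarity to `doc1` stays at (approximately) 1, which illustrates why document length does not distort the score.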
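Euclidean distance is less useful for measuring document similarity than Jaccard similarity or cosine similarity.
In n-dimensional space, when points 'p' and 'q' have the coordinates
p = (p1, p2, ..., pn), q = (q1, q2, ..., qn)
below is the formula for calculating the Euclidean distance
d(p, q) = sqrt((q1 - p1)^2 + (q2 - p2)^2 + ... + (qn - pn)^2)
If we assume a two-dimensional space, this is the straight-line distance between the two points in the coordinate plane:
d(p, q) = sqrt((q1 - p1)^2 + (q2 - p2)^2)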
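The distance between two points can be sketched as the square root of the summed squared coordinate differences. The function name `euclidean_distance` and the sample points are assumptions made for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Two-dimensional example: a 3-4-5 right triangle.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0

# The same function works for document vectors of any dimension.
print(euclidean_distance([1, 1, 0], [2, 2, 0]))
```

Unlike cosine similarity, the distance between `[1, 1, 0]` and `[2, 2, 0]` is not 0 even though both vectors point in the same direction, so document length affects the result.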
Jaccard similarity ranges from 0 to 1.
If two sets are the same, Jaccard similarity is 1.
If two sets are disjoint, Jaccard similarity is 0.
Below is the formula 'J' for calculating Jaccard similarity between set 'A' and set 'B'.
J(A, B) = |A ∩ B| / |A ∪ B|
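Applied to documents, the sets can be the sets of distinct words in each document. The sketch below assumes whitespace tokenization and treats two empty sets as identical; the function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sets are treated as identical (assumption)
    return len(set_a & set_b) / len(union)

doc1 = "i like apples".split()
doc2 = "i like bananas".split()

# Intersection: {'i', 'like'}; union: {'i', 'like', 'apples', 'bananas'}
print(jaccard_similarity(doc1, doc2))  # 0.5
```

Because sets ignore duplicates, word frequency plays no role here; only the presence or absence of each word matters.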