numeric or word representation — related terms:
feature extraction
(feature) encoding
(word) embedding
word representation
numerical representation
vectorization
- one-hot encoding : encodes categorical data
  each category becomes a 1-dimensional array (vector) with a single 1
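A minimal sketch of one-hot encoding, using a hypothetical three-word vocabulary:

```python
def one_hot(vocab, word):
    """Return a 1-D list with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["apple", "banana", "cherry"]  # assumed example categories
print(one_hot(vocab, "banana"))  # [0, 1, 0]
```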
- bag-of-words (BoW) : frequency-based word representation
  ex) BoW: [1, 1, 1, 2] (per-word counts over a fixed vocabulary)
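A bag-of-words vector can be sketched as follows; the sentence and vocabulary here are assumed examples:

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Count each vocabulary word's occurrences in the token list."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "mat"]           # fixed vocabulary
tokens = "the cat sat the mat".split()          # hypothetical sentence
print(bag_of_words(tokens, vocab))  # [2, 1, 1, 1]
```

Note that word order is discarded: only frequencies survive, which is exactly the limitation that motivates the embedding methods below.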
- TF-IDF : frequency-based word representation, weighted across documents
  Term Frequency X Inverse Document Frequency
  (Term == Word) document frequency = the number of documents in which the term appears
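The definition above can be sketched directly; the corpus below is an assumed toy example, and the IDF uses the simple unsmoothed form log(N / df):

```python
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"]]  # hypothetical tiny corpus

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency within one document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # inverse document frequency (no smoothing)
    return tf * idf

# "the" appears in every document, so its IDF (and TF-IDF) is 0:
print(tf_idf("the", docs[0], docs))  # 0.0
print(tf_idf("cat", docs[0], docs))  # positive: "cat" is in only 2 of 3 docs
```

This shows why TF-IDF is called "document based": a word common to all documents is weighted down to zero, while rarer words are weighted up.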
- Word2Vec : (dense) embedding
  - a neural network model
  - learns word embeddings from a large dataset (a 1.6 billion word corpus)
  - "meaning" == similar words get similar vectors
  - "king" - "man" + "woman" ≈ "queen"
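The analogy is plain vector arithmetic on the embeddings. A toy illustration with hand-made 2-D vectors (not trained embeddings; dimension 0 loosely encodes "royalty", dimension 1 "gender"):

```python
# Hand-crafted toy vectors for illustration only — real Word2Vec vectors
# are learned from data and have hundreds of dimensions.
emb = {
    "king":  [1.0, 1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
    "queen": [1.0, 0.0],
}

result = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
print(result == emb["queen"])  # True for these toy vectors
```

With trained embeddings the result is not exactly equal to "queen"; one takes the nearest neighbor of the resulting vector instead.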
Model : CNN, RNN, LSTM, GRU
Methodology :
sequence-to-sequence
encoder-decoder structure
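The encoder-decoder pattern can be sketched with a toy recurrent "cell" (a plain function standing in for an LSTM/GRU cell) — everything here is an assumed simplification, not a real model:

```python
def cell(state, x):
    # Hypothetical recurrent step: mix the previous state with the input.
    return 0.5 * state + 0.5 * x

def encode(inputs):
    state = 0.0
    for x in inputs:          # read the whole source sequence
        state = cell(state, x)
    return state              # final state = fixed-size context vector

def decode(context, steps):
    state, outputs = context, []
    for _ in range(steps):    # emit one output per step
        state = cell(state, state)
        outputs.append(state)
    return outputs

ctx = encode([1.0, 2.0, 3.0])   # compress a sequence into one context value
print(decode(ctx, 2))           # generate an output sequence from the context
```

The key idea sequence-to-sequence adds on top of plain RNNs is this split: the encoder compresses a variable-length input into a fixed-size context, and the decoder unrolls that context into a variable-length output.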
- BERT
  - Transfer learning
  - Pre-training (self-supervised learning) followed by fine-tuning (supervised learning)