Count-based Word Representation

정현준 · November 22, 2022

BOW CountVectorizer DTM TF-IDF TfidfVectorizer similarity NLP

1. CountVectorizer

import spacy
nlp = spacy.load("en_core_web_sm")

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text
text = """In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word,
which helps to adjust for the fact that some words appear more frequently in general.
tf–idf is one of the most popular term-weighting schemes today.
A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf."""

# Split the text on newlines and store the resulting list (roughly one sentence per line).
sentences_lst = text.split('\n')

# Create the CountVectorizer (max_features = maximum number of words in the vocabulary).
vect = CountVectorizer(stop_words='english', max_features=100)

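# Build the vocabulary and transform the sentences into a document-term matrix (DTM).
dtm_count = vect.fit_transform(sentences_lst)

# Convert to a DataFrame (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()).
dtm_count = pd.DataFrame(dtm_count.toarray(), columns=vect.get_feature_names_out())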

2. BoW : TfidfVectorizer(Term Frequency - Inverse Document Frequency)

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorizer. max_features=15 keeps the table small.
tfidf = TfidfVectorizer(stop_words='english', max_features=15)

# Fit, then build the DTM (a tf-idf value is computed for each document-word pair).
dtm_tfidf = tfidf.fit_transform(sentences_lst)

dtm_tfidf = pd.DataFrame(dtm_tfidf.toarray(), columns=tfidf.get_feature_names_out())
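As a rough check of what TfidfVectorizer computes, the sketch below rebuilds the values by hand on a made-up corpus (`docs` is an assumption for illustration). With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the weight is the raw term count times idf = ln((1 + n) / (1 + df)) + 1, and each document row is then scaled to unit L2 length. Other libraries and textbooks may define idf slightly differently.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (an assumption for illustration, not from the post)
docs = [
    "apple banana apple",
    "banana cherry",
    "apple cherry cherry cherry",
]

# Term frequency: raw counts per document
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf, as with smooth_idf=True

# tf * idf, then L2-normalize each row (norm='l2')
tfidf_manual = tf * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare against scikit-learn with default settings
tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```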

Parameter tuning

def tokenize(document):
    """Return the lemmas of the alphabetic tokens that are neither stop words nor punctuation."""
    doc = nlp(document)
    return [
        token.lemma_.strip()
        for token in doc
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]

# TfidfVectorizer parameters used below:
#   ngram_range = (min_n, max_n): use n-grams (n consecutive tokens) with min_n <= n <= max_n as tokens.
#   min_df = n (int): use only tokens that appear in at least n documents.
#   max_df = m (float, 0 to 1): remove tokens that appear in more than m * 100% of the documents.

tfidf_tuned = TfidfVectorizer(stop_words='english'
                        ,tokenizer=tokenize
                        ,ngram_range=(1,2)
                        ,max_df=.7
                        ,min_df=3
                       )

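# df is assumed to be a DataFrame with a 'reviews.text' column, loaded beforehand (the loading code is not shown in this post).
dtm_tfidf_tuned = tfidf_tuned.fit_transform(df['reviews.text'])
dtm_tfidf_tuned = pd.DataFrame(dtm_tfidf_tuned.toarray(), columns=tfidf_tuned.get_feature_names_out())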
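The effect of `ngram_range`, `min_df`, and `max_df` is easier to see on a corpus small enough to count by hand. The sketch below uses four made-up review snippets (an assumption for illustration, not the review data above) and the default tokenizer instead of the spaCy one, so it runs without a language model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (an assumption for illustration, not from the post)
docs = [
    "good battery life",
    "good screen",
    "good battery",
    "bad screen",
]

# Unigrams only vs. unigrams plus bigrams
uni = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(uni.vocabulary_))
print(sorted(uni_bi.vocabulary_))

# min_df=2 keeps terms found in at least 2 documents;
# max_df=0.7 drops terms found in more than 70% of the documents.
filtered = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.7).fit(docs)
print(sorted(filtered.vocabulary_))
```

Here "good" appears in 3 of 4 documents (75%), so `max_df=0.7` removes it, while terms that appear in only one document, such as "life" and "bad", are removed by `min_df=2`.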

3. Similarity

from sklearn.neighbors import NearestNeighbors

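# Fit a NearestNeighbors model on the DTM, using the 5 nearest neighbors (the default).
# dtm_tfidf_amazon is not defined in this post; it is presumably a TF-IDF DTM DataFrame built as in the previous section.
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_tfidf_amazon)

# Distances to, and indices of, the rows closest to the row at index 2.
# iloc[[2]] keeps the row as a one-row DataFrame, so the feature names match those seen in fit.
nn.kneighbors(dtm_tfidf_amazon.iloc[[2]])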
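For a version that runs without the review data, here is a minimal sketch on four made-up documents (an assumption for illustration). It differs from the snippet above in one respect: it uses cosine distance with brute-force search rather than the Euclidean kd-tree, since cosine distance is a common choice for TF-IDF vectors and works directly on the sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus (an assumption for illustration, not from the post)
docs = [
    "the battery life is great",
    "great battery and great screen",
    "the delivery was late",
    "late delivery and damaged box",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# The cosine metric is not supported by kd_tree, so use brute-force search
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(X)

# Query with document 0; X[:1] keeps it as a one-row matrix
distances, indices = nn.kneighbors(X[:1])
print(indices)    # index of each neighbor, nearest first
print(distances)  # cosine distance to each neighbor
```

The nearest neighbor of a document that was part of the fitted data is the document itself, at a distance of about 0, so the second entry is the one of interest.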
