[NLP] Vector Similarity - (1) Cosine Similarity

김규리 · June 28, 2022


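Vector Similarity

Document similarity as people perceive it: how many identical or similar words the documents have in common.

Document similarity as a machine computes it: depends on how each document's words are converted into numbers (DTM, Word2Vec, etc.) and on which measure (Euclidean distance, cosine similarity, etc.) is used to compare the resulting vectors.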
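The two choices described above (how to vectorize, how to compare) can be sketched on a tiny example. This is a minimal illustration, not the post's own code: the three sentences are hypothetical, the count vectors are built by hand rather than with a library, and the helper names are made up for this sketch.

```python
import numpy as np

# Hypothetical toy documents (not from the original post)
docs = [
    "I like apples",
    "I like apples I like apples",
    "I dislike bananas",
]

# Step 1: turn documents into numbers (a tiny document-term matrix of word counts)
vocab = sorted({word for doc in docs for word in doc.lower().split()})
dtm = np.array([[doc.lower().split().count(word) for word in vocab] for doc in docs])

# Step 2: choose a measure to compare the vectors
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(vocab)
print(dtm)
print('doc 0 vs doc 1:', euclidean_distance(dtm[0], dtm[1]), cosine(dtm[0], dtm[1]))
print('doc 0 vs doc 2:', euclidean_distance(dtm[0], dtm[2]), cosine(dtm[0], dtm[2]))
```

Document 1 is document 0 repeated twice, so the two measures disagree about it: the Euclidean distance is nonzero, while the cosine similarity is (up to floating-point error) 1.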

1. Cosine Similarity

: The similarity of two vectors, computed from the cosine of the angle between them
( -1 ≤ S ≤ 1 )

Same direction: 1
90 degrees apart: 0
180 degrees apart (opposite direction): -1
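The three reference values can be checked numerically. The sketch below uses hypothetical 2-D vectors chosen so that the angles are exactly 0, 90, and 180 degrees; the results are compared with a tolerance because floating-point arithmetic may not give exactly 1 or -1.

```python
import numpy as np

def cos_sim(a, b):
    # Dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0])
same_direction = np.array([3.0, 6.0])    # positive multiple of v
orthogonal = np.array([-2.0, 1.0])       # dot product with v is 0
opposite = np.array([-1.0, -2.0])        # negative multiple of v

for name, other in [('same direction', same_direction),
                    ('90 degrees', orthogonal),
                    ('180 degrees', opposite)]:
    print(name, ':', round(float(cos_sim(v, other)), 6))
```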

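import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
  # Dot product divided by the product of the two vector norms
  return dot(A, B)/(norm(A)*norm(B))

doc1 = np.array([0,1,1,1])
doc2 = np.array([1,0,1,1])
doc3 = np.array([2,0,2,2])

# Values are printed rounded to two decimal places
print(f'Similarity of document 1 and document 2 : {cos_sim(doc1, doc2):.2f}')
print(f'Similarity of document 1 and document 3 : {cos_sim(doc1, doc3):.2f}')
print(f'Similarity of document 2 and document 3 : {cos_sim(doc2, doc3):.2f}')

Similarity of document 1 and document 2 : 0.67
Similarity of document 1 and document 3 : 0.67
Similarity of document 2 and document 3 : 1.00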

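└ Document 3 is document 2 with every word count doubled
└ The two vectors point in the same direction (same pattern), so the similarity is 1 (the maximum)
└ This allows a relatively fair comparison between documents of different lengths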
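To make the length point concrete, the sketch below compares document 2 with a scaled copy of itself and, as a contrast, with a hypothetical vector `shifted` (not in the original post) in which 1 is added to every component. Scaling keeps the direction, so cosine similarity stays at 1 even though the Euclidean distance grows; adding a constant changes the direction.

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc2 = np.array([1, 0, 1, 1])
doc3 = 2 * doc2        # [2, 0, 2, 2]: same pattern, longer document
shifted = doc2 + 1     # [2, 1, 2, 2]: hypothetical contrast, direction changes

print('Euclidean distance doc2-doc3 :', np.linalg.norm(doc2 - doc3))
print('Cosine similarity doc2-doc3  :', cos_sim(doc2, doc3))
print('Cosine similarity doc2-shifted:', cos_sim(doc2, shifted))
```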

2. Building a Recommendation System with Similarity

Build a system that recommends movies based on their plot summaries, using TF-IDF and cosine similarity
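Before working with a full dataset, the idea can be tried on a toy corpus. This is only a sketch under assumptions: the four overviews are hypothetical, and the missing overview is handled by replacing it with an empty string, which is one possible way to avoid the error that Null values cause in the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical plot summaries; the last one is missing
overviews = [
    "A masked vigilante protects a dark city from a criminal mastermind",
    "A vigilante in a mask fights a criminal gang in the city",
    "Two friends open a small bakery in a quiet seaside town",
    None,
]

# Replace missing overviews with empty strings before vectorizing
cleaned = [text if text is not None else "" for text in overviews]

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(cleaned)       # one row per document
sim = cosine_similarity(tfidf_matrix, tfidf_matrix)  # one row and column per document

print(tfidf_matrix.shape)
print(sim.shape)
print(sim.round(2))
```

The two vigilante summaries share several terms and so score higher with each other than with the bakery summary. The empty document becomes an all-zero row, so its similarity to everything is 0.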

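from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# `data` is assumed to be a pandas DataFrame, loaded beforehand, with 'title' and 'overview' columns.
# Computing TF-IDF raises an error if the data contains Null values,
# so missing overviews are replaced with empty strings here (an assumed preprocessing step).
data['overview'] = data['overview'].fillna('')

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data['overview'])
print('Shape of the TF-IDF matrix :',tfidf_matrix.shape)

# Shape of the TF-IDF matrix : (20000, 47487)

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print('Shape of the cosine similarity result :',cosine_sim.shape)

# Shape of the cosine similarity result : (20000, 20000)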

# Dictionary title_to_index: movie title as key, movie index as value
title_to_index = dict(zip(data['title'], data.index))

# Look up the index of the movie 'Father of the Bride Part II'
idx = title_to_index['Father of the Bride Part II']

# Given a movie title, find the 10 movies whose overviews are most similar by cosine similarity
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the selected movie from its title.
    idx = title_to_index[title]

    # Get the similarity between this movie and every movie.
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity, highest first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Take the 10 most similar movies, skipping the first entry
    # (normally the selected movie itself, with similarity 1).
    sim_scores = sim_scores[1:11]

    # Get the indices of those 10 movies.
    movie_indices = [i for i, _ in sim_scores]

    # Return the titles of those 10 movies.
    return data['title'].iloc[movie_indices]

get_recommendations('The Dark Knight Rises')

12481                            The Dark Knight
150                               Batman Forever
1328                              Batman Returns
15511                 Batman: Under the Red Hood
585                                       Batman
9230          Batman Beyond: Return of the Joker
18035                           Batman: Year One
19792    Batman: The Dark Knight Returns, Part 1
3095                Batman: Mask of the Phantasm
10122                              Batman Begins
Name: title, dtype: object
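The function above drops the first sorted entry on the assumption that it is the selected movie itself. If another movie has an identical overview, that assumption may not hold. A possible variant, sketched here with NumPy on a hypothetical 4x4 similarity matrix, excludes the query index explicitly before ranking; `top_k_similar` and `toy_sim` are names invented for this example.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return the indices of the k rows most similar to row idx, excluding idx."""
    scores = sim_matrix[idx].astype(float).copy()
    scores[idx] = -np.inf                        # never recommend the query itself
    order = np.argsort(-scores, kind="stable")   # highest similarity first, ties by index
    return order[:k].tolist()

# Hypothetical similarity matrix for four items
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.8],
    [0.8, 1.0, 0.3, 0.5],
    [0.1, 0.3, 1.0, 0.2],
    [0.8, 0.5, 0.2, 1.0],
])

print(top_k_similar(toy_sim, 0, k=2))
print(top_k_similar(toy_sim, 2, k=3))
```

With a real DataFrame, the returned positions could be passed to `data['title'].iloc[...]` in the same way as `movie_indices` above.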
