On the effectiveness of convolution autoencoders on image-based personalized recommendation[.,2020]

Sungchul Kim · December 27, 2021

Recommender-system


Introduction

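Personalized recommender systems have recently been deployed on many platforms (e.g., Netflix, YouTube).

This study focuses on recommending restaurants that match a user's taste, using user, item, and image information (TripAdvisor data is used).
The authors propose a convolutional autoencoder as the feature extractor for images.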

Method

model architecture

The user ID and restaurant ID are one-hot vectors, and the image is represented by a convolutional encoder. The convolutional encoder is described later in this post.

The user ID, restaurant ID, and image are each projected to the same dimension and then concatenated.
Because the three vectors have different scales, batch normalization is applied after concatenation to normalize them.
FC layers are stacked after batch normalization.
The result then passes through two Reduce blocks and an FC layer to produce the output (the output lies in [0,1]).
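The pipeline described above can be sketched as a single forward pass. This is only a minimal NumPy sketch with untrained random weights, not the authors' implementation: the embedding size `d`, the ReLU activations, the width of the FC layer after batch normalization, and the sigmoid at the output (chosen here so the score stays between 0 and 1) are all assumptions, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, out_dim):
    """Fully connected layer with freshly sampled (untrained) weights."""
    w = rng.normal(scale=0.1, size=(x.shape[1], out_dim))
    return x @ w

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (no learned scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def reduce_block(x):
    """FC layer that halves the dimension (dropout omitted in this sketch)."""
    return np.maximum(dense(x, x.shape[1] // 2), 0.0)

def forward(user_onehot, item_onehot, image_code, d=32):
    # Project the three inputs to the same dimension d, then concatenate.
    parts = [dense(user_onehot, d), dense(item_onehot, d), dense(image_code, d)]
    z = batch_norm(np.concatenate(parts, axis=1))
    z = np.maximum(dense(z, z.shape[1]), 0.0)  # FC layer after batch norm
    z = reduce_block(reduce_block(z))          # two Reduce blocks: 96 -> 48 -> 24
    logit = dense(z, 1)                        # final FC layer
    return 1.0 / (1.0 + np.exp(-logit))        # assumed sigmoid output

batch, n_users, n_items, code_dim = 4, 10, 6, 128
users = np.eye(n_users)[rng.integers(0, n_users, batch)]  # one-hot user IDs
items = np.eye(n_items)[rng.integers(0, n_items, batch)]  # one-hot restaurant IDs
codes = rng.normal(size=(batch, code_dim))                # stand-in image codes
scores = forward(users, items, codes)
print(scores.shape)  # (4, 1)
```

Each row of `scores` is one (user, restaurant, image) triple, so a value near 1 would be read as "like" once the weights are actually trained.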

Reduce block structure

Next is the Reduce block. Its purpose is to take the concatenated vector and build a joint representation of the user, item, and image.

Joint representation here means learning latent representations of items and users with a neural network, based on information such as images or user reviews.

The Reduce block has the following characteristics.

The input passes through an FC layer, and the final output dimension is half of the input dimension
Dropout is used
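A Reduce block with these two properties might look roughly like the following. This is a sketch under assumptions: the post does not state the activation or the dropout rate, so the ReLU and `drop_rate=0.5` are placeholders, and the function and parameter names are hypothetical.

```python
import numpy as np

def reduce_block(x, w, b, drop_rate=0.5, training=False, rng=None):
    """FC layer mapping dim -> dim // 2, followed by ReLU and dropout."""
    h = np.maximum(x @ w + b, 0.0)            # ReLU is an assumption
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(h.shape) >= drop_rate
        h = h * keep / (1.0 - drop_rate)      # inverted dropout keeps the expected scale
    return h

rng = np.random.default_rng(0)
in_dim = 96
x = rng.normal(size=(4, in_dim))
w = rng.normal(scale=0.1, size=(in_dim, in_dim // 2))
b = np.zeros(in_dim // 2)
out = reduce_block(x, w, b, training=True, rng=rng)
print(out.shape)  # (4, 48)
```

Dropout is only active when `training=True`; at inference time the block is just the halving FC layer and its activation.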

Convolution Autoencoder

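The architecture described above works by combining the user and item with an image that carries visual information.

In general, an autoencoder is a model that focuses on reconstructing the input data ( $x$ ) from information projected into a low-dimensional space ( $h$ ). The model consists of two main parts, an encoder and a decoder.

The notation is as follows.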

$h = f_{\theta}(x) = \sigma(Wx + b)$

$\theta$ = { $W$ , $b$ }

$W$ = weight matrix

$b$ = bias vector

$\sigma$ = activation function
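As a small illustration of this notation, the encoder can be written directly from the formula. The decoder below is an assumption: the notation above only defines the encoder, so the decoder parameters `W_dec` and `b_dec`, the sigmoid choice for $\sigma$, and the mean squared reconstruction error are hypothetical choices for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """h = sigma(W x + b): project the input to a lower dimension."""
    return sigmoid(W @ x + b)

def decode(h, W_dec, b_dec):
    """Map the code h back to input space (assumed decoder form)."""
    return sigmoid(W_dec @ h + b_dec)

rng = np.random.default_rng(0)
x = rng.random(8)                                    # input of dimension 8
W, b = rng.normal(size=(3, 8)), np.zeros(3)          # encoder parameters theta = {W, b}
W_dec, b_dec = rng.normal(size=(8, 3)), np.zeros(8)  # hypothetical decoder parameters
h = encode(x, W, b)
r = decode(h, W_dec, b_dec)
loss = np.mean((x - r) ** 2)  # reconstruction error that training would minimize
print(h.shape, r.shape)  # (3,) (8,)
```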

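Like a plain autoencoder, the convolutional autoencoder has an encoder-decoder structure. The encoder passes the image ( $x$ ) through convolution and max-pooling layers to extract features and project them into a low-dimensional representation ( $h$ ). The decoder passes that representation through convolution and upsampling layers to produce the reconstruction ( $r$ ).

The final structure of the convolutional autoencoder is as follows.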

Notation

Building Block : 3x3 Conv + Batch Normalization + ReLU
Maxpooling layer : $dim$ /2
UpSampling layer : $dim$ x 2
$m$ : number of feature maps at the end of the block or layer
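To make the $dim$ /2 and $dim$ x 2 notation concrete, here is a minimal NumPy sketch of the two resizing layers on a single feature map. The 2x2 pooling window and nearest-neighbour upsampling are assumptions; the notation only says that the spatial size is halved or doubled.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on an (H, W, C) array; H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour upsampling on an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4, 1)  # one 4x4 feature map
pooled = max_pool_2x2(x)       # encoder side: dim / 2
restored = upsample_2x(pooled) # decoder side: dim x 2
print(pooled.shape, restored.shape)  # (2, 2, 1) (4, 4, 1)
```

Upsampling restores the spatial size but not the discarded values, which is why the decoder also needs convolution layers to learn the reconstruction.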

Experiments

Data description

The authors use TripAdvisor review data as the dataset (collection period: 2018~2019).

The TripAdvisor review data consists of restaurant data from cities in specific countries.

It covers specific users and restaurants in Santiago de Compostela (Spain), Barcelona (Spain), and New York (USA), together with the users' reviews. (Here a review means a review of the food served at the restaurant.)

The experiments in this study were run on the premise that image information must be available.

Accordingly, only reviews that contain images are used, and their counts are 7003, 66904, and 111415, respectively. Because each review contains one or more images, the total numbers of images are 16168, 153707, and 234689, respectively.
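As a quick sanity check on these numbers, the average number of images per review can be computed directly. The mapping of counts to cities below assumes the counts are listed in the same order as the cities above; the post does not state this explicitly.

```python
# Counts from the post; the city order is an assumption.
reviews = {"Santiago de Compostela": 7003, "Barcelona": 66904, "New York": 111415}
images = {"Santiago de Compostela": 16168, "Barcelona": 153707, "New York": 234689}

images_per_review = {city: images[city] / reviews[city] for city in reviews}
for city, ratio in images_per_review.items():
    print(f"{city}: {ratio:.2f} images per review")
```

Each dataset comes out at a little over two images per review.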

The total images consist of positive and negative samples, labeled 1 for positive and 0 for negative (like or dislike).

Imbalance data

The table above shows that the positive and negative samples are imbalanced. To address this, the authors applied augmentation to the training data.

→ The ratio of positive to negative samples is brought close to 1:1, which resolves the imbalance problem.
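One way to picture this balancing step is the sketch below, which adds augmented copies of minority-class samples until both classes have the same count. The post does not say which augmentations the authors used, so the random horizontal flip, the toy data, and the function names here are all hypothetical.

```python
import random

def balance_by_augmentation(samples, labels, augment, seed=0):
    """Add augmented minority-class copies until the classes are 1:1."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    minority, minority_label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    deficit = abs(len(pos) - len(neg))
    if deficit and not minority:
        raise ValueError("cannot augment an empty class")
    extra = [augment(rng.choice(minority), rng) for _ in range(deficit)]
    return samples + extra, labels + [minority_label] * deficit

def random_flip(image, rng):
    """Hypothetical augmentation: flip a 2D list image left-right half the time."""
    flip = rng.random() < 0.5
    return [row[::-1] if flip else row[:] for row in image]

train_images = [[[i, i + 1]] for i in range(10)]  # ten toy 1x2 "images"
train_labels = [1] * 8 + [0] * 2                  # imbalanced: 8 positive, 2 negative
new_images, new_labels = balance_by_augmentation(train_images, train_labels, random_flip)
print(new_labels.count(1), new_labels.count(0))  # 8 8
```

As in the post, this would be applied to the training split only, so the evaluation data keeps its original class distribution.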

Performance comparison table by model

Convolution AutoEncoder
Feature extraction using a pre-trained ResNet50 model
Fine-tuned ResNet50 model

Trade-off between sensitivity and specificity

This study compares performance using the B-score as the metric.
The formula is as follows.

$\text{B-score} = 2 \cdot \frac{\text{sensitivity} \cdot \text{specificity}}{\text{sensitivity} + \text{specificity}}$

Compared by B-score, the Convolution AutoEncoder achieves the best performance.
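The B-score formula above is the harmonic mean of sensitivity and specificity, and it can be computed from confusion-matrix counts. The sketch below uses made-up counts purely for illustration; they are not results from the paper, and the function name is hypothetical.

```python
def b_score(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity.

    Assumes at least one positive and one negative sample.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# Illustrative counts only: sensitivity = 0.8, specificity = 0.3.
mixed = b_score(tp=80, fn=20, tn=30, fp=70)
# A model that predicts "positive" for everything: sensitivity = 1, specificity = 0.
always_positive = b_score(tp=100, fn=0, tn=0, fp=100)
print(round(mixed, 4), always_positive)  # 0.4364 0.0
```

The second case shows why this metric suits imbalanced data: a model that ignores one class entirely scores 0, no matter how well it does on the other class.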

Conclusion

In conclusion, using a convolutional autoencoder as the feature extractor gave good performance for personalized recommendation. Two things stood out to me: the study applied augmentation to the images to address the data imbalance problem, and it used Reduce blocks to focus on joint representation.
