[논문 리뷰] Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)

각 기호에 대한 설명 $\mathcal L_0$ : graph의 labeled 된 부분에 대한 supervised loss $f(\cdot)$ : 미분가능한 neural network $\lambda$ : weighing factor $X$ : node feature vectors $X_i$ 의 matrix $\Delta = D-A$ : undirected graph $\mathcal G=(\mathcal{V,E})$ 의 graph Laplacian

이 식은 연결된 노드가 같은 label를 공유하고 있다고 가정하지만, 이러한 가정은 모델의 능력을 제한한다. → edge가 graph similarity가 아니라 다른 정보를 담고 있을 수 있기 때문에.

본 연구에서 neural network mode $f(X,A)$ 를 직접적으로 사용해서 graph structure를 이용하고, supervised target $\mathcal L_0$ 를 label이 있는 모든 node에 대해서 학습해, loss function에서 직접적인 graph-based regularization를 피해 간다.

두 가지 기여
1. graph에서 직접적으로 작동하는 neural network model을 위한 단순하고 잘 작동하는 layer-wise propagation rule을 보이고, spectral graph convolutions의 first-order approximation에서 어떻게 유도되는가.
2. graph-based neural network가 어떻게 빠르고 scalable한 graph에서의 semi-supervised node classfication에 사용될 수 있는가.

Fast approximate convolutions on graphs

multi-layer Graph Convolutional Network (GCN)은 다음 layer-wise propagation rule을 따른다

H^{(l+1)}=\sigma(\tilde{D}^{\frac{1}{2}}\tilde{A}\tilde{D}^{\frac{1}{2}}H^{(l)}W^{(l)})

각 기호에 대한 설명 $\tilde{A}=A+I_N$ : self-connection이 더해진 adjacency matrix $\tilde{D}_{ii}=\sum_j \tilde{A}_{ij}$ $W^{(l)}$ : layer-specific한 학습 가능한 weight matrix $\sigma(\cdot)$ : activation function e.g. ReLU $H^{(l)}$ : $l$ 번째 layer의 activation matrix

이러한 propagation rule이 graph에 대한 localized spectral filters에서 first-order approximation을 통해 유도됨을 보인다.

Spectral Graph Convolutions

먼저, Spectral convolutions on graphs에서 시작한다.

g_\theta\star x = U g_\theta U^T x

위 수식을 계산하기 위해서는 Computation이 비싸기 때문에, Chebyshev polynomials를 사용해 다음과 같은 $K^{th}$ order까지 truncated expansion을 사용해 계산한다. Graph GCN 논문의 (4)번 수식 참고.

g_{\theta^{\prime}}(\Lambda) \approx \sum^K_{k=0} \theta^{\prime}_k T_k (\tilde{\Lambda})

여기서 $\tilde{\Lambda}=\frac{2}{\lambda_{max}}\Lambda - I_N$ 으로 rescaled된 값이다. 이를 신호 x에 대한 convolutional signal에 대한 정의로 돌아가 적용시키면 다음과 같다.

g_\theta \star x \approx \sum^K_{k=0} \theta^{\prime}_k T_k (\tilde{L})x

여기서 $\tilde{L}=\frac{2}{\lambda_{max}}L - I_N$ 이다.

Layer-Wise Linear Model

이제 본 논문의 핵심 부분이다. graph convolutions에 근거한 NN 모델은 여러개의 conv layer를 쌓아서 만들어진다. 우리는 이제 layer-wise conv를 $K=1$ 에 제한해서 적용시키겠다. 이는 graph Laplacian spectrum에 선형적인 함수가 된다.

이렇다면 우리는 여전히 여러 conv layers를 쌓으면서 풍부한 표현력을 유지하지만, Chebyshev 다항식이 주는 파라미터화에 제약되지 않는다. 이는 과적합 문제를 해소할 수 있다.

GCN에서 $\lambda_{max}\approx 2$ 로 가정한다. 원래는 계산해야 하지만 Neural Network가 깊어지면 다른 파라미터들이 이 역할을 대신 해 준다.

g_{\theta^\prime}\star x \approx \theta^{\prime}_0 x + \theta^{\prime}_1 (L - I_N)x =\theta^\prime_0 x - \theta^\prime_1 D^{-\frac{1}{2}}AD^{-\frac{1}{2}}x

여기서 두 개의 free parameter $\theta^\prime_0$ 와 $\theta^\prime_1$ 이 생긴다. 이 필터 파라미터들은 전체 graph들에서 공유되어 사용된다.

실용적인 관점에서 파라미터 수를 더 줄이는게 과적합 등의 문제 해결에 도움이 된다. 그렇기 때문에 $\theta=\theta^\prime_0=-\theta^\prime_1$ 로 정의해 다음과 같은 식을 얻는다.

g_\theta \star x \approx \theta(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}})x

그렇지만 이 연산도 기울기 소실/폭발 문제가 생긴다. 그래서 renormalization trick을 적용해서 최종적으로 다음과 같은 구조를 얻는다.

Z=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}X\Theta

이 filtering operation의 시간 복잡도는 $\mathcal{O}(|\mathcal{E}|FC)$ 이다.

Semi-supervised node classification

Graph에 적용될 수 있는 Convolutional Filter를 정의했다. 이제 원래 문제인 Semi-supervised 상황에 이를 적용해보자. 우리가 정의하는 모델은 다음과 같다.

Z=f(X,A)=\mathrm{softmax}(\hat{A} [\mathrm{ReLU}(\hat{A}XW^{(0)}) ]W^{(1)})

cross-entropy를 통한 학습에서 전체의 5%의 label만 가지고 성공적으로 node classfication을 해 냈다.

Limitation

Memory requirement → full-batch를 사용한다. mini-batch를 사용하는 방법 연구 필요
Directed edges and edge features → 지금은 undirected만 지원. 다만 방향을 나타내는 node를 추가한다면 활용될 수 있다.
Limiting assumptions → trade-off parameter $\lambda$ 를 도입하는 것이 좋은 경우도 있다.

Conclusion

Graph-structured data에서 semi-supervised classfication을 수행하는 독특한 방법 도입.

GCN model은 spectral convolutions on graph에 대한 first-order approximation을 기반으로 한 효과적인 layer-wise propagation rule을 사용했다.

저자가 뭘 해내고 싶어했는가?

Spectral Graph Processing을 first-order approximation한 model로 graph의 node에 대한 semi-supervised learning을 했다. 이는 Spatial Graph learning으로 이어지는 길목을 만들어 주었다.

이 연구의 접근에서 중요한 요소는 무엇인가?

Spectral conv에 대한 first-order approximation을 통한 layer-wise의 계산.

복잡한 계산은 Layer의 깊이가 깊어지면 해결되리라는 발상.

파라미터 수를 줄이기, 기울기 소실을 해결하기 위한 renomalization.

이해가 안 된 부분은 무엇인가?

큰 그림은 이해할 수 있었다. Spectral Graph Processing에서 시작했지만 결과적으로 익숙한 형태의 Deep learning이 되었기 때문에 Graph Signal Processing에 대한 지식 없이 이해할 수 없었다.

Graph GCN과 공통되고 Chebyshev polynomials을 사용해 truncated expansion을 하는 과정이 나오는데, 이 부분이 이해가 부족하다. GNN이 디지털 신호 처리와 관련이 있을 줄이야…

Layer-Wise에서 주황펜 부분은 잘 이해가 안 된다.

내가 참고하고 싶은 다른 레퍼런스에는 어떤 것이 있는가?

David K. Hammond, Pierre Vandergheynst, and Remi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

가장 중요한 Layer-Wise linear model 부분은 딱히 참고한 Reference가 없는건가? 독창적인 발상이다.