Efficient implementation of Random walk with restart using Python

Hanseok Jo·2021년 5월 7일

PageRank RWR Random walk with restart python

1. Random walk with restart란?

random walk는 말 그대로 무작위적인 움직임을 얘기하며, 특히 그래프의 노드간의 무작위적인 움직임을 통해 특정 노드의 값을 다른 노드로 전달하는 알고리즘입니다. 전통적으로 node centrality1, node similarity2같이 노드의 속성을 정의하기 위해 사용되기도 하였으며 특히 요즘 많은 관심을 받고있는 semi-supervised learning 기법 중 label propagation3에도 사용이 되었습니다.

with restart는 out-bound가 없는 노드에서 빠져나오기 위해 사전에 정의된 확률로 원점으로 돌아온다는 의미입니다.

본문에서는 random walk with restart는 긴 표현 대신 RWR이라는 약어로 대신 하겠습니다.

1.1 RWR 공식과 알고리즘

RWR이 처음 제안된 논문4의 공식과 알고리즘을 간단히 표현하자면 다음과 같습니다.

\vec{r_t} = (1-p)A^T\vec{r_{t-1}} + p\vec{r_0}

여기서 $\vec{r_t}$ 는 $t$ iteration에서 각 노도 값의 vector이고, $\vec{r_0}$ 는 타겟 노드 혹은 시작 노드들이 1로 label된 binary vector, $A$ 는 column-normalized된 adjacency matrix, $p$ 는 0과 1 사이의 실수값으로 restart할 확률을 의미합니다.

RWR의 pseudo code는 다음과 같습니다.

Initialize r_0
Normalize the adjacenc matrix A by column
Initialize r_t = r_0
while (r_t has not converged)
	# A * r_t: matrix multiplication
	update r_t = (1-p)*A*r_t + pr_t

1.2 RWR과 page rank와의 관계

PageRank는 Google의 창립자인 Sergey Brin과 Larry Page가 만든 검색 알고리즘입니다. Google의 초기 검색 엔진에도 사용이 된 알고리즘이기도 합니다. 논문5의 page rank 공식과 알고리즘은 다음과 같습니다.

PageRank 공식

PageRank 알고리즘. 여기서 A는 row-normalized adjacency matrix입니다.

논문에 나온 알고리즘을 pseudo code로 옮기면 아래와 같습니다.

Initialize R_0
while (delta > epsilon)
	R_prev = R_curr
	R_curr = A * R_prev  # matrix multiplication
	d = l1_norm(R_prev) - l1_norm(R_curr)
	R_curr += d * E  # matrix multiplication
    epsilon = l1_norm(R_curr - R_prev)

RWR 공식과 알고리즘, PageRank의 공식과 알고리즘을 비교하면 다음과 같은 유사성을 발견할 수 있습니다.

	RWR	PageRank
adjacency matrix를 이용한 update	$A^T\vec{r_{t-1}}$	$\sum\limits_{v\in B_{u}}{R^{\prime}(v) \over {N_v}}$
사전에 정의된 vector를 보존	$\vec{r_0}$	$E(u)$
업데이트 할 값과 보존할 값을 weighted sum	$1-p$ , $p$	$1$ , $d$

RWR과 PageRank는 근본적으로 유사하기 때문에 RWR을 가장 쉽게 구현할 수 있는 방법은 구현된 PageRank를 적절히 사용하는 것입니다. Python에서 PageRank를 가장 쉽게 사용할 수 있는 방법은 네트워크(=그래프)를 다루는 패키지인 NetworkX6,7의 networkx.algorithms.link_analysis.pagerank_alg.pagerank (이하 nx.pagerank)입니다. NetworkX의 parameter를 수정해서 RWR을 구현해보도록 하겠습니다.

2. Random walk with restart 구현

2.1. nx.pagerank

nx.pagerank의 parameter8는 다음과 같습니다.

G (graph) – A NetworkX graph. Undirected graphs will be converted to a directed graph with two directed edges for each undirected edge.

alpha (float, optional) – Damping parameter for PageRank, default=0.85.

personalization (dict, optional) – The “personalization vector” consisting of a dictionary with a key some subset of graph nodes and personalization value each of those. At least one personalization value must be non-zero. If not specfiied, a nodes personalization value will be zero. By default, a uniform distribution is used.

max_iter (integer, optional) – Maximum number of iterations in power method eigenvalue solver.

tol (float, optional) – Error tolerance used to check convergence in power method solver.

nstart (dictionary, optional) – Starting value of PageRank iteration for each node.

weight (key, optional) – Edge data key to use as weight. If None weights are set to 1.

dangling (dict, optional) – The outedges to be assigned to any “dangling” nodes, i.e., nodes without any outedges. The dict key is the node the outedge points to and the dict value is the weight of that outedge. By default, dangling nodes are given outedges according to the personalization vector (uniform if not specified). This must be selected to result in an irreducible transition matrix (see notes under google_matrix). It may be common to have the dangling dict to be the same as the personalization dict.

각 paremeter와 RWR 공식에 대응하는 부분은 다음과 같습니다.

parameter	RWR
G	$A$
alpha	$1-p$
personalization	$\vec{r_0}$
nstart	$\vec{r_1}$
weight	$A$

나머지 parameter 중 max_iter, tol은 학습에 사용되는 공통된 parameter이고 dangling은 personalization과 중복되는 부분이 있어 생략 가능합니다.

2.2. nx.pagerank를 이용한 간단한 구현

일단 간단한 unweighted, undirected 랜덤 그래프를 생성해 후 0~9번 node에서 restart하는 RWR 코드는 다음과 같습니다.

import networkx as nx
rand_graph = nx.random_graphs.barabasi_albert_graph(n=3000, m=50)
rwr = nx.pagerank(rand_graph, personalization={i: 1 for i in range(10)})

NetworkX에서는 nx.pagerank 외에도 nx.pagerank_numpy , nx.pagerank_scipy 라는 함수로 pagerank를 사용할 수 있습니다. 사용하는 방법은 nx.pagerank와 동일합니다.

rwr_numpy = nx.pagerank_numpy(rand_graph, personalization={i: 1 for i in range(10)})
rwr_scipy = nx.pagerank_scipy(rand_graph, personalization={i: 1 for i in range(10)})

2.3. nx.pagerank 구현 방식에 따른 차이

jupyter notebook의 timeit을 이용하여 세 함수의 속도를 측정해보았습니다.

%%timeit
rwr = nx.pagerank(rand_graph, personalization={i: 1 for i in range(10)})
> 2.78 s ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
rwr = nx.pagerank_numpy(rand_graph, personalization={i: 1 for i in range(10)})
> 20.1 s ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
rwr = nx.pagerank_scipy(rand_graph, personalization={i: 1 for i in range(10)})
> 202 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

속도를 비교해보면 nx.pagerank_scipy > nx.pagerank> nx.pagerank_numpy 가 됩니다. 속도 차이가 많이 나는 이유는 각 함수가 구현된 방식을 비교해보면 알 수 있습니다. nx.pagerank는 각 노드를 순회하면서 값을 업데이트하는 iteration 방식, nx.pagerank_numpy는 eigenvalue와 eigenvector를 기반으로한 linear algebra 방식, nx.pagerank_scipy는 sparse matrix를 이용한 matrix multiplication 방식입니다. 그러다보니 eigenvalue, eigenvector를 구해야 하는 nx.pagerank_numpy가 연산이 가장 크고 느리며, iteration 방식보다 matrix multiplication 방식이 빠른건 명확하기 때문에 nx.pagerank_scipy가 nx.pagerank보다 빠릅니다.

여기서 끝나면 이 포스팅을 하는게 의미가 없겠죠? nx.pagerank의 iteration 부분을 multiprocessing으로 바꾸면 속도가 개선될 것 같은 생각이 들어서 nx.pagerank 내부를 뜯어보았습니다.

3. 효율적인 pagerank의 구현

3.1. nx.pagerank 뜯어보기

nx.pagerank의 원본 코드9 중 RWR과 관련없는 코드를 제거한 후 iteration별 시간과 총 시간을 측정해 보았습니다.

import time
def pagerank(G, personalization, alpha=0.85, max_iter=100, tol=1.0e-6, weight="weight"):
    start_time = time.time()
    profile_dict = {}
    
    if len(G) == 0:
        return {}, profile_dict

    if not G.is_directed():
        D = G.to_directed()
    else:
        D = G

    # Create a copy in (right) stochastic form
    W = nx.stochastic_graph(D)
    N = W.number_of_nodes()

    # Initialize scores
    x = dict.fromkeys(W, 1.0 / N)
    s = float(sum(personalization.values()))
    p = {k: v / s for k, v in personalization.items()}

    # power iteration: make up to max_iter iterations
    profile_dict["iter_time"] = []
    for iter_num in range(max_iter):
        iter_start_time = time.time()
        
        xlast = x
        x = dict.fromkeys(xlast.keys(), 0)
        for n in x:
            # this matrix multiply looks odd because it is doing a left multiply x^T=xlast^T*W
            for nbr in W[n]:
                x[nbr] += alpha * xlast[n] * W[n][nbr][weight]
            x[n] += (1.0 - alpha) * p.get(n, 0)

        iter_end_time = time.time()
        profile_dict["iter_time"].append(iter_end_time - iter_start_time)
        
        # check convergence, l1 norm
        err = sum([abs(x[n] - xlast[n]) for n in x])
        if err < N * tol:
            
            end_time = time.time()
            profile_dict["total_time"] = end_time - start_time
            profile_dict["iter_num"] = iter_num
            return x, profile_dict
    raise nx.PowerIterationFailedConvergence(max_iter)

rwr, report = pagerank(G=rand_graph, personalization={node: 1 for node in range(10)})
mean_iter_time = np.mean(report["iter_time"])
total_time = report["total_time"]
iter_num = report["iter_num"]
print(f"Time per iter: {mean_iter_time:.3f}\tTotal time: {total_time:.3f}\tNumber of iter: {iter_num}")
> Time per iter: 0.313	Total time: 2.655	Number of iter: 3

총 시간 2.65초 중 0.93초 정도가 iteration을 도는데 사용이 되었습니다. 함수가 동작하는 총 시간 중에 iteration에 사용된 시간보다 iteration 이전에 nx.Graph를 pagerank를 계산하기 위해 initialize 하는 부분이 더 오래 걸렸습니다. iteration 부분을 multiprocessing을 통해 1/9으로 줄인다고 해도 총 시간은 0.8초 정도 단축된 1.85초 정도라서 0.2초 정도 걸리는 nx.pagerank_scipy보다 여전히 느립니다.

iteration을 개선하기 위해 nx.pagerank를 뜯어보았더니 initialize에 걸리는 시간을 단축 시켜야 한다는 결론을 얻었습니다. nx.pagerank_scipy는 어떨까요?

3.2. nx.pagerank_scipy 뜯어보기

3.1과 같이 nx.pagerank_scipy의 원본 코드10 중 RWR과 관련없는 코드를 제거한 후 iteration별 시간과 총 시간을 측정해 보았습니다.

import scipy.sparse
def pagerank_scipy(G, personalization, alpha=0.85, max_iter=100, tol=1.0e-6, weight="weight"):    
    start_time = time.time()
    profile_dict = {}
    
    N = len(G)
    if N == 0:
        return {}, profile_dict

    nodelist = list(G)
    M = nx.to_scipy_sparse_matrix(G, nodelist=nodelist, weight=weight, dtype=float)
    S = np.array(M.sum(axis=1)).flatten()
    S[S != 0] = 1.0 / S[S != 0]
    Q = scipy.sparse.spdiags(S.T, 0, *M.shape, format="csr")
    M = Q * M

    # initial vector
    x = np.repeat(1.0 / N, N)

    # Personalization vector
    p = np.array([personalization.get(n, 0) for n in nodelist], dtype=float)
    p = p / p.sum()

    # power iteration: make up to max_iter iterations
    profile_dict["iter_time"] = []
    for _ in range(max_iter):
        iter_start_time = time.time()
        xlast = x
        x = alpha * x * M + (1 - alpha) * p
        iter_end_time = time.time()
        profile_dict["iter_time"].append(iter_end_time - iter_start_time)
        
        # check convergence, l1 norm
        err = np.absolute(x - xlast).sum()
        if err < N * tol:
            end_time = time.time()
            profile_dict["total_time"] = end_time - start_time
            profile_dict["iter_num"] = iter_num
            return dict(zip(nodelist, map(float, x))), profile_dict
    raise nx.PowerIterationFailedConvergence(max_iter)

rwr, report = pagerank_scipy(G=rand_graph, personalization={node: 1 for node in range(10)})
mean_iter_time = np.mean(report["iter_time"])
total_time = report["total_time"]
iter_num = report["iter_num"]
print(f"Time per iter: {mean_iter_time:.3f}\tTotal time: {total_time:.3f}\tNumber of iter: {iter_num}")
> Time per iter: 0.000	Total time: 0.371	Number of iter: 3

nx.pagerank_scipy 역시 initialize 하는 시간이 RWR을 계산하는 시간보다 더 오래 걸렸습니다. 기존 코도의 initialize 하는 부분에서 nx.Graph를 scipy의 sparse matrix로 변환하는 부분이 있는데 sparse matrix로 연산하는 것보다 numpy를 이용하는 것이 더 효율적이라고 알고 있어서 nx.Graph를 np.array로 변환하여 사용하는 그리고 RWR 공식에 맞는 변수명을 사용하여 새로운 initialize 함수를 만들어서 각 단계별로 시간을 측정해 보았습니다.

3.3. RWR initialization

def initialize_profiling(G, restart_nodes):
    t0 = time.time()
    A = nx.to_numpy_array(G)
    t1 = time.time()
    norm_A = A / A.sum(axis=0)[:,np.newaxis]
    t2 = time.time()

    personalization = {node: 1/len(restart_nodes) for node in restart_nodes}
    r_0 = np.array([personalization.get(n, 0) for n in range(len(G))])
    t3 = time.time()

    r_curr = np.repeat(1 / len(G), len(G))
    t4 = time.time()
    
    return t1-t0, t2-t1, t3-t2, t4-t3

initialized = initialize_profiling(rand_graph, restart_nodes=[i for i in range(10)])
t_convert_array, t_norm, t_r_0, t_repeat = initialized
print(t_convert_array, t_norm, t_r_0, t_repeat)
> 0.1689596176147461 0.023035526275634766 0.0005831718444824219 5.221366882324219e-05

이 코드에서는 nx.Graph를 np.array로 바꾸는데 시간이 가장 오래 걸렸습니다. t_r_0와 t_repeat 같은 경우는 시간을 단축시킬 방법이 보이지 않아 nx.to_numpy_array 함수11를 뜯어보았습니다. 전체 함수 코드 중 핵심적인 부분은 다음과 같습니다.

A = np.full((nlen, nlen), np.nan, order=order)
for u, nbrdict in G.adjacency():
	for v, d in nbrdict.items():
		try:
			A[index[u], index[v]] = d.get(weight, 1)
		except KeyError:
			# This occurs when there are fewer desired nodes than
			# there are nodes in the graph: len(nodelist) < len(G)
			pass

RWR 공식에서 사용되는 $A$ 는 column-normalized matrix인데 실제 계산에서는 $A^T$ 가 사용됨으로 결국 row-normalized matrix입니다. 노드 i에서 이웃 노드로 walk 할 때 노드 i의 값이 이웃 노드의 수만큼 normalize 된다는 의미입니다. 이 관점에서 보았을 때 나중에 np.sum(A, axis=0)을 이용하여 normalize 하지말고 이웃 노드의 weight의 정보를 담고 nbrdict를 순환하며 adjacency matrix를 만들 때 normalize 된 값을 받아오면 더 효율적일 것이라고 판단하여 initialize_profiling 함수를 다음과 같이 수정하였습니다.

def initialize_custom_adj_profiling(G, restart_nodes, is_weighted=False):
    t0 = time.time()
    norm_A = np.zeros(shape=(len(G), len(G)))
    if is_weighted:
        for i, neighbor_dict in G.adjacency():
            for j, v in neighbor_dict.items():
                norm_A[i][j] = v.get("weight", 1 / len(neighbor_dict))
    else:
        for i, neighbor_dict in G.adjacency():
            for j, v in neighbor_dict.items():
                norm_A[i][j] = 1 / len(neighbor_dict)
    t1 = time.time()

    personalization = {node: 1/len(restart_nodes) for node in restart_nodes}
    r_0 = np.array([personalization.get(n, 0) for n in range(len(G))])
    t2 = time.time()

    r_curr = np.repeat(1 / len(G), len(G))
    t3 = time.time()
    
    return t1-t0, t2-t1, t3-t2

initialized = initialize_custom_adj_profiling(rand_graph, restart_nodes=[i for i in range(10)])
t_convert_array, t_r_0, t_repeat = initialized
print(t_convert_array, t_r_0, t_repeat)
> 0.13943934440612793 0.0005867481231689453 5.3882598876953125e-05

기존의 initialize_profiling함수를 이용할 때보다 normalized matrix를 구하는 부분이 0.192초(0.169+0.023)에서 28%가 감소한 0.139초로 줄었습니다.

3.4. RWR with pure numpy

위의 삽질을 통해 기존의nx.pagerank보다 개선된 RWR 코드는 다음과 같습니다.

def custom_initialize(G, restart_nodes, is_weighted=False):
    norm_A = np.zeros(shape=(len(G), len(G)))
    if is_weighted:
        for i, neighbor_dict in G.adjacency():
            for j, v in neighbor_dict.items():
                norm_A[i][j] = v.get("weight", 1 / len(neighbor_dict))
    else:
        for i, neighbor_dict in G.adjacency():
            for j, v in neighbor_dict.items():
                norm_A[i][j] = 1 / len(neighbor_dict)

    personalization = {node: 1/len(restart_nodes) for node in restart_nodes}
    r_0 = np.array([personalization.get(n, 0) for n in range(len(G))])

    r_curr = np.repeat(1 / len(G), len(G))
    
    return{"A": norm_A, "r_0": r_0, "r_curr": r_curr, "N": len(G)}
def rwr_numpy(r_curr, r_0, A, N, alpha=0.85, max_iter=100, tol=1.0e-6):
    for iter_num in range(max_iter):
        r_prev = r_curr
        r_curr = alpha * r_curr @ A + (1 - alpha) * r_0
        err = np.absolute(r_curr - r_prev).sum()
        if err < N * tol:
            return r_curr
    return "NotConverged"

%%timeit
initialized = custom_initialize(rand_graph, restart_nodes=[i for i in range(10)])
r_curr = initialized["r_curr"]
r_0 = initialized["r_0"]
A = initialized["A"]
N = initialized["N"]

rwr = rwr_numpy(r_curr, r_0, A, N)
> 164 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

위 코드를 이용하여 성능을 측정해본 결과 기존 nx.pagerank 대비 6%(2.78s -> 164ms), 기존 nx.pagerank_scipy 대비 81%(202ms -> 164ms) 로 시간이 줄었습니다.

4. 결론

간단하게 세줄로 요약이 가능합니다.

RWR은 nx.pagerank로 쉽게 구현할 수 있지만 nx.pagerank_scipy가 더 빠릅니다.
nx.pagerank_scipy에서도 nx.Graph를 행렬로 변환하는 과정에서 비효율적인 부분이 있습니다.
비효율적인 코드를 고쳤더니 더 빨라졌습니다.

추가적으로 numpy로만 구현한 rwr_numpy에 numba의 @njit 데코레이터를 이용해서 성능 개선을 시도하였으나 별 차이가 없었습니다. 이와 관련해서 이유를 아시는분이 있다면 댓글로 남겨주시면 감사하겠습니다🙇‍♂️

5. References

Newman, Mark EJ. "A measure of betweenness centrality based on random walks." Social networks 27.1 (2005): 39-54.
Fouss, Francois, et al. "Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation." IEEE Transactions on knowledge and data engineering 19.3 (2007): 355-369.
and data engineering, vol. 19, no. 3, pp. 355–369, 2007.
Zhu, Xiaojin, and Zoubin Ghahramani. "Learning from labeled and unlabeled data with label propagation." (2002).
Pan, Jia-Yu, et al. "Automatic multimedia cross-modal correlation discovery." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004.
Page, Lawrence, et al. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, 1999.
Hagberg, Aric, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. No. LA-UR-08-05495; LA-UR-08-5495. Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008.
https://networkx.org/documentation/stable/index.html
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html#networkx.algorithms.link_analysis.pagerank_alg.pagerank
https://networkx.org/documentation/stable/_modules/networkx/algorithms/link_analysis/pagerank_alg.html#pagerank
https://networkx.org/documentation/stable/_modules/networkx/algorithms/link_analysis/pagerank_alg.html#pagerank_scipy
https://networkx.org/documentation/stable/_modules/networkx/convert_matrix.html#to_numpy_array

Hanseok Jo

AITRICS에서 ML, CS, Statistics를 이용해서 Drug discovery를 하고 있습니다.

다음 포스트