๐ŸŽฒ[์•Œ๊ณ ๋ฆฌ์ฆ˜] Edit distance

์ด๊ฐ•์šฑยท2022๋…„ 10์›” 16์ผ
1

๐ŸŽฒ ์•Œ๊ณ ๋ฆฌ์ฆ˜

๋ชฉ๋ก ๋ณด๊ธฐ
4/5
post-thumbnail

ํŽธ์ง‘ ๊ฑฐ๋ฆฌ

๊ฐœ๋…

In computational linguistics and computer science, edit distance is a string metric(์ธก์ •๋ฒ•), i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.
์ถœ์ฒ˜ : https://en.wikipedia.org/wiki/Edit_distance

ํŽธ์ง‘๊ฑฐ๋ฆฌ๋Š” ๋ง๊ทธ๋Œ€๋กœ ํ•˜๋‚˜์˜ ๋ฌธ์ž์—ด์„ ๋‹ค๋ฅธ ๋ฌธ์ž์—ด๋กœ 'ํŽธ์ง‘'ํ•˜๊ธฐ ์œ„ํ•œ ๊ฑฐ๋ฆฌ=๋น„์šฉ์„ ๊ณ„์‚ฐํ•˜๋ฏ€๋กœ์จ ๋‘ ๋ฌธ์ž์—ด์ด ์–ผ๋งˆ๋‚˜ ๋‹ค๋ฅธ์ง€(=๊ฐ™์€์ง€)๋ฅผ ํŒ๋‹จํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

์ •์˜์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ ๋ฌธ์ž์—ด๋ผ๋ฆฌ์˜ ์œ ์‚ฌ์„ฑ์„ ํŒ๋‹จํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ ์ŠคํŽ ๋ง ์ฒดํฌ๋ฅผ ํ•˜๊ฑฐ๋‚˜, ์ƒ๋ฌผ์ •๋ณดํ•™์—์„œ ์—ผ๊ธฐ์„œ์—ด์˜ ์œ ์‚ฌ์„ฑ์„ ํŒ๋‹จ๋Š”๋ฐ ์“ฐ์ด๊ธฐ๋„ ํ•œ๋‹ค.

์ŠคํŽ ๋ง ์ฒดํฌ์˜ ์ข‹์€ ์˜ˆ.

์ข…๋ฅ˜

ํŽธ์ง‘์˜ ์ข…๋ฅ˜์—๋Š” ํฌ๊ฒŒ 3๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค. ์‚ฝ์ž…(Insertion), ์‚ญ์ œ(Deletion), ๋Œ€์ฒด(Substitution).

  1. Insertion of a single symbol.
    If a=uva = uv, then inserting the symbol xx produces uxvuxv.
    This can also be denoted ฮตโ†’xฮตโ†’x, using ฮตฮต to denote the empty string.
  2. Deletion of a single symbol changes uxvuxv to uv(xโ†’ฮต)uv (xโ†’ฮต).
  3. Substitution of a single symbol xx for a symbol yโ‰ xy โ‰  x changes uxvuxv to uyv(xโ†’y)uyv (xโ†’y).

์ด๋“ค ์ค‘์— ์–ด๋–ค ํŽธ์ง‘์—ฐ์‚ฐ์„ ํ—ˆ์šฉํ•˜๋Š” ์ง€์— ๋”ฐ๋ผ Edit Distance ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ข…๋ฅ˜๊ฐ€ ํŒŒ์ƒ๋œ๋‹ค.

  • The Levenshtein distance allows deletion, insertion and substitution.
  • The longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
  • The Hamming distance allows only substitution, hence, it only applies to strings of the same length.
  • The Damerauโ€“Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters.
  • The Jaro distance allows only transposition.

์ด ๊ธ€์—์„œ๋Š” ์‚ฝ์ž…, ์‚ญ์ œ, ๋Œ€์ฒด๋ฅผ ํ—ˆ์šฉํ•˜๋Š” Levenshtein distance์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ,
์‚ฝ์ž…, ์‚ญ์ œ๋งŒ์„ ํ—ˆ์šฉํ•˜๋Š” longest common subsequence (LCS) distance ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์•Œ์•„๋ณธ๋‹ค.

์˜ˆ์‹œ

๋จผ์ € ๊ฐ„๋‹จํ•œ ํŽธ์ง‘ ์˜ˆ์‹œ๋ฅผ ํ†ตํ•ด ๋‘˜์„ ๋น„๊ตํ•ด๋ณด์ž.
์ผ๋‹จ ์—ฌ๊ธฐ์„œ๋Š”, ๊ฐ๊ฐ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด ์ตœ์†Œํ•œ์˜ ํŽธ์ง‘ํšŸ์ˆ˜๋ฅผ ๊ตฌํ•˜๊ณ ์ž ํ•œ๋‹ค.

The Levenshtein distance
"kitten" โžก๏ธ "sitting" distance is 3.

  1. kitten โ†’ sitten (substitute "s" for "k")
  2. sitten โ†’ sittin (substitute "i" for "e")
  3. sittin(ฮตฮต) โ†’ sitting (insert "g" at the end)

LCS distance
"kitten" โžก๏ธ "sitting" distance is 5.

  1. kitten โ†’ (ฮตฮต)itten (delete "k" at 0)
  2. (ฮตฮต)itten โ†’ sitten (insert "s" at 0)
  3. sitten โ†’ sitt(ฮตฮต)n (delete "e" at 4)
  4. sitt(ฮตฮต)n โ†’ sittin (insert "i" at 4)
  5. sittin(ฮตฮต) โ†’ sitting (insert "g" at 6)

์ด์ฒ˜๋Ÿผ ๋‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅธ๋ฐ, ์“ฐ์ž„์ƒˆ๊ฐ€ ์„œ๋กœ ์กฐ๊ธˆ์”ฉ ๋‹ค๋ฅด๋‹ค.

  • ๋ฆฌ๋ฒค์Šˆํƒ€์ธ์˜ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋Š” edit distance์˜ ๋Œ€ํ‘œ๊ฒฉ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.
    ๊ฐ ์—ฐ์‚ฐ์˜ cost๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ•˜์—ฌ ๊ฐ€์žฅ ๋น„์šฉ์ด ์ ๊ฒŒ ๋“œ๋Š” ํŽธ์ง‘์˜ ๋น„์šฉ=๊ฑฐ๋ฆฌ์„ ๊ตฌํ•  ์ˆ˜๋„ ์žˆ๋‹ค.
  • LCS ํŽธ์ง‘๊ฑฐ๋ฆฌ๋Š” ๋„์ถœ๋œ ํŽธ์ง‘๊ฑฐ๋ฆฌ ๊ทธ ์ž์ฒด๋ณด๋‹ค๋Š”, ํ•ด๋‹น ๊ณผ์ •์„ ํ†ตํ•ด ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๋‘ ๋ฌธ์ž์—ด๊ฐ„์˜ '์ตœ์žฅ ๊ณตํ†ต ๋ถ€๋ถ„ ์ˆ˜์—ด'์˜ ๊ธธ์ด์™€ ๊ทธ ์ˆ˜์—ด์„ ๊ตฌํ• ๋•Œ ์‚ฌ์šฉ๋œ๋‹ค.

๋ฆฌ๋ฒค์Šˆํƒ€์ธ์˜ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ๋จผ์ € ์•Œ์•„๋ณด์ž.

Levenshtein distance

S = 'strong' โžก๏ธ T = 'stone'์˜ ๊ฒฝ์šฐ๋ฅผ ์˜ˆ๋กœ ๋“ค์–ด๋ณด์ž.
๋งŒ์ผ ๊ฐ ์ ‘๋‘๋ถ€(prefix)์— ๋Œ€ํ•ด ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ๋ฏธ๋ฆฌ ์•Œ๊ณ  ์žˆ๋‹ค๋ฉด,
์˜ˆ๋ฅผ ๋“ค์–ด 'stro' โžก๏ธ 'sto'์˜ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ๋ฏธ๋ฆฌ ์•Œ๊ณ  ์žˆ๋‹ค๋ฉด ๋‚จ์€ ๋ฌธ์ž์—ด('ng'์™€ 'ne')์— ๋Œ€ํ•œ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ์ฐพ์Œ์œผ๋กœ์จ, ์ฃผ์–ด์ง„ ์ž…๋ ฅ๊ฐ’ ์ „์ฒด์— ๋Œ€ํ•œ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

์˜ˆ์ƒ๋Œ€๋กœ DP ๋ฅผ ํ†ตํ•ด ๋ถ€๋ถ„๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ด ๋‚˜๊ฐ€๋ฉด์„œ ๊ณ„์‚ฐ๋œ๋‹ค. ๋ถ€๋ถ„๋ฌธ์ œ์˜ ์ •์˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

E[i,j]E[i,j] ๋Š” SS์˜ ์•ž์—์„œ๋ถ€ํ„ฐ ii๊ฐœ์˜ ๋ฌธ์ž๋ฅผ, TT์˜ ์•ž์—์„œ๋ถ€ํ„ฐ jj๊ฐœ์˜ ๋ฌธ์ž๋กœ ๋ณ€ํ™˜์‹œํ‚ค๋Š” ๋ฐ ํ•„์š”ํ•œ ์ตœ์†Œ ํŽธ์ง‘๊ฑฐ๋ฆฌ์ด๋‹ค.

E[4,3]E[4,3] ('stro' โžก๏ธ 'sto') ๋ฅผ ๊ตฌํ•˜๋ ค๊ณ  ํ•œ๋‹ค๋ฉด ๋‹ค์Œ์˜ 3๊ฐ€์ง€ ๊ฒฝ์šฐ๋ฅผ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ๋‹ค.

  1. E[4,2]E[4,2] ('stro' โžก๏ธ 'st')์˜ ๊ฐ’์„ ์•Œ๊ณ ์žˆ๋‹ค๋ฉด, ๋‹ค์‹œ ๋งํ•ด 'stro'๋ฅผ 'st'๋กœ ๋ฐ”๊พธ๋Š” ์ตœ์†Œ์˜ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ์ด๋ฏธ ์•Œ๊ณ ์žˆ๋‹ค๋ฉด, ๊ทธ ํŽธ์ง‘๊ฑฐ๋ฆฌ์— ์‚ฝ์ž… ์—ฐ์‚ฐ์˜ ๋น„์šฉ ํ•˜๋‚˜๋งŒ ์ถ”๊ฐ€ํ•ด์ฃผ๋ฉด ๋œ๋‹ค.
    'stro' โžก๏ธ 'st' ํ•œ ํ›„, T์ชฝ์— 'o'ํ•˜๋‚˜๋งŒ ์‚ฝ์ž…ํ•ด ์ฃผ๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

  2. E[3,3]E[3,3] ('str' โžก๏ธ 'sto')์˜ ๊ฐ’์„ ์•Œ๊ณ ์žˆ๋‹ค๋ฉด, ๋‹ค์‹œ ๋งํ•ด 'str'๋ฅผ 'sto'๋กœ ๋ฐ”๊พธ๋Š” ์ตœ์†Œ์˜ ํŽธ์ง‘๊ฑฐ๋ฆฌ๋ฅผ ์ด๋ฏธ ์•Œ๊ณ ์žˆ๋‹ค๋ฉด, ๊ทธ ํŽธ์ง‘๊ฑฐ๋ฆฌ์— ์‚ญ์ œ ์—ฐ์‚ฐ์˜ ๋น„์šฉ ํ•˜๋‚˜๋งŒ ์ถ”๊ฐ€ํ•ด์ฃผ๋ฉด ๋œ๋‹ค.
    S์ชฝ 'stro'์˜ 'o'ํ•˜๋‚˜๋งŒ ๋นผ์ค€ ์ƒํƒœ์—์„œ 'str' โžก๏ธ 'sto' ํ•˜๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

  3. E[3,2]E[3,2] ('str' โžก๏ธ 'st')์˜ ๊ฐ’์„ ์•Œ๊ณ ์žˆ๋‹ค๋ฉด, ๊ธฐ์กด S์˜ ๋๋ฌธ์ž๋ฅผ ๊ธฐ์กด T์˜ ๋๋ฌธ์ž๋กœ ๋Œ€์ฒดํ•˜๋Š” ํŽธ์ง‘ ๋น„์šฉ์„ ์ถ”๊ฐ€ํ•ด์ฃผ๋ฉด๋œ๋‹ค. 'str' โžก๏ธ 'st' ํ•˜๊ณ  ํ•ด๋‹น ๋Œ€์ฒด๋ฅผ ์ง„ํ–‰ํ•˜๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” 'o'๋ฅผ 'o'๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋น„์šฉ, ์ฆ‰ 0์„ ์ถ”๊ฐ€ํ•˜๊ฒŒ ๋œ๋‹ค. ๊ฐ€๋งŒํžˆ ๋‘ฌ๋„ ๋˜๊ธฐ ๋•Œ๋ฌธ.

์ด๋ฅผ ๋„์‹ํ™” ํ•ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

์ด์ œ ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ ์ด์šฉํ•ด dpํ…Œ์ด๋ธ”์„ ์ฑ„์›Œ ๋‚˜๊ฐ€๋ฉด์„œ ์ „์ฒด ์ž…๋ ฅ์— ๋Œ€ํ•œ ํ•ด๋ฅผ ๊ตฌํ•ด๋ณด์ž.
์•„๋ž˜๋Š” ์ ํ™”์‹๊ณผ ์ „์ฒด์ ์ธ ์ฝ”๋“œ์˜ ์ง„ํ–‰์ด๋‹ค.
์ฒ˜์Œ ํ…Œ์ด๋ธ”์˜ ์ดˆ๊ธฐํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ•˜๋ฉด ๋œ๋‹ค.

S = ๊ธฐ์กด ๋ฌธ์ž์—ด, T = ๋ชฉํ‘œ ๋ฌธ์ž์—ด, m = S์˜ ๊ธธ์ด, n = T์˜ ๊ธธ์ด

  • ๐ธ[๐‘–,0] = ๐‘–, ๐ธ[0,๐‘—] = ๐‘— for ๐‘– = 0,1,โ€ฆ,๐‘š and ๐‘— = 0,1,โ€ฆ,๐‘›
  • For ๐‘– = 1,2,โ€ฆ,๐‘š and ๐‘— = 1,2,โ€ฆ,๐‘›
    ๐ธ[๐‘–, ๐‘—] = min{ ๐ธ[๐‘– - 1, ๐‘—] + 1, ๐ธ[๐‘–, ๐‘— - 1] + 1, ๐ธ[๐‘– โˆ’ 1, ๐‘— โˆ’ 1] + ๐›ผ}
    where ๐›ผ = 1 if S๐‘–=T๐‘—S_๐‘–=T_๐‘— 0 otherwise

    (์•„๋ž˜๋Š” ๋Œ€์ฒด์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋ฉด ๋‹ค๋ฅธ ๋ฐฉํ–ฅ๋“ค๊ณผ์˜ ๋น„๊ต ์—†์ด ๋ฌด์กฐ๊ฑด ๋Œ€์ฒด์—ฐ์‚ฐ์œผ๋กœ ์ง„ํ–‰ํ•˜๋Š” ์ ํ™”์‹. ์ด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด LCS์™€์˜ ํ˜ธํ™˜์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ์ฆ‰ ๋ฆฌ๋ฒค์Šˆํƒ€์ธ์œผ๋กœ ๊ณ„์‚ฐ์„ ํ•˜๊ณ  back-tracing์œผ๋กœ LCS์˜ ๊ธธ์ด๋ฅผ ์˜ค๋‹ต์—†์ด ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.)
for i 0 to m E[i, 0] = i 	// 0 ๋ฒˆ ์—ด์˜ ์ดˆ๊ธฐํ™”
for j= 0 to n E[0 ,j] = j 	// 0 ๋ฒˆ ํ–‰์˜ ์ดˆ๊ธฐํ™”

for i 1 to m
	for j= 1 to n
		E[i, j] = min{E[i,j-1] + 1 , E[i-1 ,j] + 1 , E[i-1,j-1] + a}
        									
return E[m, n]

dp

m*n ํฌ๊ธฐ์˜ ํ…Œ์ด๋ธ”์„ ํ•˜๋‚˜์”ฉ ์ฑ„์›Œ๋‚˜๊ฐ€์•ผ ํ•˜๋ฏ€๋กœ ์‹œ๊ฐ„ ๋ณต์žก๋„๋Š” ฮ˜(mn)ฮ˜(mn)์ด ๋œ๋‹ค.

ฮตฮต์—์„œ T์˜ ๊ฐ ๋ถ€๋ถ„ ๋ฌธ์ž์—ด('s','st, ... ,'stone')๋กœ ํŽธ์ง‘ํ•˜๋Š” ๋น„์šฉ์€ ๊ทธ๋ƒฅ ๊ทธ ๋ฌธ์ž์—ด์˜ ๊ธธ์ด๋งŒํผ ์‚ฝ์ž…ํ•ด์ฃผ๋Š” ๋น„์šฉ์ด๋‹ค. ๋”ฐ๋ผ์„œ j = 1,2,...,T์˜ ๊ธธ์ด ์ผ๋•Œ E[0,j] = j ์ด๋‹ค.
๋งˆ์ฐฌ๊ฐ€์ง€๋กœ S์˜ ๊ฐ ๋ถ€๋ถ„ ๋ฌธ์ž์—ด์—์„œ ฮตฮต๋กœ ํŽธ์ง‘ํ•˜๋Š” ๋น„์šฉ์€ ๊ทธ ๋ฌธ์ž์—ด์˜ ๊ธธ์ด๋งŒํผ ์‚ญ์ œ ํ•ด์ฃผ๋Š” ๋น„์šฉ์ด๋‹ค. ๋”ฐ๋ผ์„œ i = 1,2,...,S์˜ ๊ธธ์ด ์ผ๋•Œ E[i,0] = i ์ด๋‹ค.

์ดํ›„์— ์ ํ™”์‹์— ๋”ฐ๋ผ ํ…Œ์ด๋ธ”์„ ์ฑ„์›Œ๋‚˜๊ฐ€๋‹ค๋ณด๋ฉด, dp[m][n]์ด ์ „์ฒด ์ž…๋ ฅ๊ฐ’์— ๋Œ€ํ•œ ์ตœ์†Œ ํŽธ์ง‘ ๊ฑฐ๋ฆฌ๊ฐ€ ๋œ๋‹ค.

back-tracing

์ตœ์†Œ ํŽธ์ง‘๊ฑฐ๋ฆฌ ๊ฐ’ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ๊ฐ ๊ณผ์ •์—์„œ์˜ ์—ฐ์‚ฐ์˜ ์ข…๋ฅ˜๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‹ค๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ตฌํ•œ ํ›„ ๊ณ„์‚ฐ ๊ณผ์ •์„ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๋ฉด์„œ(back-tracing) ๊ฐ ์—ฐ์‚ฐ์˜ ์ข…๋ฅ˜๋ฅผ ํŒŒ์•…ํ•ด์•ผ ํ•œ๋‹ค.

์ตœ์ข… ์ธ๋ฑ์Šค (m,n)๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด ์ด์ „์˜ ์„ธ ๋ฐฉํ–ฅ ์ค‘์—์„œ ์ตœ์†Œ๊ฐ’์ด ์–ด๋Š ๋ฐฉํ–ฅ์œผ๋กœ๋ถ€ํ„ฐ ์™”๋Š”์ง€๋ฅผ ์ฒดํฌํ•˜๋ฉด์„œ ์—ฐ์‚ฐ์˜ ์ข…๋ฅ˜๋ฅผ ์•Œ์•„๋‚ธ๋‹ค.

์•„๋ž˜๋Š” ์ด์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ์—ฌ๋Ÿฌ ํ…Œ์ŠคํŠธ์ผ€์ด์Šค์˜ ์ตœ์†Œ ํŽธ์ง‘ ๋น„์šฉ๊ณผ ์—ฐ์‚ฐ ๊ณผ์ •์„ ๋‚˜ํƒ€๋‚ธ ๊ฒฐ๊ณผ์ด๋‹ค.

LCS(Longest Common Subsequence) distacne

๋ฆฌ๋ฒค์Šˆํƒ€์ธ์˜ ํŽธ์ง‘๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ๊ณผ์˜ ์ฐจ์ด์ ์€ ๋Œ€์ฒด์—ฐ์‚ฐ์„ ํ—ˆ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์ด๋‹ค.
ํ•˜์ง€๋งŒ ๋Œ€์ฒด์—ฐ์‚ฐ์„ ํ—ˆ์šฉํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ด ์—ฐ์‚ฐ์˜ ์ข…๋ฅ˜๋ฅผ ํ•œ์ •์‹œํ‚ค๊ธฐ ์œ„ํ•จ์ด๋ผ๊ธฐ ๋ณด๋‹ค,
LCS distacne๋ฅผ ํ†ตํ•ด ์–ป๊ณ ์ž ํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋Œ€์ฒด์—ฐ์‚ฐ์„ ํŠน๋ณ„์ทจ๊ธ‰ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
LCS distacne๋ฅผ ํ†ตํ•ด ์–ป๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์€ '์ตœ์žฅ ๊ณตํ†ต ๋ถ€๋ถ„ ์ˆ˜์—ด์˜ ๊ธธ์ด'์™€ ๊ทธ '์ˆ˜์—ด'์ด๋‹ค.
๋”ฐ๋ผ์„œ ๋ฆฌ๋ฒค์Šˆํƒ€์ธ ํŽธ์ง‘๊ฑฐ๋ฆฌ์˜ ๊ณผ์ •๊ณผ ๋งค์šฐ ์œ ์‚ฌํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋ฉด์„œ๋„,
'๋น„์šฉ์ด ๋“ค์ง€ ์•Š๋Š” ๋Œ€์ฒด์—ฐ์‚ฐ'์˜ ๊ฐœ์ˆ˜์™€, ํ•ด๋‹น ์—ฐ์‚ฐ์˜ ๋Œ€์ƒ์ด ๋˜์—ˆ๋˜ ๋ฌธ์ž๋ฅผ ์•Œ์•„๋‚ด๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์ด ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ชฉํ‘œ๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

์ ํ™”์‹์„ ๋จผ์ € ์‚ดํŽด๋ณด์ž.

์œ„์—์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ, ํŽธ์ง‘๊ฑฐ๋ฆฌ ์ž์ฒด๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์ฃผ ๋ชฉ์ ์ด ์•„๋‹ˆ๋ฏ€๋กœ, ํ—ˆ์šฉ๋œ ์—ฐ์‚ฐ(์‚ฝ์ž…, ์‚ญ์ œ)์— ๋Œ€ํ•œ ๋น„์šฉ ๊ณ„์‚ฐ์ด ์—†๋‹ค. ๋‹ค๋งŒ xi=yix_i = y_i์˜ ๊ฒฝ์šฐ, ์ฆ‰ '๋น„์šฉ์ด ๋“ค์ง€ ์•Š๋Š” ๋Œ€์ฒด์—ฐ์‚ฐ'์ด ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ์— 1์„ ๋”ํ•ด์คŒ์œผ๋กœ์จ ๊ณตํ†ต ๋ถ€๋ถ„ ์ˆ˜์—ด์˜ ๊ธธ์ด๋ฅผ 1๋Š˜๋ฆฌ๊ณ , xi=yix_i = y_i์˜ ๊ฒฝ์šฐ ํ•ด๋‹น ์—ฐ์‚ฐ์ด ๋ถˆ๊ฐ€๋Šฅ ํ•˜๋ฏ€๋กœ, ์‚ฝ์ž… ์‚ญ์ œ ์ค‘์—์„œ ํ•˜๋‚˜๋ฅผ ๊ณ ๋ฅด๋Š”๋ฐ, ์ด ์ค‘ max๊ฐ’์„ ์„ ํƒํ•ด ๊ณตํ†ต ๋ถ€๋ถ„ ์ˆ˜์—ด์˜ ๊ธธ์ด๊ฐ€ ์ตœ์žฅ์œผ๋กœ ์œ ์ง€๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
์•„๋ž˜๋Š” ์ด์— ๋Œ€ํ•œ ์˜์‚ฌ์ฝ”๋“œ.

function LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
        C[i,0] = 0
    for j := 0..n
        C[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]
                C[i,j] := C[i-1,j-1] + 1
            else
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

dp

์ด๋Ÿฌํ•œ ์›๋ฆฌ๋กœ R = 'GAC', C = 'AGCAT' ์— ๋Œ€ํ•œ LCS๋ฅผ ๊ตฌํ•œ๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ,
์™„์„ฑ๋œ dpํ…Œ์ด๋ธ”์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

ํ…Œ์ด๋ธ” ๋งˆ์ง€๋ง‰ ์…€์— ์žˆ๋Š” 2๊ฐ€ LCS์˜ ๊ธธ์ด๊ฐ€ ๋œ๋‹ค.

back-tracing

๊ทธ ์ˆ˜์—ด์„ ์ง์ ‘ ๊ตฌํ•˜๊ณ ์ž ํ• ๋•Œ๋Š”, ์—ญ์‹œ back-tracing์„ ์ด์šฉํ•œ๋‹ค.


'๋น„์šฉ์ด ๋“ค์ง€ ์•Š๋Š” ๋Œ€์ฒด ์—ฐ์‚ฐ'์ด ํ˜„์žฌ ์ธ๋ฑ์Šค์—์„œ ๊ฐ€๋Šฅํ•œ์ง€ ํ™•์ธํ•œ๋‹ค. ์œ„์—์„œ๋Š” ์ฃผํ™ฉ์ƒ‰์œผ๋กœ ํ‘œ์‹œ๋œ ์ธ๋ฑ์Šค๊ฐ€ ํ•ด๋‹น๋œ๋‹ค.
๊ฐ€๋Šฅํ•˜๋‹ค๋ฉด LCS์— ์ถ”๊ฐ€ํ•˜๊ณ , ๋ถˆ๊ฐ€ํ•˜๋‹ค๋ฉด ์‚ฝ์ž…(ํ•œ์นธ ์™ผ์ชฝ), ์‚ญ์ œ(ํ•œ์นธ ์œ„)์ค‘ ์ตœ๋Œ€๊ฐ’์„ ์ฐพ์•„ ๋‹ค์‹œ ์ง„ํ–‰ํ•œ๋‹ค.
๋‹จ, LCS๊ฐ€ ์—ฌ๋Ÿฌ๊ฐœ ์ผ ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜์˜ LCS๋ฅผ ์ฝ์–ด๋‚ผ ๋•Œ์™€, ๋ชจ๋“  LCS๋ฅผ ์ „๋ถ€ ์ฐพ์„ ๋•Œ์˜ ์ฝ”๋“œ๊ฐ€ ๋‹ค๋ฅด๋‹ค.

  • ํ•˜๋‚˜๋งŒ ์ฝ์–ด๋‚ด๋Š” ๊ฒฝ์šฐ
    ํ˜„์žฌ ์ธ๋ฑ์Šค์˜ ๋‘ ๋ฌธ์ž (X[i]์™€ Y[j])๊ฐ€ ๊ฐ™๋‹ค๋ฉด( = '๋น„์šฉ์ด ๋“ค์ง€ ์•Š๋Š” ๋Œ€์ฒด ์—ฐ์‚ฐ'์ด ํ˜„์žฌ ์ธ๋ฑ์Šค์—์„œ ๊ฐ€๋Šฅํ•˜๋‹ค๋ฉด) ๊ทธ ๋ฌธ์ž๋Š” LCS์•ˆ์— ํฌํ•จ๋˜์–ด ์žˆ์„ ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๊ฒฐ๊ณผ ๊ฐ’์— ์ถ”๊ฐ€ํ•ด์ค€๋‹ค.
    ๋‘ ๋ฌธ์ž๊ฐ€ ๊ฐ™์ง€ ์•Š๋‹ค๋ฉด, ์œ„์ชฝ ๊ฐ’๊ณผ ์•„๋žซ์ชฝ ๊ฐ’ ์ค‘ ํฐ ๊ฐ’์„ ์ฐพ์•„์„œ LCS์˜ ๊ธธ์ด๊ฐ€ ์ตœ๋Œ€๊ฐ€ ๋˜๊ฒŒ ์œ ์ง€ํ•ด์ฃผ๋ฉด์„œ ์ด๋™ํ•œ๋‹ค. ์œ„์ชฝ ๊ฐ’๊ณผ ์•„๋žซ์ชฝ ๊ฐ’์ด ๊ฐ™๋‹ค๋ฉด LCS๊ฐ€ ์—ฌ๋Ÿฌ๊ฐœ ์กด์žฌํ•œ๋‹ค๋Š” ์‹ ํ˜ธ์ธ๋ฐ, ์ง€๊ธˆ์€ ํ•˜๋‚˜๋งŒ ์ฝ์–ด๋‚ด๋Š” ๊ฒฝ์šฐ ์ด๋ฏ€๋กœ ์•„๋ฌด๊ฑฐ๋‚˜ ์„ ํƒํ•˜๋ฉด ๋œ๋‹ค. ์œ„์˜ ๊ทธ๋ฆผ์—์„œ๋Š” bold์ฒ˜๋ฆฌ๋œ path๊ฐ€ ํ˜„์žฌ ์ฝ์–ด๋‚ด๋Š” ํ•˜๋‚˜์˜ LCS์ด๋‹ค.
function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    if  X[i] = Y[j]
        return backtrack(C, X, Y, i-1, j-1) + X[i]
    if C[i,j-1] > C[i-1,j]
        return backtrack(C, X, Y, i, j-1)
    return backtrack(C, X, Y, i-1, j)
  • ๋ชจ๋“  LCS๋ฅผ ์ฝ์–ด๋‚ด๋Š” ๊ฒฝ์šฐ
    ํ•˜๋‚˜๋งŒ ์ฝ์–ด๋‚ด๋Š” ๊ฒฝ์šฐ์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ,
    X[i]์™€ Y[i]๊ฐ€ ๊ฐ™์ง€ ์•Š์€ ์ƒํ™ฉ์—์„œ ์œ„์ชฝ ๊ฐ’๊ณผ ์•„๋žซ์ชฝ ๊ฐ’์ด ๊ฐ™๋‹ค๋ฉด, ๋‘˜๋‹ค ํƒ์ƒ‰ํ•˜๋Ÿฌ ๋“ค์–ด๊ฐ์œผ๋กœ์จ ๋ชจ๋“  LCS๋ฅผ ์ฝ์–ด๋‚ธ๋‹ค. ์ง€์ˆ˜์‹œ๊ฐ„์ด ๊ฑธ๋ฆด ์ˆ˜ ์žˆ๋‹ค.
function backtrackAll(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return {""}
    if X[i] = Y[j]
        return {Z + X[i] for all Z in backtrackAll(C, X, Y, i-1, j-1)}
    R := {}
    # C[i-1,j] == C[i,j-1]์ธ ๊ฒฝ์šฐ ์•„๋ž˜์˜ a,b ๋‘˜๋‹ค ๋งŒ์กฑํ•ด ๋‘˜๋‹ค ์‹คํ–‰๋œ๋‹ค.
    # ํ•œ์ชฝ์ด ํด๊ฒฝ์šฐ a์™€ b ์ค‘ ํ•˜๋‚˜๋งŒ ์‹คํ–‰๋œ๋‹ค.
    if C[i,j-1] โ‰ฅ C[i-1,j]						# a
        R := backtrackAll(C, X, Y, i, j-1)
    if C[i,j-1] โ‰ค C[i-1,j]						# b
        R := R โˆช backtrackAll(C, X, Y, i-1, j)	# a๊ฐ€ ์‹คํ–‰๋˜์—ˆ์„ ๋•Œ๋ฅผ ๋Œ€๋น„ํ•˜์—ฌ R๊ณผ ํ•ฉ์ณ์ค€๋‹ค.
        									
    return R

์ฐธ๊ณ ์‚ฌ์ดํŠธ

https://en.wikipedia.org/wiki/Edit_distance
https://en.wikipedia.org/wiki/Longest_common_subsequence_problem

profile
I think I think too much.

0๊ฐœ์˜ ๋Œ“๊ธ€