Different embeddings + LDA + Jensen-Shannon distance 😊

jj · February 26, 2021

SS-hashtag-recommendation-project (part 4 of 15)

LDA with Jensen-Shannon distance:

LDA has many uses:

  • Understanding the different varieties of topics in a corpus (obviously)
  • Getting better insight into the types of documents in a corpus (whether they are news, Wikipedia articles, or business documents)
  • Quantifying the most used / most important words in a corpus
  • Document similarity and recommendation

Latent Dirichlet Allocation (LDA):

An unsupervised generative model that assigns topic distributions to documents.

  • At a high level, the model assumes that each document contains several topics, so topics can overlap across documents. → Likewise, some of the same words will be shared across topics.
  • The words in each document determine that document's topics. The topics do not need to be defined in detail, but the number of topics must be fixed in advance.

The model generates two latent (hidden) variables:

(1) A distribution over topics for each document

(2) A distribution over words for each topic

After training, each document has a discrete distribution over all topics, and each topic has a discrete distribution over all words.
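
Because every document ends up as a discrete distribution over topics, two documents can be compared by the Jensen-Shannon distance between those distributions. Below is a minimal sketch in plain Python. The function names and the three example topic vectors are made up for illustration and do not come from the project itself.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    # m is the midpoint distribution; m_i is zero only if both p_i and q_i are zero
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(jsd, 0.0))

# Hypothetical document-topic distributions over 3 topics (illustrative values only)
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]
doc_c = [0.05, 0.05, 0.90]

dist_ab = jensen_shannon_distance(doc_a, doc_b)
dist_ac = jensen_shannon_distance(doc_a, doc_c)
print(dist_ab, dist_ac)  # doc_b should come out closer to doc_a than doc_c does
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (no overlap at all), so a smaller value means more similar documents. Unlike plain KL divergence it is symmetric, which makes it convenient for ranking recommendation candidates.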

Collapsed Gibbs sampling

http://geference.blogspot.com/2011/11/blog-post_30.html
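
Collapsed Gibbs sampling is one common way to fit LDA: the two latent distributions are integrated out, and only the topic assignment of each word is resampled, conditioned on all the other assignments. The sketch below is a rough illustration, assuming symmetric Dirichlet priors. The function name `lda_collapsed_gibbs`, the hyperparameter defaults, and the toy corpus of word ids are all illustrative assumptions, not the project's code.

```python
import random

def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # total words assigned to topic k
    assignments = []

    # Random initial topic assignment for every word position
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this word
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Sample a new topic and put the word back into the counts
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of the two latent distributions from the final counts
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
           for k in range(num_topics)]
    return theta, phi

# Toy corpus: 4 documents over a 5-word vocabulary (word ids 0..4), made up for illustration
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
theta, phi = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=5, iters=50)
print(theta)  # one topic distribution per document
print(phi)    # one word distribution per topic
```

Here `theta` corresponds to latent variable (1), the per-document topic distribution, and `phi` to latent variable (2), the per-topic word distribution. Rows of `theta` are what a Jensen-Shannon comparison between documents would consume.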
