🤬 Furious: a wordcloud of reactions to the Great Escape 4 (대탈출4) premiere

아무개씨 · July 12, 2021
Edited (2021-07-23)

※※ Contains spoilers for episode 1 of Great Escape. ※※

github : https://github.com/KeunhoLee/dtcu4_first_ep

I'm done eating Norang Tongdak (노랑통닭).

The first episode of Great Escape season 4 was a flop.

What I watched was a 90-minute Norang Tongdak commercial that a talented PD poured his heart into.

And I waited a whole year after season 3 ended...

I wondered what everyone else thought, so I decided to start with Twitter reactions, since that data is easy to collect.

Collecting Great Escape tweets from Twitter

The R package rtweet makes it easy to collect tweets by keyword.

Its search_tweets function runs the keyword search in a single call.

library("rtweet")

TWEET_N <- 18000
HASHTAG <- "대탈출"

rt <- search_tweets(HASHTAG,
                    n = TWEET_N,          # number of tweets to collect (max 18000)
                    include_rts = FALSE)  # FALSE excludes retweets, TRUE includes them

Running the code above prompts for account authentication.

Logging in through the browser every time is a hassle, so apply for API access at https://developer.twitter.com/en, then create a token and save it. (1. create_token.R in the github repo)

TOKEN_NAME <- "./token/twitter_token.rds"
TWEET_N <- 18000
HASHTAG <- "대탈출"

twitter_token <- readRDS(TOKEN_NAME)

rt <- search_tweets(HASHTAG,
                    n=TWEET_N,
                    include_rts=FALSE,
                    token = twitter_token)

This way there is no need to authenticate every time.

At first I searched for "#대탈출", but that returned too few tweets, so I dropped the hashtag and collected with the plain keyword "대탈출".

The collected data has a huge number of columns, but only the tweet body is needed here, so I keep just the text column and assign an id to each tweet for the semantic network drawn later.

The Great Escape tweets were collected successfully, but they contain a lot of unneeded junk such as emoji, links, and special characters.

Let's clean those up now.

Text preprocessing

1. Keeping only the text we need

The extracted data contained a lot of unnecessary text, so I removed all of it.

Removed

  • URLs
  • Mentions of other users (@)
  • Emoji (things like 😁)
  • Everything except complete Hangul syllables and English letters

Additional processing

  • Strip leading and trailing whitespace
  • Collapse runs of two or more spaces into one
  • Convert all English to uppercase

library("dplyr")  # provides %>% and mutate()

# Each helper strips one kind of noise from a character vector.
rmURLs <- function(x) { gsub("(f|ht)tp(s?)://\\S+", "", x, perl=TRUE) }
rmTag <- function(x) { gsub("(@[A-Za-z가-힣0-9_]+)", "", x, perl=TRUE) }
rmEmoji <- function(x) { gsub("[\U00010000-\U0010FFFF]+", "", x, perl=TRUE) }
toSpace <- function(x, pattern) { gsub(pattern, " ", x) }

preprocess_text <- function(text_df) {

  text_df %>%
    mutate(text=rmURLs(text),
           text=rmTag(text),
           text=rmEmoji(text),
           text=toSpace(text, "\n"),
           text=toSpace(text, "[^가-힣A-Za-z]"),  # keep complete Hangul and English only
           text=gsub(" +", " ", text),            # collapse repeated spaces
           text=trimws(text),
           text=toupper(text))

}

texts <- preprocess_text(texts)

Much cleaner!
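For readers more at home in Python, the same cleaning steps can be sketched with the standard `re` module. This is an illustrative port rather than code from the repository: the function name `clean_tweet` and the sample tweet are made up for the example.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps listed above to a single tweet (illustrative port)."""
    text = re.sub(r"(f|ht)tps?://\S+", "", text)          # URLs
    text = re.sub(r"@[A-Za-z가-힣0-9_]+", "", text)        # @mentions
    text = re.sub(r"[\U00010000-\U0010FFFF]+", "", text)  # emoji outside the BMP
    text = re.sub(r"[^가-힣A-Za-z]", " ", text)            # keep complete Hangul and English letters
    text = re.sub(r" +", " ", text).strip()               # collapse and trim spaces
    return text.upper()

# Hypothetical tweet, invented for this example.
sample = "@someone 대탈출 PPL 실화냐 😡ㅋㅋ https://t.co/abc123"
print(clean_tweet(sample))
```

Note that stray jamo such as ㅋㅋ fall outside the 가-힣 range, so they are dropped together with digits and punctuation.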

2. Extracting words (user dictionary)

Running morphological analysis right after this cleaning ran into a problem.

Words missing from the dictionary, such as 대탈출 (Great Escape), 여고추리반 (High School Mystery Club), 노랑통닭 (Norang Tongdak), and PPL, were not analyzed properly.

The usual fix is to register the show's vocabulary in the analyzer's user dictionary, but that is a real chore... I didn't want to sink too much effort into something I'm doing for fun, so I searched around, found an interesting Python library, and gave it a try.

The soynlp package

It is a Python library that reads a given set of documents and extracts the words in them that are likely to be nouns.

In this case it pulled out quite a few words related to the show.

It isn't a perfect user dictionary, but the result was usable enough, and since this is just for fun I left it at that.

Only this part is implemented in Python. From the extracted words I hand-picked the ones that looked useful, added little more than the cast members' names, saved the list to user_dict.txt, and applied it to the morphological analyzer. (word_extraction.ipynb)

If I had known this existed, I would have done the whole thing in Python from the start.

import math

from soynlp.word import WordExtractor

corpus_fname = 'twitter_text.txt'

with open(corpus_fname, 'r', encoding="cp949") as f:
    lines = f.readlines()

# Strip newlines and quotes, uppercase everything, and skip the first two lines of the file.
lines = [s.replace("\n", "").replace('"', "").upper() for s in lines][2:]

word_extractor = WordExtractor(
    max_left_length=6,
    min_frequency=10,
    min_cohesion_forward=0.02,
    min_right_branching_entropy=0.0
)

word_extractor.train(lines)
words = word_extractor.extract()

def word_score(score):
    # Rank candidates by cohesion weighted by exp(branching entropy).
    return (score.cohesion_forward * math.exp(score.right_branching_entropy))

# Print the 100 highest-scoring candidates.
print('word   (frequency, cohesion, branching entropy)\n')
for word, score in sorted(words.items(), key=lambda x: word_score(x[1]), reverse=True)[:100]:
    print('%s     (%d, %.3f, %.3f)' % (
            word,
            score.leftside_frequency,
            score.cohesion_forward,
            score.right_branching_entropy
            )
         )

Words that are not in the dictionary get extracted, such as 대탈출 (Great Escape), 스케일 (scale), 노랑통닭 (Norang Tongdak), 타임머신 (time machine), 피피엘 (PPL), 치킨 (chicken), and 시즌 (season).

The morphological analysis that followed also gave decent results.
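To get a feel for why the cohesion × exp(branching entropy) score favors whole words, here is a toy re-implementation of the two statistics in pure Python. It is a simplified sketch based on my understanding of the definitions (cohesion as the geometric mean of character transition probabilities over token prefixes, branching entropy as the entropy of whatever follows a prefix); soynlp's own counting differs in its details, and the four-sentence corpus and all function names here are made up.

```python
import math
from collections import Counter, defaultdict

def prefix_stats(sentences, max_len=6):
    """Count token prefixes and the characters that follow each prefix."""
    prefix_count = Counter()
    next_chars = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            for end in range(1, min(len(token), max_len) + 1):
                prefix = token[:end]
                prefix_count[prefix] += 1
                # An empty string marks "the token ends here".
                next_chars[prefix][token[end:end + 1]] += 1
    return prefix_count, next_chars

def cohesion_forward(word, prefix_count):
    """Geometric mean of the character-to-character transition probabilities."""
    if len(word) < 2 or prefix_count[word] == 0:
        return 0.0
    return (prefix_count[word] / prefix_count[word[0]]) ** (1 / (len(word) - 1))

def right_branching_entropy(word, next_chars):
    """Shannon entropy (natural log) of whatever follows `word`."""
    counts = next_chars[word]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())

def toy_word_score(word, prefix_count, next_chars):
    """Same combination as word_score above: cohesion * exp(entropy)."""
    return cohesion_forward(word, prefix_count) * math.exp(
        right_branching_entropy(word, next_chars))

# Made-up corpus for illustration only.
corpus = ["대탈출 재밌다", "대탈출은 망했다", "대탈출이 돌아왔다", "대학교 간다"]
prefix_count, next_chars = prefix_stats(corpus)
for candidate in ["대탈", "대탈출"]:
    print(candidate, round(toy_word_score(candidate, prefix_count, next_chars), 3))
```

In this corpus 대탈 is always followed by 출, so its branching entropy is zero, while 대탈출 is followed by several different things (a particle or the end of the token). That is why the full word scores higher than the fragment.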

3. Morphological analysis

For the morphological analyzer I used the Elbird package, which works easily with tidytext and is simple to install.

library("dplyr")     # %>%
library("tidyr")     # separate()
library("tidytext")
library("Elbird")

tokenize_text <- function(text_df) {
  
  text_df %>% 
    unnest_tokens(
      input = text,
      output = word,
      token = analyze_tidy
    ) %>%
    separate(word, sep="/", into=c("word", "morph"))
  
}

read_user_dict("./user_dict.txt")
words <- tokenize_text(texts)

That gives a data frame of words split by morpheme!!

Morphemes like ㄴ까, ㄹ까, and 에 are not needed, so I keep only the parts of speech I care about and drop everything else.

  • Append 다 to adjective stems so they read better
    ex) "재미있" -> "재미있다"
  • The adjective stem 같 showed up far too often without carrying much meaning, so I removed it. (Many people end sentences with "같다", as in 별로인 것 같은데... ("seems kind of bad..."))

Then I count the remaining words and sort them by frequency in descending order.

target_morph <- c("nng", "nnp", "va", "xr", "sl", "@")

# Lookup vector: names are the original words, values are their canonical synonyms.
synonym <- data.table::fread("synonym.csv", encoding = "UTF-8")
synonym_dict <- setNames(synonym$synonym, synonym$og)

processed_word <- words %>%
  filter(morph %in% target_morph,
         word!="같") %>%                                   # drop the filler stem
  mutate(word=ifelse(morph=="va", paste0(word,"다"), word),  # adjective stem -> dictionary form
         word=ifelse(word %in% names(synonym_dict),
                     synonym_dict[word],
                     word),
         word=toupper(word)) %>%
  count(word) %>%
  filter(nchar(word)>1,                                    # drop single-character words
         n>10) %>%                                         # keep words seen more than 10 times
  rename(freq=n) %>%
  arrange(desc(freq))
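The filter-and-count logic is easier to see on a handful of tokens. Below is a small Python sketch of the same steps (synonym mapping left out for brevity). The function name `count_words` and the sample tokens with their tags are invented for illustration and are not output from Elbird.

```python
from collections import Counter

# Same tags as target_morph above.
TARGET_MORPH = {"nng", "nnp", "va", "xr", "sl", "@"}

def count_words(tokens, min_count=10):
    """Filter (word, morph) pairs and count the survivors, most frequent first."""
    counts = Counter()
    for word, morph in tokens:
        if morph not in TARGET_MORPH or word == "같":
            continue                  # uninteresting tag, or the filler stem
        if morph == "va":
            word += "다"              # adjective stem -> dictionary form
        word = word.upper()
        if len(word) > 1:             # drop single-character words
            counts[word] += 1
    return [(w, n) for w, n in counts.most_common() if n > min_count]

# Hypothetical tokens, invented for this example.
tokens = [("대탈출", "nnp"), ("재미있", "va"), ("는", "etm"), ("ppl", "sl"),
          ("대탈출", "nnp"), ("같", "va"), ("것", "nnb")]
print(count_words(tokens, min_count=0))
```

With `min_count=0` nothing is cut by the frequency threshold, so the tiny example still produces output; the default of 10 mirrors the `n>10` filter above.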

Drawing the wordcloud

From here, drawing the wordcloud is very simple.

library(wordcloud2)

SOURCE_NAME <- "twitter"

# run preprocess.R first
processed_word$freq[1] <- min(processed_word$freq[1],
                              processed_word$freq[2]*2)

wc <- wordcloud2(
  processed_word,
  size=1.5,
  color = c("black",
            sample(
              rep_len(gray.colors(20, start = 0, end = .4),
                      nrow(processed_word) - 1),
              nrow(processed_word) - 1
            )),
  backgroundColor = "#FFE400",
  rotateRatio = .4,
  shape = "diamond",
  gridSize = 7,
  ellipticity = .6,
  shuffle=FALSE
)

wc

In a wordcloud, frequency = font size, and because 대탈출 was the search keyword it appears in practically every tweet and came out far too large. So I cap its count at twice that of the second most frequent word to keep the picture balanced.
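The capping step is plain arithmetic; a minimal Python sketch with invented counts shows the effect. The function name `cap_top_frequency` and the numbers are assumptions for illustration only.

```python
def cap_top_frequency(freqs, ratio=2):
    """Cap the largest count at `ratio` times the runner-up.

    Assumes `freqs` is already sorted in descending order.
    """
    if len(freqs) < 2:
        return list(freqs)
    return [min(freqs[0], freqs[1] * ratio)] + list(freqs[1:])

# Made-up counts: the search keyword dwarfs everything else.
print(cap_top_frequency([1800, 300, 250]))
```

If the top word is already within twice the runner-up, `min` leaves it unchanged.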

The Great Escape poster uses only black lettering on a yellow background, so let's match that.

I looked up the poster and grabbed the yellow background color with a color picker.

I matched the show's theme colors, and it turns out to be the color of Norang Tongdak. Infuriating.

Still, it came out looking nice. Let's take a look.

  • Sure enough, PPL and 노랑통닭 (Norang Tongdak) show up very large.
  • 재밌다 ("fun") appears a lot. So plenty of people enjoyed it...
  • This episode was shot on the Arthdal Chronicles (아스달연대기) set, so words like 아스달, 연대기, and 세트장 (set) show up too.
  • There were many mentions of 여고추리반 (High School Mystery Club), PD Jung Jong-yeon's previous show.
  • Beyond that, words like 시즌 (season), 시작 (start), 기대 (anticipation), and 스케일 (scale) reflect the anticipation around the premiere.

Drawing the semantic network

While I'm at it, let's draw a semantic network too.

library("widyr")
library("tidygraph")
library("ggraph")
library("showtext")

SOURCE_NAME <- "twitter"

# get_latest_data() and get_text() are helper functions that are not shown in this post.
texts <- readRDS(get_latest_data(SOURCE_NAME)) %>%
  get_text(SOURCE_NAME)

texts <- preprocess_text(texts)
words <- tokenize_text(texts)
# -------------------------------------------------------------------------

target_morph <- c("nng", "nnp", "va", "xr", "sl", "@")

pair <- words %>%
  mutate(word=ifelse(morph=="va", paste0(word,"다"), word),
         word=ifelse(word %in% names(synonym_dict),
                     synonym_dict[word],
                     word),
         word=toupper(word)) %>%
  filter(word!="대탈출",
         word!="같다",
         nchar(word)>1,
         morph %in% target_morph) %>%
  # count how often two words appear in the same tweet (same id)
  pairwise_count(item=word,
                 feature=id,
                 sort=TRUE)


# remove unrelated keywords
trash <- c("LT", "GT",
           "결제", "티빙",
           "유니", "버스",
           "피오", "블락비", "BLOCKB",
           "같다",
           "인성", "최고", "신사", "축하", "생일", "HAPPYINSEONGDAY")

set.seed(1)
graph_component <- pair %>%
  filter(n>7,
         !((item1 %in% trash) | (item2 %in% trash))) %>%  # drop pairs where either word is in trash
  as_tbl_graph(directed=FALSE) %>%
  mutate(centrality=centrality_degree(),
         group=as.factor(group_infomap()))

ggraph(graph_component,
      layout="nicely") +
  geom_edge_link(color="gray50",
                 alpha=.5) + 
  geom_node_point(aes(color=group,
                  size=centrality),
                  show.legend = FALSE) +
  scale_size(range=c(5, 15)) +
  geom_node_text(aes(label=name),
                 repel=TRUE,
                 size=5,
                 family="naumgothic") +
  theme_graph()

After drawing the semantic network, mentions of P.O (피오), one of the cast members, and his group Block B (BLOCKB) dominated it, so I excluded them.
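The edges of the network come from counting how often two words share a tweet. Here is a small Python sketch of that idea with an exclusion list applied. It is not widyr's implementation: it counts each unordered pair once, whereas widyr's pairwise_count reports both orderings by default. The function name and the sample rows are made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_count(rows, trash=frozenset()):
    """Count how many tweets each unordered word pair appears in together."""
    words_by_tweet = {}
    for tweet_id, word in rows:
        if word not in trash:                      # drop excluded keywords entirely
            words_by_tweet.setdefault(tweet_id, set()).add(word)
    pairs = Counter()
    for words in words_by_tweet.values():
        # sorted() gives every pair a canonical order, so (a, b) == (b, a).
        pairs.update(combinations(sorted(words), 2))
    return pairs

# Hypothetical (tweet id, word) rows, invented for this example.
rows = [(1, "노랑통닭"), (1, "PPL"), (1, "여고추리반"),
        (2, "노랑통닭"), (2, "PPL"),
        (3, "PPL"), (3, "피오")]
pairs = cooccurrence_count(rows, trash={"피오"})
for pair, n in pairs.most_common():
    print(pair, n)
```

Keeping only pairs above a count threshold (the `n>7` filter above) is what turns this table into a readable graph.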

Same as before, let's take a look.

  • In the upper left, the mentions of the Arthdal Chronicles set form their own cluster.
  • On the right, the mentions of Norang Tongdak also form their own cluster.
  • Near the center, 여고추리반 (High School Mystery Club), 노랑통닭, and PPL are connected. I looked into it, and many tweets said that the previous show, High School Mystery Club, featured Norang Tongdak so heavily that it eventually landed the PPL deal.
  • Toward the bottom center there are also many mentions of the season starting.
  • Admiration for the scale of the time machine storyline that has been running since the Gyeongseong episode appears alongside harsh reviews saying it is all scale and no fun.
  • If you watched the premiere, you will probably get the tiger-related part.

Wrapping up

It was fun in its own way.

I've been posting bit by bit since the premiere, and in the meantime episode 2 has already aired.
Of course I ran the same code after episode 2 as well.
I should post just the results of that one later...

Oh, and there's also a YouTube version built from comments collected on the next-episode preview, and critical opinions seem far more common there.
I plan to post the YouTube comment results someday too...

When I find the time, I want to tidy up the code and set it to run automatically every week.

The end!
