Predicting the Probability That a Best Challenge Webtoon Is Promoted to Official Serialization - 2. Crawling Episodes

조은진 · February 5, 2023

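Using the list of webtoons collected in the previous step, this post crawls information for the first three and the latest three episodes of each title.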

Package imports

import pandas as pd
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.common.alert import Alert
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Before collecting anything, the webdriver has to be pointed at the URL of the title.

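from selenium.webdriver.chrome.service import Service
# Selenium 4 takes the driver path through a Service object
wd = webdriver.Chrome(service=Service('C:/chromedriver.exe'))
# webtoon is the previously collected list of webtoons; i is the row being processed
url = "https://comic.naver.com/bestChallenge/list?titleId=" + str(webtoon.loc[i,'titleId'])
wd.get(url)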

❤️ Heart count

Right-click the button that shows the heart count and choose Inspect to open the developer tools.

Then right-click '9,210', choose Copy -> Copy selector, and pass the copied selector to find_element.

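Exception) Series titles

For Series titles, the '시리즈에서 보기' (View on Series) button pushes the heart-count button one position to the right. I therefore ran Copy selector on such a page in the same way and entered that selector separately.
Whether a title is a Series title was decided by whether the '시리즈에서 보기' button is present in the developer tools.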

try: # heart count for a Series title
    series = '시리즈' in wd.find_element(By.CSS_SELECTOR, '#content > div.comicinfo > div.detail > ul > li:nth-child(4) > a > span').text
    heart = wd.find_element(By.CSS_SELECTOR,'#content > div.comicinfo > div.detail > ul > li:nth-child(6) > div > a > em').text
except Exception: # heart count for a non-Series title
    heart = wd.find_element(By.CSS_SELECTOR,'#content > div.comicinfo > div.detail > ul > li:nth-child(5) > div > a > em').text
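
The only difference between the two branches is the position of the list item, and the scraped value is still text such as '9,210'. As a small sketch of how both points could be handled, the helpers below build the selector from a flag and turn the text into a number. The names heart_selector and parse_count are hypothetical and not part of the original crawler.

```python
def heart_selector(has_series_button: bool) -> str:
    """Return the CSS selector of the heart count on a title page."""
    # The '시리즈에서 보기' button moves the heart button from the 5th to the 6th <li>.
    position = 6 if has_series_button else 5
    return ("#content > div.comicinfo > div.detail > ul > "
            f"li:nth-child({position}) > div > a > em")


def parse_count(text: str) -> int:
    """Convert a scraped count such as '9,210' into an integer."""
    return int(text.replace(",", "").strip())


print(heart_selector(False))
print(parse_count("9,210"))
```

The selector string can be passed to wd.find_element(By.CSS_SELECTOR, ...) exactly as in the snippet above, and parse_count can be applied to the .text that comes back.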

📚 Type genre

The type genre (episode / omnibus / story) can be read from the element whose class is "on". It is extracted with find_element in the same way.

typeGenre = wd.find_element(By.CSS_SELECTOR,'#content > div.snb > ul > li.on').text

✔️ Per-episode rating, number of raters, upload date, and views

On the episode page, find_element collects the rating, the number of raters, the upload date, and the view count.

star = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
starPar = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
views = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
day = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
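
The upload date comes back as text, and the full code in this post compares the date of the newest episode against 2022.12.02. A minimal sketch of that comparison with the standard library is shown below. It assumes the text is formatted as 'YYYY.MM.DD' (the format of the cutoff literal in the full code); parse_upload_date and uploaded_after_cutoff are hypothetical helper names.

```python
from datetime import date, datetime

CUTOFF = date(2022, 12, 2)  # the cutoff date that appears in the full code


def parse_upload_date(text: str) -> date:
    """Parse a date string in the assumed 'YYYY.MM.DD' format."""
    return datetime.strptime(text.strip(), "%Y.%m.%d").date()


def uploaded_after_cutoff(text: str, cutoff: date = CUTOFF) -> bool:
    """True when the episode was uploaded strictly after the cutoff date."""
    return parse_upload_date(text) > cutoff


print(uploaded_after_cutoff("2022.12.03"))
```

If the page shows dates in another format, strptime raises ValueError, which makes a layout change visible instead of silently producing wrong comparisons.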

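💬 Per-episode comments

Collecting every comment looked impractical, so I collected at most five 'BEST댓글' (BEST comments) per episode. The like and dislike counts of each comment also seemed likely to be useful variables, so they are collected as well.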

comment = pd.DataFrame(columns= ['titleId','isPublic','comment','like','hate'])
# This excerpt runs inside the per-episode loop of the full code, hence the continue.
# Open the comment frame
wd.switch_to.frame('commentIframe')
# No 'BEST댓글' available == the sort option is set to '전체댓글' (all comments)
if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
    wd.switch_to.default_content(); wd.back(); continue
# Collect the 'BEST댓글'
commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
for j in range(commentNum):
    comment.loc[len(comment)] = [titleId, isPublic,
                                 wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                 wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                 wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
wd.switch_to.default_content(); wd.back()
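
The "at most five comments" rule can also be written without index arithmetic. The sketch below works on plain lists of text so that it can be tried without a browser. top_comments is a hypothetical helper, not part of the original crawler.

```python
def top_comments(contents, likes, hates, title_id, is_public, limit=5):
    """Build up to `limit` rows of [titleId, isPublic, comment, like, hate]."""
    rows = []
    # Slicing caps the number of comments; zip pairs each comment with its counts.
    for text, like, hate in zip(contents[:limit], likes, hates):
        rows.append([title_id, is_public, text, like, hate])
    return rows


rows = top_comments(["first", "second"], ["12", "3"], ["0", "1"],
                    title_id=1, is_public=True)
print(rows)
```

In the crawler, the three lists would come from the .text of the elements returned by wd.find_elements(By.CLASS_NAME, ...) for 'u_cbox_contents', 'u_cbox_cnt_recomm', and 'u_cbox_cnt_unrecomm'. Note that zip stops at the shortest list, so a comment without a like or dislike counter is dropped quietly rather than raising an IndexError.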
            

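💻 Full code

The number in each variable name identifies the episode. For example, star1 is episode 1, star2 is episode 2, ..., and star(-1) is the latest episode. Episodes with '공지' (notice) in the title do not serve the purpose of the collection, so they are skipped.
The collection order is: latest episode -> second latest -> third latest -> click '첫화보기' (View first episode) to open the first episode -> second episode -> third episode.
Calling time.sleep() at several points in the code (after navigating to a URL, after switching to the comment frame, and so on) reduces errors.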
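
A fixed time.sleep() either waits too long or not long enough, and a bare retry loop never ends if the element does not appear at all. One alternative, sketched here under my own assumptions, is a retry helper with a deadline. retry_until is a hypothetical name, and the default timeout and interval are arbitrary.

```python
import time


def retry_until(action, timeout=10.0, interval=0.5):
    """Call action() until it stops raising, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # re-raise the last error instead of looping forever
            time.sleep(interval)


# Example with a stand-in for a page element that becomes available on the third try.
attempts = []

def read_heart():
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError("element not ready yet")
    return "9,210"

print(retry_until(read_heart, timeout=2.0, interval=0.01))
```

In the crawler, action could be a lambda that wraps a wd.find_element(...).text call. Selenium's own WebDriverWait together with expected_conditions, both already imported in this post, provide the same kind of bounded waiting.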

# import
import pandas as pd
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.common.alert import Alert
from selenium.webdriver.support.ui import WebDriverWait
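from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

# Crawl the information for each webtoon
comment = pd.DataFrame(columns= ['titleId','isPublic','comment','like','hate'])
alertList = [] # titles whose first episode does not open because of an alert popup: crawl these separately!
wd = webdriver.Chrome(service=Service('C:/chromedriver.exe')) # Selenium 4 takes the driver path through a Service object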

for i in range(len(webtoon)):
    # Navigate to the URL
    url = "https://comic.naver.com/bestChallenge/list?titleId=" + str(webtoon.loc[i,'titleId'])
    wd.get(url)
    
    # Heart count
    try: # Series title
        series = '시리즈' in wd.find_element(By.CSS_SELECTOR, '#content > div.comicinfo > div.detail > ul > li:nth-child(4) > a > span').text
        while True: 
            # The button is not created right after navigating, so retry in a loop. time.sleep() works as well.
            try: 
                webtoon.loc[i,'heart'] = wd.find_element(By.CSS_SELECTOR,'#content > div.comicinfo > div.detail > ul > li:nth-child(6) > div > a > em').text
                break
            except: pass
    except: # not a Series title
        while True:
            try: 
                webtoon.loc[i,'heart'] = wd.find_element(By.CSS_SELECTOR,'#content > div.comicinfo > div.detail > ul > li:nth-child(5) > div > a > em').text
                break
            except: pass
    
    # Type genre: episode / omnibus / story
    webtoon.loc[i,'typeGenre'] = wd.find_element(By.CSS_SELECTOR,'#content > div.snb > ul > li.on').text
    
    pageCnt = len(wd.find_elements(By.CSS_SELECTOR, '#content > table > tbody > tr > td.title > a')) # number of episodes on the first page (<= 10)
    ## Webtoons with 3 or fewer episodes carry too little information, so they are not collected.
    if pageCnt <= 3: continue 
    
    tryCnt = 0 # number of episodes attempted
    dataCnt = 0 # number of episodes collected
    isalert = False # whether an alert popup appeared
    
    for no in range(0,pageCnt):
        if (dataCnt == 3) or (tryCnt == pageCnt): break
        tryCnt += 1
        
        ## Keep the latest date consistent: skip the newest episode if it was uploaded after 2022.12.02
        if (no == 0) and (pd.to_datetime(wd.find_elements(By.CSS_SELECTOR, '#content > table > tbody > tr > td.num')[0].text) > pd.to_datetime('2022.12.02')): continue 
        
        ## Skip episodes with '공지' (notice) in the title
        title = wd.find_elements(By.CSS_SELECTOR,'#content > table > tbody > tr > td.title > a')[no].text
        if '공지' in title: continue
    
        ## Click the latest episode
        wd.find_elements(By.CSS_SELECTOR,'#content > table > tbody > tr > td.title > a')[no].click()
        dataCnt += 1
        
        if dataCnt == 1: ## latest episode
            webtoon.loc[i,'star(-1)'] = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
            webtoon.loc[i,'starPar(-1)'] = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
            webtoon.loc[i,'views(-1)'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
            webtoon.loc[i,'day(-1)'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
      
            ## Comments
            wd.switch_to.frame('commentIframe')
            if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
                wd.switch_to.default_content(); wd.back(); continue
            commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
            for j in range(commentNum):
                comment.loc[len(comment)] = [webtoon.loc[i,'titleId'],webtoon.loc[i,'isPublic'],
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
            wd.switch_to.default_content(); wd.back()
            
        elif dataCnt == 2: ## second latest episode
            webtoon.loc[i,'star(-2)'] = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
            webtoon.loc[i,'starPar(-2)'] = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
            webtoon.loc[i,'views(-2)'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
            webtoon.loc[i,'day(-2)'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
    
            wd.switch_to.frame('commentIframe') # crawl comments
            if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
                wd.switch_to.default_content(); wd.back(); continue
            commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
            for j in range(commentNum):
                comment.loc[len(comment)] = [webtoon.loc[i,'titleId'],webtoon.loc[i,'isPublic'],
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
            wd.switch_to.default_content(); wd.back()
            
        elif dataCnt == 3: ## third latest episode
            webtoon.loc[i,'star(-3)'] = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
            webtoon.loc[i,'starPar(-3)'] = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
            webtoon.loc[i,'views(-3)'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
            webtoon.loc[i,'day(-3)'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
    
            wd.switch_to.frame('commentIframe') # crawl comments
            if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
                wd.switch_to.default_content(); continue
            commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
            for j in range(commentNum):
                comment.loc[len(comment)] = [webtoon.loc[i,'titleId'],webtoon.loc[i,'isPublic'],
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                             wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
            wd.switch_to.default_content()
            
    if (tryCnt == pageCnt): continue
    
    ## Episode 1
    try:
        tryCnt += 1
        wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.comicinfo > div.detail > ul > li:nth-child(2) > a').click()
        alert = Alert(wd); alert.accept()
        alertList += [i]
        continue
    except: pass
    
    title = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > h3').text
    while ('공지' in title) and (tryCnt < pageCnt):
        tryCnt += 1
        wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > div > span.next > a').click()
        title = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > h3').text
    
    webtoon.loc[i,'star1'] = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
    webtoon.loc[i,'starPar1'] = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
    webtoon.loc[i,'views1'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
    webtoon.loc[i,'day1'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
      
    wd.switch_to.frame('commentIframe') # crawl comments
    time.sleep(1.5)
    if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
        wd.switch_to.default_content(); continue
    commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
    for j in range(commentNum):
        comment.loc[len(comment)] = [webtoon.loc[i,'titleId'],webtoon.loc[i,'isPublic'],
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
    wd.switch_to.default_content()
    if tryCnt == pageCnt: continue
    
    ## Episode 2
    tryCnt += 1
    wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > div > span.next > a').click()
    title = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > h3').text
    while ('공지' in title) and (tryCnt < pageCnt):
        tryCnt += 1
        wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > div > span.next > a').click()
        title = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > h3').text
    
    webtoon.loc[i,'star2'] = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
    webtoon.loc[i,'starPar2'] = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
    webtoon.loc[i,'views2'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
    webtoon.loc[i,'day2'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
    
    wd.switch_to.frame('commentIframe') # crawl comments
    time.sleep(0.5)
    if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
        wd.switch_to.default_content(); continue
    commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
    for j in range(commentNum):
        comment.loc[len(comment)] = [webtoon.loc[i,'titleId'],webtoon.loc[i,'isPublic'],
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
    wd.switch_to.default_content()
    if tryCnt == pageCnt: continue
    
    ## Episode 3
    tryCnt += 1
    wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > div > span.next > a').click()
    title = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > h3').text
    while ('공지' in title) and (tryCnt < pageCnt):
        tryCnt += 1
        wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > div > span.next > a').click()
        title = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.view > h3').text
    
    webtoon.loc[i,'star3'] = wd.find_element(By.CSS_SELECTOR, '#topPointTotalNumber').text
    webtoon.loc[i,'starPar3'] = wd.find_element(By.CSS_SELECTOR, '#topTotalStarPoint > span.pointTotalPerson > em').text
    webtoon.loc[i,'views3'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)').text
    webtoon.loc[i,'day3'] = wd.find_element(By.CSS_SELECTOR, '#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)').text
    
    wd.switch_to.frame('commentIframe') # crawl comments
    if wd.find_element(By.CSS_SELECTOR,'#cbox_module > div > div.u_cbox_sort > div.u_cbox_sort_option > div > ul > li.u_cbox_sort_option_wrap.u_cbox_sort_option_on').text == '전체댓글':
        wd.switch_to.default_content(); continue
    commentNum = min(len(wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')),5) # handles episodes with fewer than 5 comments
    for j in range(commentNum):
        comment.loc[len(comment)] = [webtoon.loc[i,'titleId'],webtoon.loc[i,'isPublic'],
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_contents')[j].text,
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_recomm')[j].text,
                                     wd.find_elements(By.CLASS_NAME, 'u_cbox_cnt_unrecomm')[j].text]
    wd.switch_to.default_content()
    
# Save the webtoon and comment data as CSV files
webtoon.to_csv('mywebtoon_data.csv', index= False)
comment.to_csv('mycomment_data.csv', index= False)
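
The rule "take the first three episodes whose title does not contain '공지'" is spread across the loop above. Isolated as a plain function it is easier to check. This is only a sketch, and pick_episode_indices is a hypothetical name that does not appear in the crawler.

```python
def pick_episode_indices(titles, limit=3, skip_word="공지"):
    """Return the positions of the first `limit` titles that are not notices."""
    picked = []
    for index, title in enumerate(titles):
        if skip_word in title:
            continue  # notice posts are not real episodes
        picked.append(index)
        if len(picked) == limit:
            break
    return picked


# Titles as they might appear on the list page, newest first (made-up examples).
titles = ["5화", "휴재 공지", "4화", "3화", "2화"]
print(pick_episode_indices(titles))
```

The returned positions could then be used to index the list of td.title > a links on the list page.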

My coding skills are still limited, so the code is quite long, but I would like to tidy it up into something more concise.
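
One possible direction, as a rough sketch rather than the code actually used here: the six episode blocks differ only in the column suffix, so the selectors can live in one table and the column names can be generated. STAT_SELECTORS, column_name, episode_stats, and read_text are hypothetical names introduced for this illustration.

```python
# Selectors copied from the crawler above, keyed by the column prefix.
STAT_SELECTORS = {
    "star": "#topPointTotalNumber",
    "starPar": "#topTotalStarPoint > span.pointTotalPerson > em",
    "views": "#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(4)",
    "day": "#sectionContWide > div.tit_area > div.vote_lst > dl.rt > dd:nth-child(2)",
}


def column_name(stat: str, episode: int) -> str:
    """star1 for the first episode, star(-1) for the latest one."""
    return f"{stat}{episode}" if episode > 0 else f"{stat}({episode})"


def episode_stats(read_text, episode: int) -> dict:
    """Collect all four values for one episode.

    read_text is any callable that maps a CSS selector to the element text.
    """
    return {column_name(stat, episode): read_text(selector)
            for stat, selector in STAT_SELECTORS.items()}


# Stand-in for a loaded episode page, so the sketch runs without a browser.
fake_page = {selector: f"<{stat}>" for stat, selector in STAT_SELECTORS.items()}
print(episode_stats(fake_page.get, -1))
```

With a real driver, read_text could be lambda sel: wd.find_element(By.CSS_SELECTOR, sel).text, and each key and value of the returned dict would be written to webtoon.loc[i, key].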
