[Big Leader AI] Crawler, Day 1

KrTeaparty · July 5, 2022

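How web crawling works

The crawling process
1. Tell Selenium to crawl a specific web page.
2. Selenium launches the WebDriver specified in the source code and opens the web page.
3. The opened page is brought to the local machine as HTML.
4. Beautiful Soup selects only the needed parts from the full HTML that was collected.
5. The selected parts are saved to a file in the desired format.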
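
Steps 4 and 5 can be sketched without a browser. This is a minimal sketch under stated assumptions, not the definitive pipeline: `extract_and_save`, `page_html`, and the output file name are hypothetical names made up for illustration, and a static HTML string stands in for what Selenium would return in steps 1 to 3.

```python
import os
import tempfile
from bs4 import BeautifulSoup


def extract_and_save(html, css_selector, path):
    """Select matching tags from html and write their text to path, one per line."""
    soup = BeautifulSoup(html, 'html.parser')
    texts = [tag.get_text(strip=True) for tag in soup.select(css_selector)]
    with open(path, 'w', encoding='UTF-8') as f:
        f.write('\n'.join(texts))
    return texts


# Stand-in for steps 1-3: with Selenium this string would be driver.page_source
page_html = '<html><body><h2>Title A</h2><p>skip</p><h2>Title B</h2></body></html>'

# A temporary directory keeps the sketch self-contained
out_path = os.path.join(tempfile.mkdtemp(), 'titles.txt')
titles = extract_and_save(page_html, 'h2', out_path)
print(titles)
```

Keeping the parse-and-save step separate from the browser step makes it possible to test the selection logic on saved HTML before running the crawler against a live site.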

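Typing a keyword into a search box and running the search automatically

Steps
1. Open the web page
2. Find the search box
3. Click the search box
4. Type the keyword
5. Run the search

Example: type a keyword into the search box on the Naver site and run the search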

# Load modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
import time          

# Prepare the keyword
print("=" * 100)
print(" This crawler is a web crawler for practice exercises.")
print("=" * 100)
query_txt = "빅데이터"  # search keyword ("big data")
print("\n")

# Set up the Chrome driver and open the web page
s = Service("../chrome_driver/chromedriver.exe")
driver = webdriver.Chrome(service=s)

url = 'https://www.naver.com/'
driver.get(url)
time.sleep(3)
driver.maximize_window()

# Type the keyword and run the search
element = driver.find_element(By.ID,'query')
driver.find_element(By.ID,'query').click( )
element.send_keys(query_txt)
element.send_keys("\n")

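This code is the basic template.
Parts of it may be added or changed, but the basic flow stays the same: find the element, then act on it.

Only the important parts are covered here.
First, time.sleep() serves more than one purpose. For example, it can wait for a page to finish loading, or space out actions so the crawler looks less like a bot.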
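
For the second use, a fixed delay is easy to recognize, so one common variation is to randomize it. The helper below is a minimal sketch; `polite_sleep` and its default bounds are hypothetical choices for illustration, not part of Selenium.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Short bounds here only so the example finishes quickly
waited = polite_sleep(0.1, 0.3)
print(f'waited {waited:.2f} seconds')
```

For the first use (waiting for a page to load), Selenium also provides explicit waits through WebDriverWait, which return as soon as a condition is met instead of always sleeping for the full duration.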

# Type the keyword and run the search
element = driver.find_element(By.ID,'query')
driver.find_element(By.ID,'query').click( )
element.send_keys(query_txt)
element.send_keys("\n")

find_element() locates an element. An element is anything on the web page that you act on, such as something you click or type a value into.

How to use find_element()

find_element(By.ID, 'id value')
find_element(By.NAME, 'name value')
find_element(By.XPATH, 'xpath value')
find_element(By.LINK_TEXT, 'link text')
find_element(By.TAG_NAME, 'tag name')

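Additional things to watch out for

On a responsive web page, the same part of the page can have different tags depending on the window size.
If a popup appears, it is best to handle it.
XPaths change from time to time, so code that ran yesterday may stop working.
Sites may block you to prevent crawling, so be careful.

Below is an example of code that closes popup windows.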

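main = driver.window_handles
for handle in main:
    if handle != main[0]:
        # switch_to_window() was removed in Selenium 4; use switch_to.window()
        driver.switch_to.window(handle)
        driver.close()
# Return focus to the original window
driver.switch_to.window(main[0])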
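
The same logic can be wrapped in a function so it can be called after any step that might open a popup. This is a sketch, assuming (as the snippet above does) that the first handle is the original window; `close_extra_windows` is a hypothetical helper name. It only uses `window_handles`, `switch_to.window()`, and `close()`, so it needs no imports of its own.

```python
def close_extra_windows(driver):
    """Close every window except the first one and return how many were closed."""
    handles = driver.window_handles
    main_handle = handles[0]          # assumption: the first handle is the main window
    closed = 0
    for handle in handles[1:]:
        driver.switch_to.window(handle)
        driver.close()                # closes the window that currently has focus
        closed += 1
    # Put focus back on the main window so later find_element() calls work
    driver.switch_to.window(main_handle)
    return closed
```

A typical call would be `close_extra_windows(driver)` right after `driver.get(url)`.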

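Beautiful Soup

Beautiful Soup parses a web page's HTML and pulls out tags, mainly through the find(), find_all(), and select() functions.

find()

find() returns only the first tag that matches the given conditions.
That makes it useful for checking whether anything on the page matches a condition, or when only the first match is needed.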

# Beautiful Soup example 1
from bs4 import BeautifulSoup
ex1 = '''
<html>
    <head>
        <title> HTML practice </title>
    </head>
    <body>
        <p align="center"> text 1 </p>
        <img src="c:\\temp\\image\\솔개.png">
    </body>
</html> '''

soup = BeautifulSoup(ex1, 'html.parser')
print( soup.find('title') )
print( soup.find('p') )
print( soup.find('p', align="center") )

<title> HTML practice </title>
<p align="center"> text 1 </p>
<p align="center"> text 1 </p>
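
One detail worth checking when using find() as an existence test: when nothing matches, it returns None rather than raising an error. The sketch below uses a small made-up HTML string (not the example above) to show that, along with reading an attribute from the matched tag.

```python
from bs4 import BeautifulSoup

html = '<body><p align="center"> text 1 </p><p> text 2 </p></body>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')        # first <p> only
missing = soup.find('table')    # there is no <table> in this document

print(first_p['align'])         # attribute access on the matched tag
print(missing is None)          # find() returns None when nothing matches

# Guard before using the result to avoid an AttributeError on None
table_text = missing.get_text() if missing is not None else ''
```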

find_all()

find_all() returns every matching tag as a list when there is more than one.

# Beautiful Soup example 3
from bs4 import BeautifulSoup
ex1 = '''
<html>
    <head>
        <title> HTML practice </title>
    </head>
    <body>
        <p align="center"> text 1 </p>
        <p align="center"> text 2 </p>
        <p align="center"> text 3 </p>
        <img src="c:\\temp\\image\\솔개.png">
    </body>
</html> '''

soup = BeautifulSoup(ex1, 'html.parser')
print( soup.find_all('p') )
print( soup.find_all('p')[0] )
print( soup.find_all('p')[1] )
print( soup.find_all('p')[2] )
print( soup.find_all(['p', 'img']) )

[<p align="center"> text 1 </p>, <p align="center"> text 2 </p>, <p align="center"> text 3 </p>]
<p align="center"> text 1 </p>
<p align="center"> text 2 </p>
<p align="center"> text 3 </p>
[<p align="center"> text 1 </p>, <p align="center"> text 2 </p>, <p align="center"> text 3 </p>, <img src="c:\temp\image\솔개.png"/>]
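
find_all() also accepts attribute filters and a limit argument. The sketch below uses a made-up HTML string with mixed align values to show both, plus a list comprehension that turns the result into plain text.

```python
from bs4 import BeautifulSoup

html = '''
<p align="center"> text 1 </p>
<p align="left"> text 2 </p>
<p align="center"> text 3 </p>
'''
soup = BeautifulSoup(html, 'html.parser')

centered = soup.find_all('p', align='center')   # filter by attribute value
first_two = soup.find_all('p', limit=2)          # stop after two matches
texts = [p.get_text(strip=True) for p in soup.find_all('p')]

print(len(centered))
print(len(first_two))
print(texts)
```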

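select()

select() finds tags using CSS selectors. The sample HTML below is a shopping list of fruit: 바나나 (banana), 체리 (cherry), and 오렌지 (orange), each with a price (원 = won), a count (개 = items), and a store (가게).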

ex2='''
<html>
    <head>
        <h1> 사야할 과일 </h1>
    </head>
    <body>
        <h1> 시장가서 사야할 과일 목록 </h1>
            <div><p id='fruit1' class='name1' title='바나나'> 바나나
                <span class='price'> 3000원 </span>
                <span class='count'> 10개 </span>
                <span class='store'> 바나나가게 </span>
                <a href='https://www.banana.com'> banana.com </a>
                </p>
            </div>
             <div><p id='fruit2' class='name2' title='체리'> 체리
                <span class='price'> 100원 </span>
                <span class='count'> 50개 </span>
                <span class='store'> 체리가게 </span>
                <a href='https://www.cherry.com'> cherry.com </a>
                </p>
            </div>
             <div><p id='fruit3' class='name3' title='오렌지'> 오렌지
                <span class='price'> 500원 </span>
                <span class='count'> 20개 </span>
                <span class='store'> 오렌지가게 </span>
                <a href='https://www.orange.com'> banana.com </a>
                </p>
            </div>
        </body>
    </html> '''

select('tag name')

soup2 = BeautifulSoup(ex2 , 'html.parser')

soup2.select('p')

[<p class="name1" id="fruit1" title="바나나"> 바나나
                 <span class="price"> 3000원 </span>
 <span class="count"> 10개 </span>
 <span class="store"> 바나나가게 </span>
 <a href="https://www.banana.com"> banana.com </a>
 </p>,
 <p class="name2" id="fruit2" title="체리"> 체리
                 <span class="price"> 100원 </span>
 <span class="count"> 50개 </span>
 <span class="store"> 체리가게 </span>
 <a href="https://www.cherry.com"> cherry.com </a>
 </p>,
 <p class="name3" id="fruit3" title="오렌지"> 오렌지
                 <span class="price"> 500원 </span>
 <span class="count"> 20개 </span>
 <span class="store"> 오렌지가게 </span>
 <a href="https://www.orange.com"> banana.com </a>
 </p>]

select('.class name')

soup2.select(' .name1 ')

[<p class="name1" id="fruit1" title="바나나"> 바나나
                 <span class="price"> 3000원 </span>
 <span class="count"> 10개 </span>
 <span class="store"> 바나나가게 </span>
 <a href="https://www.banana.com"> banana.com </a>
 </p>]

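select('parent tag > child tag > child tag')

Put a space on each side of ">". Beautiful Soup versions before 4.7 require it; newer versions, which delegate selectors to Soup Sieve, also accept the form without spaces, but spaced selectors work everywhere.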

soup2.select(' div > p > span')
soup2.select(' div > p > span')[0]

[<span class="price"> 3000원 </span>,
 <span class="count"> 10개 </span>,
 <span class="store"> 바나나가게 </span>,
 <span class="price"> 100원 </span>,
 <span class="count"> 50개 </span>,
 <span class="store"> 체리가게 </span>,
 <span class="price"> 500원 </span>,
 <span class="count"> 20개 </span>,
 <span class="store"> 오렌지가게 </span>]
 <span class="price"> 3000원 </span>
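
To make the meaning of ">" concrete, the sketch below compares the child combinator with the descendant combinator (a plain space) on a small made-up HTML string. It assumes Beautiful Soup 4.7 or later, where the selector without spaces is also accepted.

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <p id="fruit1"> banana
        <span class="price"> 3000 </span>
        <a href="#"><span class="link-label"> shop </span></a>
    </p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

spaced = soup.select('div > p > span')    # spans that are direct children of <p>
unspaced = soup.select('div>p>span')      # same selector written without spaces
descendants = soup.select('div span')     # spans at any depth below <div>

print(spaced == unspaced)
print(len(spaced), len(descendants))
```

The span nested inside the link is matched by 'div span' but not by 'div > p > span', because its parent is the a tag, not the p tag.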

select('parent tag.class name > child tag.class name')

soup2.select(' p.name1 > span.store ')

[<span class="store"> 바나나가게 </span>]

select('#id')

soup2.select(' #fruit1')

[<p class="name1" id="fruit1" title="바나나"> 바나나
                 <span class="price"> 3000원 </span>
 <span class="count"> 10개 </span>
 <span class="store"> 바나나가게 </span>
 <a href="https://www.banana.com"> banana.com </a>
 </p>]

select('#id > tag.class name')

soup2.select(' #fruit1 > span.store')

[<span class="store"> 바나나가게 </span>]

select('tag[attribute=value]')

soup2.select('a[href]')
soup2.select('a[href]')[0]

[<a href="https://www.banana.com"> banana.com </a>,
 <a href="https://www.cherry.com"> cherry.com </a>,
 <a href="https://www.orange.com"> banana.com </a>]
<a href="https://www.banana.com"> banana.com </a>
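
The example above matches on the presence of the href attribute. Attribute selectors can also test the value. The sketch below uses a made-up HTML string to show an exact match, a prefix match, and select_one(), which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

html = '''
<p id="fruit1"><a href="https://www.banana.com"> banana.com </a></p>
<p id="fruit2"><a href="https://www.cherry.com"> cherry.com </a></p>
<p id="fruit3"><a name="no-link"> orange </a></p>
'''
soup = BeautifulSoup(html, 'html.parser')

has_href = soup.select('a[href]')                          # attribute is present
exact = soup.select('a[href="https://www.cherry.com"]')   # attribute equals value
prefix = soup.select('a[href^="https://www.b"]')          # attribute starts with value
first = soup.select_one('a[href]')                         # first match, or None

print(len(has_href), len(exact), len(prefix))
print(first['href'])
```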

Extracting only the text inside a tag

string

txt = soup.find('p')
txt.string

' text 1 '

get_text()

txt3 = soup.find_all('p')
for i in txt3:
    print(i.get_text())

text 1
text 2
text 3
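
The two approaches differ once a tag contains more than plain text. The sketch below uses a made-up HTML string: .string returns None when a tag has more than one child, while get_text() joins the text of all descendants.

```python
from bs4 import BeautifulSoup

html = '<p> hello <b>world</b></p><p> plain </p>'
soup = BeautifulSoup(html, 'html.parser')
nested, plain = soup.find_all('p')

print(plain.string)                       # single text child: returns that text
print(nested.string)                      # text plus a <b> child: returns None
print(nested.get_text())                  # concatenates all descendant text
print(nested.get_text(' ', strip=True))   # strips each piece, joins with a space
```

In practice, get_text() is the safer default for scraped pages, since real markup often nests inline tags inside the text.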

Example: search the Naver site by keyword, select the "News" category, collect the listed articles, and save them in txt format

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import sys
import time          

query_txt = "빅데이터"
print("\n")

# Set up the Chrome driver and open the web page
s = Service("../chrome_driver/chromedriver.exe")
driver = webdriver.Chrome(service=s)

url = 'https://www.naver.com/'
driver.get(url)
time.sleep(5)
driver.maximize_window()

# Search by keyword
element = driver.find_element(By.ID, 'query')
driver.find_element(By.ID,'query').click( )
element.send_keys(query_txt)
element.send_keys("\n")

# Select the News tab (its link text on the page is '뉴스')
driver.find_element(By.LINK_TEXT, '뉴스').click()
time.sleep(2)

# Extract only the body text with BeautifulSoup
html_1 = driver.page_source
soup_1 = BeautifulSoup(html_1, 'html.parser')

content_1 = soup_1.find('div','group_news').find_all('li')
for i in content_1:
    print(i.get_text().replace('\n', ' ').strip())
    print('\n')
    
# Redirect standard output to save the results to a txt file
f_name = './result/Chap2_practice1.txt'
orig_stdout = sys.stdout
file = open(f_name, 'a', encoding='UTF-8')
sys.stdout = file

for i in content_1:
    print(i.get_text().replace('\n',''))
    
file.close()

sys.stdout = orig_stdout

print('Crawling finished successfully')

By.LINK_TEXT is handy for clicking items in a menu.

# Select the News tab (its link text on the page is '뉴스')
driver.find_element(By.LINK_TEXT, '뉴스').click()
time.sleep(2)

The loop extracts only the text portion of each item that was read.

content_1 = soup_1.find('div','group_news').find_all('li')
for i in content_1:
    print(i.get_text().replace('\n', ' ').strip())
    print('\n')
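
The cleanup step can be tried without a browser. The sketch below is only an illustration: the HTML is hypothetical markup standing in for a search-result page (the real page structure is not shown here), and it swaps replace('\n', ' ') for a split-and-join, which also collapses tabs and repeated spaces.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search-result page
html = '''
<div class="group_news">
  <ul>
    <li>
      <a> First headline </a>
      <span> summary   text </span>
    </li>
    <li><a>Second headline</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns None if the container is missing, so guard before chaining
container = soup.find('div', 'group_news')
items = container.find_all('li') if container is not None else []

# split() with no arguments splits on any run of whitespace, so joining with
# a single space removes newlines, tabs, and repeated spaces in one step
lines = [' '.join(li.get_text().split()) for li in items]
for line in lines:
    print(line)
```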

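This writes to the file by redirecting standard output rather than calling f.write().

Be careful when manipulating standard output: until sys.stdout is restored, every print() call goes to the file.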

f_name = './result/Chap2_practice1.txt'
orig_stdout = sys.stdout
file = open(f_name, 'a', encoding='UTF-8')
sys.stdout = file

for i in content_1:
    print(i.get_text().replace('\n',''))
    
file.close()

sys.stdout = orig_stdout
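
As an alternative to swapping sys.stdout by hand, the standard library's contextlib.redirect_stdout restores the original stream automatically, even if an exception is raised. This is a sketch of that alternative, not the method used above; the `lines` data and the temporary file path are placeholders for illustration.

```python
import contextlib
import os
import tempfile

lines = ['article 1', 'article 2']   # placeholder data

# A temporary directory keeps the sketch self-contained
out_dir = tempfile.mkdtemp()
f_name = os.path.join(out_dir, 'practice.txt')

with open(f_name, 'a', encoding='UTF-8') as file:
    # Inside this block print() writes to the file; stdout is restored on exit
    with contextlib.redirect_stdout(file):
        for line in lines:
            print(line)
    # print() also accepts a target file directly, with no redirection at all
    print('done', file=file)

print('Crawling finished')   # back on the console

with open(f_name, encoding='UTF-8') as file:
    saved = file.read()
```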
