EDA - Web Data Analysis 1: Data Career School Study Notes 12/4

slocat · December 4, 2023


1. BeautifulSoup Basics

1-1. Getting Started

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

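from bs4 import BeautifulSoup

# "filename" is a placeholder for the path to a local HTML file
with open("filename", "r") as f:
    page = f.read()
soup = BeautifulSoup(page, "html.parser")

# Output
page                    # raw HTML string (echoed when run in a notebook cell)
print(page)
print(soup.prettify())  # one tag or string per line, indented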

1-2. Finding Tags

# Attribute access: returns only the first matching tag
soup.head
soup.p

# find(): returns only the first matching tag
soup.find("p")
soup.find("p", class_="inner-text first-item")
soup.find("p", {"class": "inner-text first-item"})
soup.find("p", {"class": "inner-text first-item", "id": "first"})

# find_all(): returns every match as a list-like ResultSet ⭐
soup.find_all("p")
soup.find_all(class_="outer-text")
soup.find_all("p")[0]  # access the first p tag
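
How `class_` matches is easy to misread, so here is a minimal sketch on an inline snippet. The HTML is hypothetical and only borrows the class names used in these notes. It shows that a single class name matches tags carrying several classes, that a value containing a space is compared against the whole class attribute, and what each method gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, modeled on the class names used in these notes
html = """
<div>
  <p class="inner-text first-item" id="first">Happy PinkWink.</p>
  <p class="inner-text second-item">Happy Data Science.</p>
  <p class="outer-text first-item" id="second"><b>Data Science is funny.</b></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that carries it, even among other classes
inner = soup.find_all("p", class_="inner-text")

# A value with a space is compared against the whole class attribute
exact = soup.find_all("p", class_="inner-text first-item")

# find() returns None when nothing matches; find_all() returns an empty result
missing_one = soup.find("p", class_="no-such-class")
missing_all = soup.find_all("p", class_="no-such-class")

print(len(inner), len(exact), missing_one, len(missing_all))  # 2 1 None 0
```

Because find() can return None, chaining `.text` onto it raises an AttributeError when the tag is absent; checking the result first avoids that.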

1-3. Extracting Text

print(soup.find_all("p")[0].text)
print(soup.find_all("p")[0].string)
print(soup.find_all("p")[0].get_text())

>>>
        Happy PinkWink.
        PinkWink
None
        Happy PinkWink.
        PinkWink

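.string returns a value only when the tag has a single child (one piece of text, or one child tag that itself holds a single string). If the tag contains more than one child, .string is None.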
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

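# Case 1: the p tag contains text plus a child a tag
html = """
<p class="inner-text first-item" id="first">
Happy PinkWink.
<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")  # parse this snippet so the prints below use it

print(f'text: {soup.find("p").text}')
print(f'string: {soup.find("p").string}')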

>>>
text: 
Happy PinkWink.
PinkWink

string: None

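# Case 2: the p tag contains only text
html = """
<p class="inner-text first-item" id="first">Happy PinkWink.</p>
"""
soup = BeautifulSoup(html, "html.parser")  # parse this snippet so the prints below use it

print(f'text: {soup.find("p").text}')
print(f'string: {soup.find("p").string}')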

>>>
text: Happy PinkWink.
string: Happy PinkWink.
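
Two related cases may be worth checking alongside the ones above: a tag whose only child is another tag, and the whitespace options of get_text(). The sketch below uses a hypothetical inline snippet, not the lecture file.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration
html = """
<p id="a"><b>Only child</b></p>
<p id="b">
  Happy PinkWink.
  <a href="http://www.pinkwink.kr">PinkWink</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

nested = soup.find("p", id="a")
mixed = soup.find("p", id="b")

# The single child <b> holds a single string, so .string passes it through
print(nested.string)  # Only child

# Text plus an <a> tag means several children, so .string is None
print(mixed.string)   # None

# get_text(separator, strip=True) strips each piece and joins them
cleaned = mixed.get_text(" ", strip=True)
print(cleaned)        # Happy PinkWink. PinkWink
```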

1-4. Extracting Links

# Find the a tags
links = soup.find_all("a")

# Extract the link (the value of the href attribute of an a tag)
# The two lines below return the same result
links[0].get("href")
links[0]["href"]
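
The two forms differ only when the attribute is missing. A small sketch on a hypothetical snippet (the second a tag deliberately has no href):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link has no href attribute
html = '<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a><a id="no-link">No href</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

with_href = links[0].get("href")      # 'http://www.pinkwink.kr'
without_href = links[1].get("href")   # None, no exception
fallback = links[1].get("href", "")   # a default value can be supplied

try:
    links[1]["href"]                  # dictionary-style access raises KeyError
    raised = False
except KeyError:
    raised = True

print(with_href, without_href, repr(fallback), raised)
```

When scraping pages where some links may lack an href, get() is usually the safer choice.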

1-5. Using Loops

The list returned by find_all() can be passed to a for loop to handle the elements one at a time.

for each_tag in soup.find_all("p"):
    print(each_tag.text)

links = soup.find_all("a")
for each in links:
    href = each.get("href")
    text = each.get_text()
    print(f'{text} ➡ {href}')
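
For later analysis it is often handier to collect the results than to print them. The following sketch, on hypothetical HTML, gathers each link into a list of dictionaries and uses `href=True` to skip a tags that have no href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the last a tag has no href
html = """
<ul>
  <li><a href="http://www.pinkwink.kr">PinkWink</a></li>
  <li><a href="https://www.python.org">Python</a></li>
  <li><a>No link</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the a tags that actually have an href attribute
rows = [
    {"text": a.get_text(strip=True), "href": a["href"]}
    for a in soup.find_all("a", href=True)
]

for row in rows:
    print(f'{row["text"]} ➡ {row["href"]}')
```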

2. Examples: Using Chrome Developer Tools

2-1. Naver Finance - USD Exchange Rate

from urllib.request import urlopen

url = "https://finance.naver.com/marketindex/"
page = urlopen(url)  # this variable is often named response, res, ...
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

# The exchange rate is in span tags with class="value".
# The three lines below return the same result
soup.find_all("span", "value")
soup.find_all("span", class_="value")
soup.find_all("span", {"class": "value"})

# Check the number of matches to confirm the right elements were found
len(soup.find_all("span", "value"))

🍳 Learn more
https://developer.mozilla.org/ko/docs/Web/HTTP/Status

# Usually named response, res, ...
response = urlopen(url)

response         # <http.client.HTTPResponse at 0x276e2c045e0>
response.status  # 200 ➡ the request succeeded and a response was received (HTTP status code)
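
As a rough sketch of working with status codes, the standard library's `http.HTTPStatus` can turn a code into its reason phrase, and `urllib.error` covers the failure cases. The helper names `describe_status` and `fetch_status` are made up for this example; `fetch_status` is defined but not called here because it needs network access.

```python
from http import HTTPStatus
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_status(code):
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    return f"{code} {HTTPStatus(code).phrase}"


def fetch_status(url, timeout=10):
    """Hypothetical helper: return the HTTP status code for url, or None if no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # urlopen raises this for 4xx/5xx responses
        return err.code
    except URLError:          # e.g. DNS failure or refused connection
        return None


for code in (200, 404, 500):
    print(describe_status(code))
```

HTTPError is a subclass of URLError, so it has to be caught first.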

2-2. Naver Finance - Posted Exchange Rates

✔ Difference between urllib.request and the requests module

import requests

url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
response	# <Response [200]>

soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
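
One practical difference is what gets handed to BeautifulSoup: urlopen() returns a file-like object that yields bytes, while requests exposes the decoded body as a string in response.text. The offline sketch below imitates both with hypothetical data (io.BytesIO stands in for the urlopen() response) and suggests that BeautifulSoup accepts either form.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical response body
raw_bytes = b"<html><body><span class='value'>1,300.50</span></body></html>"

# urlopen() style: a file-like object yielding bytes (BytesIO is a stand-in)
file_like = io.BytesIO(raw_bytes)
soup_from_file = BeautifulSoup(file_like, "html.parser")

# requests style: an already decoded str, like response.text
text = raw_bytes.decode("utf-8")
soup_from_text = BeautifulSoup(text, "html.parser")

value_from_file = soup_from_file.find("span", "value").text
value_from_text = soup_from_text.find("span", "value").text
print(value_from_file, value_from_text)
```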

✔ Difference between find_all and select

# select
exchangeList = soup.select("#exchangeList > li") 

title = exchangeList[0].select_one(".h_lst").text
exchange = exchangeList[0].select_one(".value").text
change = exchangeList[0].select_one(".change").text

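# ⭐ class="head_info point_up": the space separates two class names, so the selector chains them as .head_info.point_up (it matches only tags that carry both)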
updown = exchangeList[0].select_one(".head_info.point_up > .blind").text

baseUrl = "https://finance.naver.com"
link = baseUrl + exchangeList[0].select_one("a").get("href")

# find
findMethod = soup.find_all("li", class_="on")

title = findMethod[0].find("h3", "h_lst").text
exchange = findMethod[0].find("span", "value").text
change = findMethod[0].find("span", "change").text
updown = findMethod[0].find_all("span", "blind")[2].text
link = baseUrl + findMethod[0].find("a").get("href")  # prepend baseUrl, as in the select version
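
To compare the two styles without hitting the live page, here is a sketch on a hypothetical, heavily simplified stand-in for the exchange list (the hrefs, values, and the point_dn class are illustrative assumptions). It also shows why a selector that requires both classes can come back empty for some items.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the real page has more elements
html = """
<ul id="exchangeList">
  <li class="on">
    <a href="/usd"><h3 class="h_lst">USD</h3></a>
    <div class="head_info point_up">
      <span class="value">1,300.50</span>
      <span class="blind">up</span>
    </div>
  </li>
  <li>
    <a href="/jpy"><h3 class="h_lst">JPY</h3></a>
    <div class="head_info point_dn">
      <span class="value">900.10</span>
      <span class="blind">down</span>
    </div>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select(): one CSS selector expresses "li elements directly under #exchangeList"
items = soup.select("#exchangeList > li")

# find()/find_all(): the same search written as chained calls
items_via_find = soup.find("ul", id="exchangeList").find_all("li", recursive=False)

titles_select = [li.select_one(".h_lst").text for li in items]
titles_find = [li.find("h3", "h_lst").text for li in items_via_find]

# A selector that requires both classes misses the second item (None)
up_only = [li.select_one(".head_info.point_up > .blind") for li in items]

# Requiring only .head_info matches both items
directions = [li.select_one(".head_info > .blind").text for li in items]

print(titles_select, titles_find)
print([tag is None for tag in up_only])
print(directions)
```

select() tends to be more compact for nested conditions, while find()/find_all() reads step by step; both return the same tag objects.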

The same steps can be saved as a Python file and run as a standalone script.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
exchangeList = soup.select("#exchangeList > li")

exchange_datas = []
baseUrl = "https://finance.naver.com"

for item in exchangeList:
    data = {
        "title": item.select_one(".h_lst").text,
        "exchange": item.select_one(".value").text,
        "change": item.select_one(".change").text,
        # Match on .head_info alone: requiring .point_up as well returns None
        # for any item that does not carry that class
        "updown": item.select_one(".head_info > .blind").text,
        "link": baseUrl + item.select_one("a").get("href"),
    }
    exchange_datas.append(data)

df = pd.DataFrame(exchange_datas)
df.to_excel("./naverfinance.xlsx")  # needs an Excel writer engine such as openpyxl
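
The scraped values are strings such as "1,300.50", so they need converting before any numeric analysis. A minimal sketch, assuming rows shaped like the ones built by the script above; the sample values and the `to_float` helper are hypothetical.

```python
def to_float(text):
    """Convert a scraped number such as '1,300.50' to a float."""
    return float(text.replace(",", "").strip())


# Hypothetical rows in the shape produced by the script above
exchange_datas = [
    {"title": "USD", "exchange": "1,300.50", "change": " 5.00"},
    {"title": "JPY", "exchange": "900.10", "change": "1.20"},
]

for row in exchange_datas:
    row["exchange"] = to_float(row["exchange"])
    row["change"] = to_float(row["change"])

# Numeric values can now be compared or aggregated
highest = max(exchange_datas, key=lambda row: row["exchange"])
print(highest["title"], highest["exchange"])  # USD 1300.5
```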

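2-3. Fetching Information from a Wikipedia Article

✔ Non-ASCII characters in a web address are UTF-8 encoded and then percent-encoded.

https://ko.wikipedia.org/wiki/여명의_눈동자
➡
# When copied from the address bar, the Korean title appears percent-encoded
https://ko.wikipedia.org/wiki/%EC%97%AC%EB%AA%85%EC%9D%98_%EB%88%88%EB%8F%99%EC%9E%90

The quote function in the parse module of the urllib library performs this encoding.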

from urllib.parse import quote  # import the function explicitly
from urllib.request import urlopen, Request

html = "https://ko.wikipedia.org/wiki/{search_words}"
req = Request(html.format(search_words=quote("여명의_눈동자")))
response = urlopen(req)
soup = BeautifulSoup(response, "html.parser")
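
The encoding step can be checked offline. This short sketch encodes the article title with quote(), compares it to the percent-encoded address shown earlier, and decodes it back with unquote().

```python
from urllib.parse import quote, unquote

title = "여명의_눈동자"

# quote() encodes the text as UTF-8 and percent-encodes each byte;
# letters, digits, and the characters _ . - ~ are left as they are
encoded = quote(title)
print(encoded)  # %EC%97%AC%EB%AA%85%EC%9D%98_%EB%88%88%EB%8F%99%EC%9E%90

# unquote() reverses the encoding
decoded = unquote(encoded)
print(decoded == title)  # True

url = "https://ko.wikipedia.org/wiki/{search_words}".format(search_words=encoded)
print(url)
```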

Use a loop to locate the ul that holds the desired content.
Unlike in the lecture, it was at index 35 when these notes were written.

# Print every ul with its index to see which one holds the cast list
for n, each in enumerate(soup.find_all("ul")):
    print("****" + str(n) + "****")
    print(each.get_text())

soup.find_all("ul")[35].text.strip().replace("\n", "").replace("\xa0", "")

>>>
'채시라: 윤여옥 역 (아역: 김민정)박상원: 장하림(하리모토 나츠오) 역 (아역: 김태진)최재성: 최대치(사카이) 역 (아역: 장덕수)'
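
Since the index differed from the lecture, a hard-coded position is fragile. One possible alternative, sketched here on a hypothetical stand-in for the article, is to pick the ul by a keyword it is expected to contain (the keyword choice is an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article; the real page has many more lists
html = """
<ul><li>Menu</li><li>Search</li></ul>
<ul><li>채시라: 윤여옥 역</li><li>박상원: 장하림 역</li><li>최재성: 최대치 역</li></ul>
<ul><li>Footer</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

keyword = "채시라"  # assumed to appear only in the cast list

# Keep (index, tag) pairs for every ul whose text contains the keyword
matches = [
    (i, ul)
    for i, ul in enumerate(soup.find_all("ul"))
    if keyword in ul.get_text()
]

index, cast_list = matches[0]
cast = [li.get_text(strip=True) for li in cast_list.find_all("li")]
print(index, cast)
```

If the keyword can appear in several lists, `matches` holds all of them and the right one still has to be chosen by inspection.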