🖥️[Python] 7-1-1. Web Crawling (requests library, beautifulsoup4 library)

thisk336 · June 12, 2023

Python


requests library

  • requests is a library for making HTTP requests in Python, and it is one of the most downloaded Python packages today.
# Install the requests package
!pip install requests

Functions in the requests package

# Use the function named after the HTTP method you want to send.
# Each function takes the target URL as its first argument (url is a placeholder here).
response = requests.get(url)     # GET request
response = requests.post(url)    # POST request
response = requests.put(url)     # PUT request
response = requests.delete(url)  # DELETE request
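To see how the method-named functions relate to the request that is actually sent, here is a minimal sketch that builds requests without sending them. The endpoint `https://example.com/items`, the helper `prepare`, and the parameter values are assumptions made up for illustration; `requests.Request(...).prepare()` is used only so the result can be inspected offline.

```python
import requests

# Hypothetical endpoint, used only for illustration.
URL = "https://example.com/items"

def prepare(method, **kwargs):
    """Build (but do not send) a request so its method and URL can be inspected."""
    return requests.Request(method, URL, **kwargs).prepare()

# Query parameters for GET are appended to the URL.
get_req = prepare("GET", params={"page": 1})
# Form data for POST is placed in the request body.
post_req = prepare("POST", data={"name": "kospi"})

print(get_req.method, get_req.url)     # GET https://example.com/items?page=1
print(post_req.method, post_req.body)  # POST name=kospi
```

To actually send the same requests, the equivalent calls would be `requests.get(URL, params={"page": 1})` and `requests.post(URL, data={"name": "kospi"})`.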

beautifulsoup4 package

  • beautifulsoup4 (bs4) is a library that converts HTML source into a parse tree, which makes the tag hierarchy easy to follow.
  • With bs4, you can easily extract the information you want from HTML source.
  • The find and find_all functions look up elements by tag and attributes: find returns the first match, and find_all returns every match.
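The difference between find and find_all can be sketched on a small inline document. The HTML below is made-up sample data: the tag layout, class names, and values are placeholders and do not reflect the structure of any real page. The built-in "html.parser" is used here so the sketch needs no extra parser package.

```python
import bs4

# Made-up sample HTML; tags, class names, and values are placeholders.
html = """
<table>
  <tr><td class="date">2023.01.02</td><td class="price">100.00</td></tr>
  <tr><td class="date">2023.01.03</td><td class="price">101.50</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find returns only the first matching tag (or None if nothing matches).
first_date = soup.find("td", class_="date")
# find_all returns a list of every matching tag.
all_dates = soup.find_all("td", class_="date")
# No <td class="volume"> exists in the sample, so this is None.
missing = soup.find("td", class_="volume")

print(first_date.text)                # 2023.01.02
print([td.text for td in all_dates])  # ['2023.01.02', '2023.01.03']
print(missing)                        # None
```

Because find can return None, it is worth checking the result before reading `.text` from it.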
# Fetch the page source.
import requests

page_no = 1
page_url = f"https://finance.naver.com/sise/sise_index_day.naver?code=KPI200&page={page_no}"

source = requests.get(page_url).text
source

The fetched source looks like this:

'<html lang="ko">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">\n<title>네이버 증권</title>\n\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/common.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/layout.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/main.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock2.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock3.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/world.css">\n</head>

Converting this into a parse tree with the bs4 library gives the following:

# Import beautifulsoup4.
import bs4

# Parse the fetched HTML source with the "lxml" parser (requires the lxml package).
source = bs4.BeautifulSoup(source, "lxml")

# prettify() prints the HTML source "prettily", indenting each tag by its nesting depth.
print(source.prettify())
<html lang="ko">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   네이버 증권
  </title>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/common.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/layout.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/main.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock2.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock3.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/world.css" rel="stylesheet" type="text/css"/>
 </head>
  • As this output shows, the parse tree makes it much easier to find the information needed for web crawling.
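As a small illustration of reading values back out of the parse tree, the sketch below pulls the page title and the stylesheet URLs from a shortened copy of the head shown above. The variable names are assumptions for illustration, and the built-in "html.parser" is swapped in so the snippet runs without the lxml package.

```python
import bs4

# A shortened copy of the <head> shown in the output above.
html = """
<html lang="ko">
 <head>
  <title>네이버 증권</title>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/newstock.css" rel="stylesheet" type="text/css"/>
  <link href="https://ssl.pstatic.net/imgstock/static.pc/20230519195543/css/common.css" rel="stylesheet" type="text/css"/>
 </head>
</html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# .text gives the text inside a tag; strip() removes surrounding whitespace.
page_title = soup.find("title").text.strip()
# Tag attributes are read with dictionary-style access.
stylesheets = [link["href"] for link in soup.find_all("link")]

print(page_title)
for href in stylesheets:
    print(href)
```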
