๐Ÿ–ฅ๏ธ[Python] 7-1-3. ์›นํฌ๋กค๋ง (์ฃผ์‹ ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ)

thisk336 · June 12, 2023

Source: Naver Finance, KOSPI 200

Fetching KOSPI 200 Price Data

  • Using the data extraction method introduced earlier, let's write code that extracts the KOSPI 200 closing prices and their dates from the first page through the last page.
  • The linked page builds its data table in a separate source frame and displays that frame inside the page, so it is easier to crawl the frame's own source URL directly.

Extracting the Date Data

import bs4
import requests

# ํ•ด๋‹น ํ”„๋ ˆ์ž„์˜ url์„ ๋ณต์‚ฌํ•˜์—ฌ page_url์— ์ €์žฅํ•˜๊ณ  f-string์„ ํ™œ์šฉํ•ด page_no์„ formatํ•œ๋‹ค.
page_no = 1
page_url = f"https://finance.naver.com/sise/sise_index_day.naver?code=KPI200&page={page_no}"

# Request the URL and store the response body as text in source.
source = requests.get(page_url).text

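# Parse source with the bs4 library so its tags are easy to search.
# The parser is named explicitly to avoid the "no parser was explicitly specified" warning.
source = bs4.BeautifulSoup(source, "html.parser")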

# Find the td cells whose class holds the date data, extract only the date text, and store it in date_list.
dates = source.find_all('td', class_="date")
date_list = []

for date in dates:
    date_list.append(date.text)
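
The strings collected in date_list are plain text, so sorting or filtering by date is easier after converting them to date objects. The following is a minimal sketch, assuming the cells look like "2023.06.12" and may carry stray whitespace; the helper name parse_dates, the format string, and the sample values are illustrative assumptions, not part of the original code.

```python
from datetime import date, datetime


def parse_dates(raw_dates, fmt="%Y.%m.%d"):
    """Convert scraped date strings to date objects, skipping blank cells."""
    parsed = []
    for raw in raw_dates:
        text = raw.strip()  # remove surrounding spaces and newlines
        if not text:
            continue  # ignore empty cells
        parsed.append(datetime.strptime(text, fmt).date())
    return parsed


# Made-up sample strings standing in for date_list.
sample_dates = ["2023.06.12", " 2023.06.09\n", ""]
parsed_dates = parse_dates(sample_dates)
print(parsed_dates)
```

If the page uses a different date format, only the fmt argument needs to change.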

Extracting the Closing Price Data

# ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ์ฒด๊ฒฐ๊ฐ€ ๋ฐ์ดํ„ฐ๊ฐ€ ์ €์žฅ๋˜์–ด ์žˆ๋Š” td class๋ฅผ ์ฐพ์•„์„œ ์ฒด๊ฒฐ๊ฐ€ ๋ฐ์ดํ„ฐ๋งŒ ์ถ”์ถœํ•˜์—ฌ price_list์— ์ €์žฅํ•œ๋‹ค.
prices = source.find_all('td', class_="number_1")
price_list = []

# The change, change rate, volume, and trading value columns share the same tag, so use slicing (every fourth match) to keep only the closing price.
for price in prices[::4]:
    price_list.append(price.text)

price_list
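
The values in price_list are still strings, so they need to be converted before any arithmetic. Here is a small sketch, assuming the text may contain surrounding whitespace and thousands separators; the helper name to_float and the sample values are made up for illustration.

```python
def to_float(raw_price):
    """Strip whitespace and thousands separators, then convert to float."""
    return float(raw_price.strip().replace(",", ""))


# Made-up sample strings standing in for price_list.
sample_prices = ["344.52", " 1,234.50 "]
numeric_prices = [to_float(p) for p in sample_prices]
print(numeric_prices)
```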

Finding the Last Page Number

  • Repeating the single-page extraction above in a for loop, up to the last page, collects all of the price data.
  • To find the last page number, locate the tag below and extract the page number from it.
<td class="pgRR">
	<a href="/sise/sise_index_day.naver?code=KPI200&amp;page=719">맨뒤
		<img src="https://ssl.pstatic.net/static/n/cmn/bu_pgarRR.gif" width="8" height="5" alt="" border="0">
	</a>
</td>
  • The HTML source above shows that the last page is 719. The number alone can be extracted as follows.
# Find the td tag whose class is pgRR, then read the href attribute of the a tag beneath it.
last_url = source.find_all("td", class_="pgRR")[0].find_all("a")[0]["href"]

# Splitting '/sise/sise_index_day.naver?code=KPI200&page=719' on '&page=' and taking the last piece gives the last page number.
last_page = int(last_url.split('&page=')[-1])
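
Splitting on '&page=' works as long as page is not the first query parameter. As an alternative sketch, the standard library's urllib.parse can read the parameter by name; the helper name last_page_from_href is an assumption introduced here, and the href is the one shown in the HTML above.

```python
from urllib.parse import parse_qs, urlparse


def last_page_from_href(href):
    """Read the 'page' query parameter from a link and return it as an int."""
    query = parse_qs(urlparse(href).query)  # e.g. {'code': [...], 'page': [...]}
    return int(query["page"][0])


href = "/sise/sise_index_day.naver?code=KPI200&page=719"
print(last_page_from_href(href))  # 719
```

This version keeps working if the order of the query parameters changes.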

Putting It All Together

import requests
import bs4
import pandas as pd

# Initialize all variables.
date_list = []
price_list = []
page_no = 1

page_url = f"https://finance.naver.com/sise/sise_index_day.naver?code=KPI200&page={page_no}"
source = requests.get(page_url).text
source = bs4.BeautifulSoup(source, "html.parser")

# Find the last page number.
last_url = source.find_all('td', class_="pgRR")[0].find_all('a')[0]['href']
last_page = int(last_url.split('&page=')[-1])

# for๋ฌธ์„ ์ด์šฉํ•˜์—ฌ ์ฒซ ๋ฒˆ์งธ ํŽ˜์ด์ง€๋ถ€ํ„ฐ ๋งˆ์ง€๋ง‰ ํŽ˜์ด์ง€๊นŒ์ง€์˜ ๋ชจ๋“  ๋‚ ์งœ ๋ฐ์ดํ„ฐ์™€ ์ฒด๊ฒฐ๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•œ๋‹ค.
for page_no in range(1, last_page + 1) :
    page_url = f"https://finance.naver.com/sise/sise_index_day.naver?code=KPI200&page={page_no}"
    source = requests.get(page_url).text
    source = bs4.BeautifulSoup(source)
    dates = source.find_all('td', class_="date") # ๋‚ ์งœ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
    for date in dates :
        date_list.append(date.text)
    
    prices = source.find_all('td', class_="number_1") # ์ฒด๊ฒฐ๊ฐ€ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
    for price in prices[::4] :
        price_list.append(price.text)

# ์ถ”์ถœ๋œ ๋ฐ์ดํ„ฐ๋ฅผ Dataframe ํ˜•ํƒœ๋กœ ์ €์žฅํ•˜๊ณ  ๊ฒฐ์ธก์น˜๋Š” ์ œ๊ฑฐํ•œ๋‹ค.
df_kosdaq = pd.DataFrame({"date" : date_list,
                         "price" : price_list}).dropna()
# Dataframe์„ ์—‘์…€ํŒŒ์ผ๋กœ ์ €์žฅํ•œ๋‹ค.                         
df_kosdaq.to_excel("kpi200.xlsx", index = False)
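
One caveat with the step above: dropna() only removes NaN/None values, so a cell scraped as an empty or whitespace-only string would survive it, and pd.DataFrame raises an error if date_list and price_list end up with different lengths. The sketch below shows one way to check and clean the two lists before building the DataFrame; the helper name pair_rows and the sample values are hypothetical, and whether the page actually contains blank cells is an assumption.

```python
def pair_rows(dates, prices):
    """Pair each date with its price and drop rows where either value is blank."""
    if len(dates) != len(prices):
        raise ValueError(
            f"length mismatch: {len(dates)} dates, {len(prices)} prices"
        )
    rows = []
    for d, p in zip(dates, prices):
        d, p = d.strip(), p.strip()
        if d and p:  # keep only rows where both cells have text
            rows.append((d, p))
    return rows


# Made-up sample lists standing in for date_list and price_list.
rows = pair_rows(["2023.06.12", " ", "2023.06.09"], ["344.52", "", "343.10"])
print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=["date", "price"]).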

0๊ฐœ์˜ ๋Œ“๊ธ€