🖱️[Crawling] Fetching news titles and links

권규리 · May 24, 2023

01. Design plan ⚒️

So that the user can type a news keyword directly and pull out the information they want, the script shows a simple popup window using pyautogui. Rather than fetching a single page of news, it crawls several pages, so a second popup asks the user up to which page to crawl.

Using requests, beautifulsoup, and pyautogui, I designed the code so that the news titles and links are printed to the terminal based on what the user enters.


02. Code and syntax used

1. pyautogui

  • A library for automating the mouse and keyboard (macros)
  • Used here to show simple popup windows
import pyautogui

keyword = pyautogui.prompt("Enter a search keyword.")
lastpage = pyautogui.prompt("Enter the last page number.")
pagenum = 1  # start the page counter at page 1


First, the syntax of the for loop

  • for i in range(start, stop, step)
  • For example, for i in range(1, 10, 2) yields 1, 3, 5, 7, 9 (the stop value 10 is not included)
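To check the range syntax by hand, the values can be materialized with list(). This is only a small illustration; the variable names `odds` and `same` are made up for this sketch.

```python
# range(start, stop, step): start is included, stop is excluded
odds = list(range(1, 10, 2))
print(odds)  # [1, 3, 5, 7, 9]

# Raising stop to 11 changes nothing, because 11 itself is never produced
same = list(range(1, 11, 2))
print(same)  # [1, 3, 5, 7, 9]
```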

2. ํŽ˜์ด์ง€๊ฐ€ ๋ฐ˜๋ณต๋˜๋Š” ๋ฐ˜๋ณต๋ฌธ

  • ์‚ฌ์šฉ์ž์—๊ฒŒ ์ž…๋ ฅ๋ฐ›๋Š” ํŽ˜์ด์ง€ ๋ฒˆํ˜ธ์— ๋”ฐ๋ผ ์—ฌ๋Ÿฌ ํŽ˜์ด์ง€๋ฅผ ํ„ฐ๋ฏธ๋„์— ์ถœ๋ ฅํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ˜๋ณต๋ฌธ์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

  • ์˜ค๋กœ์ง€ "ํŽ˜์ด์ง€"๊ฐ€ ๋ฐ˜๋ณต๋˜๋Š” ๋ฐ˜๋ณต๋ฌธ

  • ํ„ฐ๋ฏธ๋„์—์„œ ๋ช‡ ํŽ˜์ด์ง€์ธ์ง€ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ, ๋ฐ˜๋ณต๋ฌธ ์•ˆ์— print(f"===={pagenum}ํŽ˜์ด์ง€์ž…๋‹ˆ๋‹ค====") ๊ตฌ๋ฌธ์„ ๋„ฃ์–ด๋’€๋‹ค.

import requests
from bs4 import BeautifulSoup

for i in range(1, int(lastpage) * 10, 10):  # i is the start value: 1, 11, 21, ...
    print(f"=============Page {pagenum}=============")
    response = requests.get(f"https://search.naver.com/search.naver?where=news&sm=tab_jum&query={keyword}&start={i}")
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('.news_tit')  # elements with the class news_tit
    pagenum = pagenum + 1

First, to decide the range of the loop: open the News tab on Naver and move between result pages, and the URL parameters include a key called start that reflects the page number. The screenshots show the value of start on each page.
(The path and some parameters were removed from the URLs to keep them readable.) In the first screenshot, page 1 has start = 1; in the second, page 2 has start = 11; in the third, page 3 has start = 21. So each time the page number goes up by one, the value of start goes up by 10.
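The pattern above can be written as start = (page - 1) * 10 + 1. The sketch below builds the search URL from that formula; `build_search_url` and `BASE_URL` are hypothetical names introduced here, and the where and sm parameters are simply copied from the URL used in this post. It uses urlencode so that keywords with spaces or Korean characters are escaped properly.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.naver.com/search.naver"


def build_search_url(keyword, page):
    """Build the news search URL for a 1-based page number (hypothetical helper)."""
    start = (page - 1) * 10 + 1  # page 1 -> 1, page 2 -> 11, page 3 -> 21
    params = {"where": "news", "sm": "tab_jum", "query": keyword, "start": start}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("python", 3)
print(url)
```

With keyword "python" and page 3, the URL ends in query=python&start=21.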

for i in range(1,int(lastpage)*10,10): the loop starts at 1 and stops before int(lastpage)*10, that is, the page number the user entered multiplied by 10, which is itself excluded. Within that range it moves in steps of 10. 🔎 Like this: 1, 11, 21, 31, 41, ..., (int(lastpage)-1)*10+1
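A quick way to convince yourself that this range visits exactly lastpage pages is to list its values. `start_values` is a made-up helper name for this check; it takes the page number as a string, the same way the popup returns it.

```python
def start_values(lastpage):
    """Return the start values the page loop iterates over (hypothetical helper)."""
    return list(range(1, int(lastpage) * 10, 10))


print(start_values("1"))  # [1]
print(start_values("3"))  # [1, 11, 21]
print(start_values("5"))  # [1, 11, 21, 31, 41]
```

The stop value is never produced, so for 3 pages the last request uses start = 21, not 30.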

🤔 Why is int used in int(lastpage)?

  • pyautogui returns what the user typed for lastpage as a string, so it is converted to a number.
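To see why the conversion matters, note that multiplying a string repeats it instead of doing arithmetic. The sketch below also guards against two cases the post does not handle: the prompt returning None when the dialog is cancelled, and non-numeric input. `parse_last_page` is a hypothetical helper, not part of pyautogui.

```python
def parse_last_page(raw):
    """Turn the prompt's return value into a positive int (hypothetical helper)."""
    if raw is None:  # the prompt returns None when the user cancels
        raise ValueError("no page number entered")
    page = int(raw.strip())  # int() raises ValueError for text like "abc"
    if page < 1:
        raise ValueError("page number must be at least 1")
    return page


print("3" * 10)                   # 3333333333 (string repetition)
print(parse_last_page("3") * 10)  # 30
```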

3. The loop over the news list

  • A page does not hold a single news article; it holds several in a list, and this loop walks through them.
  • Only the title and link of each news item are extracted.
  • This loop is placed inside the page loop.
    for link in links:  # runs inside the page loop
        title = link.text  # get the text inside the tag
        url = link.attrs['href']  # get the value of the href attribute
        print(title, url)

In the terminal, the news titles and URLs are printed as a list, page by page.
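The extraction step can be tried without a network request by running it on a small HTML string. The markup in `SAMPLE_HTML` is invented for this sketch and only assumes that each headline is an a tag with the class news_tit; the real Naver page may be structured differently. `extract_titles_and_links` is likewise a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not copied from the real search page
SAMPLE_HTML = """
<ul>
  <li><a class="news_tit" href="https://example.com/a">First headline</a></li>
  <li><a class="news_tit" href="https://example.com/b">Second headline</a></li>
</ul>
"""


def extract_titles_and_links(html):
    """Return (title, url) pairs for every element with the class news_tit."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.text, a.attrs["href"]) for a in soup.select(".news_tit")]


results = extract_titles_and_links(SAMPLE_HTML)
for title, url in results:
    print(title, url)
```

Returning a list of pairs instead of printing directly makes the step easier to check and reuse, for example to save the results to a file later.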
