TIL | More Python & Building a Web Scraper #1

vel.Ash · April 12, 2022

TIL python web scraper

Python

-A really beautiful programming language! Easy to understand even for beginners
-Where Java focuses more on the web, Python is used across more fields, such as data science and machine learning

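Theory

Python rules

A string is written with '' or "". Example: a = "like this"
Snake Case
: When a variable name has to be long, use _ where the spaces would go. An unwritten convention
(write everything in lowercase and put a _ at every word break)
False, True, and the like are not treated as strings. Example: c = False (the first letter must be capitalized)

💡 If you write c = "False", it is read as just the string "False"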

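Variable types in practice

a_string = "Like this"
a_number = 3
a_float = 3.12
a_boolean = False
a_none = None (the name None is specific to Python; other languages use null or nil)
float: a number with a decimal point; boolean: true or false
variables: where information goes and data is stored (the left side of the equals sign)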
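To see how Python tells these values apart, the built-in type() can be printed for each one. This is only a small sketch; the name quoted_false is made up here to contrast with a_boolean.

```python
a_string = "Like this"
a_number = 3
a_float = 3.12
a_boolean = False
a_none = None
quoted_false = "False"  # hypothetical name: the quotes make this a str, not a bool

# Print each value next to the name of its type
for value in (a_string, a_number, a_float, a_boolean, a_none, quoted_false):
    print(repr(value), type(value).__name__)

# Any non-empty string is truthy, so the string "False" counts as true
print(bool(quoted_false))
```

The last line is the practical reason not to put quotes around False.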

Terms

int (integer)
bool (boolean)
str (string)

Sequence type

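: Something that holds items in order, like a list; list is one of the sequence types

list []
: Square brackets, with the items separated by commas
ex. days = ["Mon","Tue","Wed","Thu","Fri","Sat"]
tuple ()
: immutable → a sequence that cannot be changed
dictionary {}
: key-value pairs; strictly a mapping type, not a sequence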

🤔 dictionary example

nico = {"name": "Nico","age": 29,"korean": True,
"fav_food": ["Kimchi", "Sashimi"]}

print(nico["fav_food"])
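The difference between the three containers shows up as soon as you try to change them. A minimal sketch follows; the replaced and added values ("Monday", "city", "Seoul") are made up for illustration.

```python
days_list = ["Mon", "Tue", "Wed"]
days_list.append("Thu")      # a list can grow
days_list[0] = "Monday"      # and its items can be replaced

days_tuple = ("Mon", "Tue", "Wed")
try:
    days_tuple[0] = "Monday"  # a tuple does not allow item assignment
except TypeError as error:
    tuple_error = str(error)

nico = {"name": "Nico", "age": 29, "fav_food": ["Kimchi", "Sashimi"]}
nico["city"] = "Seoul"        # hypothetical key: a dictionary accepts new keys

print(days_list)
print(tuple_error)
print(nico["city"], nico["fav_food"][0])
```

A tuple fits values that should never change, while a list fits values you keep adding to.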

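Functions (Built-in function)

It is easier to think of a function as being defined rather than made. A function definition starts with def (short for define).
The body of the function must be indented with a tab or spaces. A line without indentation cannot be part of the function body.

Examples

print(len("lalsmfkdfljslfjsdlkkfjlsdf"))
: prints the length of the string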

🤔 Example of defining a default value

def plus(a, b=0):
  print(a + b)

# parameters with a default value must come after those without one
def minus(a, b=0):
  print(a - b)

plus(2)
minus(2)

→ If no argument is passed for a parameter, its default value is used.
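One rule is easy to trip over here: a parameter without a default cannot come after one that has a default. The sketch below shows a default being used, then uses compile() to observe the SyntaxError without stopping the script; bad_source and rejected are names made up for this example.

```python
def plus(a, b=0):
    # b falls back to 0 when the caller leaves it out
    return a + b

print(plus(2))     # 2
print(plus(2, 3))  # 5

# "def minus(a=0, b)" puts a required parameter after a default one,
# which Python rejects before the function can even be defined.
bad_source = "def minus(a=0, b):\n    return a - b\n"
try:
    compile(bad_source, "<example>", "exec")
    rejected = False
except SyntaxError:
    rejected = True

print(rejected)    # True
```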

🤔 Example of using return

def plus(a, b):
   return a + b

result = plus(2,4)
print(result)

Caution

Statements placed below a return inside a function never run; the function ends right at the return. Be careful not to write statements after return.
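A quick way to confirm this is to record what runs. In this sketch the list calls is a made-up helper that logs each step.

```python
calls = []  # hypothetical log of which lines actually ran

def plus(a, b):
    calls.append("before return")
    return a + b
    calls.append("after return")  # unreachable: the function has already ended

result = plus(2, 4)
print(result)  # 6
print(calls)   # ['before return']
```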

🤔 Example of putting variables into a string

def say_hello(name, age):
 return f"Hello {name} you are {age} years old"

hello = say_hello("nico", "12")
print(hello)
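The f prefix is what makes the {name} and {age} placeholders work. As a small sketch of two things worth knowing: the values do not have to be strings, and the braces can hold an expression. The variable names hello_int and hello_str are made up here.

```python
def say_hello(name, age):
    return f"Hello {name} you are {age} years old"

# The value is formatted to text, so the int 12 and the str "12" read the same
hello_int = say_hello("nico", 12)
hello_str = say_hello("nico", "12")
print(hello_int == hello_str)  # True

# An expression inside the braces is evaluated first
print(f"next year: {12 + 1}")  # next year: 13
```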

Ways to pass argument values

Keyword argument (the good way 💜)
: Pairs each value with its parameter name. ex. b=30, a=1
positional argument
: Arguments are assigned in order (depends on position).
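The two styles can be compared on one function. A minimal sketch, reusing the b=30, a=1 values from above on a minus function defined just for this example:

```python
def minus(a, b):
    return a - b

# Positional: the first value goes to a, the second to b
print(minus(30, 1))      # 29

# Keyword: each value is tied to a name, so the order no longer matters
print(minus(b=30, a=1))  # -29
print(minus(a=1, b=30))  # -29
```

With keyword arguments, swapping the order of the values cannot silently change the result.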

<Reference link>
https://docs.python.org/3/library/index.html

if statement

<Example 1>

def plus(a, b):
	if type(b) is str:
		return None
	else:
		return a + b

# is -> object identity
# is not -> negated object identity

def plus(a, b):
	if type(b) is int or type(b) is float:
		return a + b
	else:
		return None

print(plus(12, 1.2))
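A common alternative to comparing type(b) with is uses the built-in isinstance, which accepts a tuple of types. This is a sketch of that alternative, not the approach used above. One difference to keep in mind: bool is a subclass of int, so isinstance(True, int) is True.

```python
def plus(a, b):
    # isinstance checks against several types at once
    if isinstance(b, (int, float)):
        return a + b
    return None

print(plus(12, 1.2))
print(plus(12, "10"))  # None

# bool is a subclass of int, so True passes the check and counts as 1
print(plus(12, True))  # 13
```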

for statement

-Works on a string, tuple, list, or any other iterable object

days = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri')

for day in days:
	if day == 'Wed':
		break
	else:
		print(day)
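The same loop works on any iterable, including a string. In this sketch the lists printed and letters are made up to collect what each loop visits; values are compared with ==, since is checks object identity rather than equality.

```python
days = ("Mon", "Tue", "Wed", "Thu", "Fri")

printed = []
for day in days:
    if day == "Wed":  # == compares values
        break         # leave the loop entirely
    printed.append(day)
print(printed)  # ['Mon', 'Tue']

# A string is iterable too: the loop visits one character at a time
letters = []
for letter in "Wed":
    letters.append(letter)
print(letters)  # ['W', 'e', 'd']
```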

Building a web scraper

requests package

: A collection of tools for making HTTP requests in Python

the power of Requests

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}

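Beautiful Soup 4

-A really useful library for extracting information from HTML. It groups the page's data by kind, so the data you want is easy to pull out.
-https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (see the official documentation)

1. Extract the HTML with the requests package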

import requests

# Request the HTML page with the requests library -> the response, including the HTML, is stored in indeed_resul
indeed_resul = requests.get('https://kr.indeed.com/jobs?q=python&l=%EC%84%9C%EC%9A%B8&vjk=1015284880e2ff62')

print(indeed_resul)
print(indeed_resul.text)  # Example 1. when you want all of the HTML text

<Response [200]>   # means OK
# the entire text is printed

2. Extract the pagination with Beautiful Soup

-First check which part of the page's HTML holds the pagination

import requests
from bs4 import BeautifulSoup    # import the BeautifulSoup library

indeed_resul = requests.get('https://kr.indeed.com/jobs?q=python&l=%EC%84%9C%EC%9A%B8&vjk=1015284880e2ff62')

# Parse the HTML with the beautifulsoup4 library
# soup = BeautifulSoup(html_doc, 'html.parser')
indeed_soup = BeautifulSoup(indeed_resul.text, "html.parser")

# After parsing the HTML, get the ul tag
# The find method searches for a tag -> it finds a single tag
pagination = indeed_soup.find("ul", {"class":"pagination-list"})

# find_all finds every tag that matches the condition and returns them as a list
links = pagination.find_all('a')

pages = []

# To find the span that is a child of each a
# links is already a list, so loop over it with for
for link in links:
  pages.append(link.find("span"))
pages = pages[:-1]

print(pages)

[<span class="pn">2</span>, <span class="pn">3</span>, <span class="pn">4</span>, <span class="pn">5</span>]

3. Find the last pagination value

import requests
from bs4 import BeautifulSoup

indeed_resul = requests.get('https://kr.indeed.com/jobs?q=python&l=%EC%84%9C%EC%9A%B8&vjk=1015284880e2ff62')

indeed_soup = BeautifulSoup(indeed_resul.text, "html.parser")

pagination = indeed_soup.find("ul", {"class":"pagination-list"})

links = pagination.find_all('a')

pages = []

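# Example 1: read the number from the span inside each a
# for link in links[:-1]:
#   pages.append(int(link.find("span").string))

# Example 2: read the string straight from each a
# for link in links[:-1]:
#   pages.append(int(link.string))

# Examples 1 and 2 give the same values -> use the simpler example 2

# Find the last page value
for link in links[:-1]:
  pages.append(int(link.string))

max_page = pages[-1]

print(pages)
print(max_page)

[2, 3, 4, 5]
5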
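The slice links[:-1] assumes the last link is always the non-numeric "next" link. A slightly more defensive sketch is shown below. It uses a hypothetical helper, last_page, that takes the link texts as plain values so it runs without a network request; returning 1 when no number is found is an assumption made here, not something from the page.

```python
def last_page(link_texts):
    """Return the largest page number among the link texts.

    Texts that are missing (None) or not numeric are skipped.
    Falls back to 1 when nothing numeric is found (an assumption).
    """
    numbers = []
    for text in link_texts:
        if text is not None and text.strip().isdigit():
            numbers.append(int(text))
    return max(numbers) if numbers else 1

# Sample values standing in for [link.string for link in links]
print(last_page(["2", "3", "4", "5", None]))  # 5
print(last_page([]))                          # 1
```

Taking the maximum instead of the last item also avoids depending on the order of the links.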

4. Request every page and split the code into modules

<main.py>

from indeed import extract_indeed_pages, extract_indeed_jobs

last_indeed_page = extract_indeed_pages()

extract_indeed_jobs(last_indeed_page)

<indeed.py>

import requests
from bs4 import BeautifulSoup

LIMIT = 50 
URL = f"https://kr.indeed.com/%EC%B7%A8%EC%97%85?as_and=python&as_phr&as_any&as_not&as_ttl&as_cmp&jt=all&st&salary&radius=25&l=%EC%84%9C%EC%9A%B8&fromage=any&limit={LIMIT}"

def extract_indeed_pages():
  resul = requests.get(URL)
  soup = BeautifulSoup(resul.text, "html.parser")
  pagination = soup.find("ul", {"class":"pagination-list"})

  links = pagination.find_all('a')
  pages = []
  for link in links[:-1]:
    pages.append(int(link.string))
  
  max_page = pages[-1]
  return max_page

def extract_indeed_jobs(last_page):
  for page in range(last_page):
    result = requests.get(f"{URL}&start={page * LIMIT}")
    print(result.status_code)
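The start value in each request is just the page index multiplied by LIMIT. To check that arithmetic without sending any requests, the URLs can be built on their own first. This is a sketch: build_page_urls is a hypothetical helper, and the example.com address is a placeholder rather than the real Indeed URL.

```python
LIMIT = 50

def build_page_urls(base_url, last_page, limit=LIMIT):
    """Return one URL per page, each offset by page index * limit."""
    return [f"{base_url}&start={page * limit}" for page in range(last_page)]

# Placeholder base URL, used only to show the offsets
urls = build_page_urls("https://example.com/jobs?q=python", 3)
for url in urls:
    print(url)
```

With last_page = 3 the offsets are 0, 50, and 100, so range(last_page) starts at the first page of results.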
