Sparta Coding Club [Web Development] Week 3 (3)

zerovodka · April 2, 2022


Goals for this week:

1. Know the basic Python syntax.
2. Be able to crawl any page you want.
3. Be able to control mongoDB through pymongo.

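Why do we use a DB?

To store data well
and to use it well

Behind the scenes, a DB keeps its data organized in an order called an index, which is what lets it find things quickly.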
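
As a rough analogy for what an index buys you, the sketch below compares scanning every record with looking a record up through a prebuilt dictionary. The records and the helper names (`find_by_scan`, `build_index`) are made up for illustration, and real databases use more elaborate index structures than a plain dict.

```python
# Made-up sample records, only for illustration.
records = [
    {"name": "bobby", "age": 21},
    {"name": "john", "age": 27},
    {"name": "ann", "age": 33},
]


def find_by_scan(records, name):
    """Check every record one by one until the name matches."""
    for record in records:
        if record["name"] == name:
            return record
    return None


def build_index(records, key):
    """Build a lookup table from a field value to its record."""
    return {record[key]: record for record in records}


# The index is built once, then each lookup goes straight to the record.
name_index = build_index(records, "name")

found_by_scan = find_by_scan(records, "john")
found_by_index = name_index.get("john")
print(found_by_scan == found_by_index)
```

Both lookups return the same record; the difference is that the scan gets slower as the data grows, while the indexed lookup does not have to walk the whole list.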

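<Types of DB>

(Figure / description source: Sparta Coding Club)

RDBMS(SQL)

It is similar to storing data in an Excel sheet whose rows and columns have a fixed shape.
With 500,000 rows already loaded, suddenly adding a column in the middle would be difficult.
However, because the data is so regular, it is easier to keep consistent and to analyze.

ex) MS-SQL, MySQL, etc.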
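
To make the "fixed shape" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module (SQLite is not one of the databases named above; it is used here only because it needs no installation). The table name `movies` and its columns are assumptions for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table's shape (its columns) is decided up front.
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER)")
conn.execute("INSERT INTO movies (title, year) VALUES (?, ?)", ("Movie A", 2020))

# A row with a field the table does not know about is rejected.
try:
    conn.execute(
        "INSERT INTO movies (title, year, star) VALUES (?, ?, ?)",
        ("Movie B", 2021, 9.1),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The schema has to be changed first; rows that already exist get NULL.
conn.execute("ALTER TABLE movies ADD COLUMN star REAL")
conn.execute(
    "INSERT INTO movies (title, year, star) VALUES (?, ?, ?)",
    ("Movie B", 2021, 9.1),
)
rows = conn.execute("SELECT title, year, star FROM movies ORDER BY year").fetchall()
conn.close()

print(rejected, rows)
```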

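No-SQL

A DB that stores data in dictionary form.
As a result, each piece of data does not need to have the same fields as the others.
This is an advantage when loading freely shaped data, but consistency may suffer.

ex) MongoDB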
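
The dictionary form can be pictured with plain Python dicts. In this small sketch the documents and their field names are invented for illustration; the point is only that each document can carry a different set of fields, so code reading them has to cope with missing ones.

```python
# Three made-up documents; no two have exactly the same fields.
users = [
    {"name": "bobby", "age": 21},
    {"name": "kay", "phone": "010-0000-0000"},
    {"name": "john", "age": 27, "hobbies": ["hiking"]},
]

# Collect every field name that appears in at least one document.
all_fields = sorted({field for doc in users for field in doc})

# A missing field must be handled explicitly; dict.get returns None here.
ages = [doc.get("age") for doc in users]

print(all_fields)
print(ages)
```

This is the flexibility and the cost in one picture: nothing stops "kay" from lacking an age, so every reader of the data has to decide what to do about it.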

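<So what is a DB, really?>

It is a program, just like the ones we use every day.
Just as you install a game or PowerPoint, you can install a DB.

These days, DBs like this are also offered as cloud services.
Let's use one of them, MongoDB's cloud service mongoDB Atlas.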

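<Let's use mongoDB>

After signing up, two packages need to be installed:

pymongo, dnspython

Then, on the mongoDB connection screen, select the language you use (Python) and its version, and connect with the URL it gives you.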

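Basic pymongo setup

from pymongo import MongoClient
client = MongoClient('Enter the URL here')   # the connection URL copied from mongoDB Atlas
db = client.dbsparta 		# dbsparta is the name of the database to use; the user id and password go inside the URL.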

Using the setup above, let's save the movie data we crawled earlier to the DB.

We can see that the data was saved to mongoDB correctly.
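
The saving step is not shown in the post, so the following is only a sketch of how it might look. It assumes the crawled movies are available as `(rank, title, star)` tuples and that they go into a collection called `movies`; both are assumptions, as are the helper names. `insert_many` is the pymongo collection method for inserting a list of documents. A small stand-in collection is used so the sketch runs without a real connection.

```python
def to_documents(rows):
    """Turn (rank, title, star) tuples into dictionaries ready for mongoDB."""
    return [{"rank": rank, "title": title, "star": star} for rank, title, star in rows]


def save_movies(collection, rows):
    """Insert the rows into the given collection and return how many were saved."""
    docs = to_documents(rows)
    if docs:  # pymongo's insert_many rejects an empty list, so skip it
        collection.insert_many(docs)
    return len(docs)


class FakeCollection:
    """Stand-in for a pymongo collection, used only to try the code offline."""

    def __init__(self):
        self.stored = []

    def insert_many(self, docs):
        self.stored.extend(docs)


# Made-up sample data in place of the real crawling result.
crawled = [(1, "Movie A", 9.5), (2, "Movie B", 9.1)]

fake = FakeCollection()
saved_count = save_movies(fake, crawled)
print(saved_count, fake.stored)

# With the real connection from the setup above, the call would be:
# save_movies(db.movies, crawled)
```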

<Extra crawling assignment>

Let's crawl the rank, song title, and singer from
https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20210701

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20210701', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')
# rank
#body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.number
#body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.number

# title
#body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.title.ellipsis
#body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.info > a.title.ellipsis

# singer
#body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.artist.ellipsis
#body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.info > a.artist.ellipsis

musics = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

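for music in musics:
    # Each row holds one song: pull out the title, rank, and singer.
    title = music.select_one('a.title.ellipsis').text.strip()
    rank = music.select_one('td.number').text[0:2].strip()
    singer = music.select_one('a.artist.ellipsis').text.strip()
    # Adult-rated songs carry a '19금' label in front of the title.
    # str.strip('19금') removes the characters '1', '9', '금' from both ends,
    # so cut the label off as a prefix instead.
    if title.startswith('19금'):
        title = title[len('19금'):].strip()

    print(rank, title, singer)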

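Peaches, the rank-15 song, gave me some trouble because of the '19금' (adults-only) label attached to its title, but it can be solved with careful use of string functions such as split() and strip().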
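
One way to handle the label is sketched below. It assumes the label text comes before the song name in the scraped string; the helper name `clean_title` and the sample strings are made up for illustration. It also shows why passing the label to `strip()` is risky: `strip('19금')` removes any of the characters '1', '9', '금' from both ends rather than the label as a whole.

```python
def clean_title(raw_title, label="19금"):
    """Trim whitespace and drop the adults-only label if it leads the title."""
    title = raw_title.strip()
    if title.startswith(label):
        title = title[len(label):].strip()
    return title


# A made-up string shaped like the scraped text of a labelled song.
scraped = "19금\n        Peaches (Feat. Daniel Caesar & Giveon)"
print(clean_title(scraped))

# A hypothetical title that ends in '9' shows the difference.
tricky = "19금 Track 9"
print(clean_title(tricky))          # keeps the trailing 9
print(tricky.strip("19금").strip())  # also eats the trailing 9
```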
