WIL - 3

최민규 · January 19, 2023

0108-seoul-120-view-scrapping-input

Answers to the most common civil inquiries (frequently asked questions) handled by the Seoul Metropolitan Government's Dasan Call Center
https://opengov.seoul.go.kr/civilappeal/list

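pandas : a data analysis tool for Python, similar to Excel
requests : an HTTP library that acts like a very small browser, fetching the content of a web page
BeautifulSoup : used to find HTML tags in a page fetched with requests
time : used to pause between requests, since fetching a lot of data at once can put a load on the server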
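
As a rough illustration of how requests, BeautifulSoup, and time fit together, here is a minimal sketch. The names `fetch_html`, `parse_rows`, and `SAMPLE_HTML`, the `table tbody tr` selector, and the sample markup are all assumptions made for this example; the real page may be structured differently. To keep the example runnable offline, only the parsing step is executed.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_html(url, delay=0.5):
    """GET a page, then pause so repeated calls do not overload the server."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx / 5xx responses
    time.sleep(delay)
    return response.text


def parse_rows(html):
    """Extract (number, title) pairs from the table rows in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tbody tr"):  # CSS selector, returns all matches
        cells = tr.find_all("td")             # every <td> inside this row
        number = cells[0].get_text().strip()  # text only, outer whitespace removed
        title = cells[1].get_text().strip()
        rows.append((number, title))
    return rows


# Hypothetical markup standing in for one page of the list (an assumption).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td> 1 </td><td><a href="/view/1"> Sample question A </a></td></tr>
    <tr><td> 2 </td><td><a href="/view/2"> Sample question B </a></td></tr>
  </tbody>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
print(rows)

# find() returns only the first matching tag; attributes are read like dict keys.
first_href = BeautifulSoup(SAMPLE_HTML, "html.parser").find("a")["href"]
print(first_href)
```

In a real run, `fetch_html` would be called once per page inside a loop, and its result passed to `parse_rows`.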

summary
- [strip()](https://wikidocs.net/33017) : string handling, removes whitespace from both ends
- [requests.get(*url*)](https://www.w3schools.com/python/ref_requests_get.asp) : sends a GET request to the server at url
- [select()](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=select) : finds tags with a CSS selector and returns all matches as a list
- [find()](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=select) : extracts the part you want by searching for a tag
  - A tag consists of a name, attributes, and attribute values
  - So find() can locate a tag by specifying its name, attribute, and value
  - If several tags match, only the first one is returned
- [find_all()](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all) : returns every matching tag
- [get_text()](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=get_text) : returns the text of the current tag and all of its descendants as a single Unicode string, with the markup removed
  - Can be shortened to .text
- [pd.concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat) : concatenates DataFrames
- [df.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html?highlight=shape) : returns the number of rows and columns of the DataFrame as a tuple
- [df.head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html?highlight=head) : returns the first n rows of the DataFrame (5 by default)
- [df.tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html?highlight=tail) : returns the last n rows of the DataFrame (5 by default)
- [pd.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html?highlight=pandas%20read) : reads the HTML tables on a page into a list of DataFrames
  - pd works as a shorthand because of import pandas as pd; without the as, it would be pandas.read_html
  - Besides HTML, the pd.read_* family can load many other formats (for example read_csv, read_excel, read_json)
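- df.set_index() : sets the index using one or more existing columns
  - Can replace the existing index or extend it
- [df.reset_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) : resets the index to the default integer index
- df.transpose() : swaps the index and the columns
  - Used to build the transpose of a matrix
  - Can be shortened to .T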
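- [df.apply()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html?highlight=apply#pandas.DataFrame.apply) : applies a function along one axis of the DataFrame
  - The function to apply goes inside the parentheses
  - axis : 0 passes each column to the function, 1 passes each row (default = 0)
- [sr.map()](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) : converts each value according to a mapping (dict or Series) or a function; it works on a Series, not on a DataFrame
  - For DataFrames there is the element-wise [df.applymap()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html?highlight=applymap), which is deprecated since pandas 2.1 in favor of df.map()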
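- [df.merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge) : merges on columns or on the index
  - how="left" : keeps every row of the list; if a content number was skipped and its detail was not collected, the detail columns show up as missing values
  - how="right" : keeps every collected detail; if a content number was collected but is not in the list, the list columns show up as missing values
- [df.join()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html?highlight=join) : merges on the index by default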
- df[list_of_columns] : selects the columns in the order given by the list, which reorders the DataFrame's columns
- df["column"][index] : the value stored in "column" at the given index label
- [tqdm.notebook](https://tqdm.github.io/docs/notebook/) : displays a progress bar in IPython / Jupyter Notebook
- progress_map, progress_apply : versions of map and apply that show a progress bar while they run
  - Available after importing tqdm and calling tqdm.pandas()
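
Several of the pandas calls above can be tried together on a small example. The sketch below uses made-up data; the frames `listing` and `detail` and the columns `number`, `title`, and `category` are assumptions for illustration, not the columns of the actual dataset.

```python
import pandas as pd

# Made-up data: a listing table and the details collected for part of it.
listing = pd.DataFrame({"number": [1, 2, 3], "title": [" A ", "B", "C "]})
detail = pd.DataFrame(
    {"number": [1, 3, 4], "category": ["tax", "traffic", "housing"]}
)

# Series.map with a function: strip whitespace from every title.
listing["title"] = listing["title"].map(str.strip)

# how="left" keeps every listing row; number 2 has no detail, so category is NaN.
left = listing.merge(detail, on="number", how="left")

# how="right" keeps every detail row; number 4 is not listed, so title is NaN.
right = listing.merge(detail, on="number", how="right")

# DataFrame.apply with axis=1 passes each row to the function.
left["label"] = left.apply(lambda row: f"{row['number']}-{row['title']}", axis=1)

# Selecting with a list of column names reorders the columns.
left = left[["number", "label", "title", "category"]]

# set_index / reset_index round trip, and the transpose via .T
indexed = left.set_index("number")
restored = indexed.reset_index()
transposed = indexed.T

# concat stacks the frames vertically; ignore_index renumbers the rows.
stacked = pd.concat([listing, listing], ignore_index=True)

print(left.shape)  # (3, 4)
print(left.head())
```

Comparing `left` and `right` side by side makes it easy to see which content numbers are missing from the detail data and which are missing from the list.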
