[python] #1. BeautifulSoup - Installation and Basic Tag Reading

늘 공부하는 괴짜 · October 8, 2021

beautifulsoup python

python - web crawling


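Installation is done, so now it is time to dig into the basics of crawling, starting from the fundamentals.

This post summarizes what I learned while reading through the "Beautiful Soup documentation" (https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/), and I plan to cover all of it. It looks enormous, but that is an illusion. I will post one topic at a time.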

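BeautifulSoup's html.parser

Let's run the code below.

from bs4 import BeautifulSoup

# No parser is specified here, on purpose
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
print(tag)

We are greeted by this warning message.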

GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 19 of the file /python/craw/craw1.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

In short, it tells us to pass "html.parser" as an extra argument to the BeautifulSoup constructor.

Like this:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
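As a small sketch of the point above (assuming bs4 is installed; the variable names are my own), the parser can be passed positionally or through the features keyword that the warning mentions. Recording warnings with the standard warnings module is just one way to check that the guessed-parser message no longer appears.

```python
import warnings
from bs4 import BeautifulSoup

html = '<b class="boldest">Extremely bold</b>'

# Record every warning raised while building the soup objects
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soup_positional = BeautifulSoup(html, "html.parser")
    soup_keyword = BeautifulSoup(html, features="html.parser")

# Keep only the "no parser specified" warnings, if any
parser_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]

print(soup_positional.b)
print(soup_keyword.b)
print(len(parser_warnings))
```

Both forms should parse the markup the same way, so which one to use is mostly a matter of taste.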

Once the HTML is parsed, let's pull things out

Get the <b> tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
b_tag = soup.b
print(b_tag)

<!-- Result: the entire b tag is returned -->
<b class="boldest">Extremely bold</b>
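To see what soup.b actually hands back, here is a minimal sketch. The second <b> tag and the variable names are made up for illustration. As far as I can tell, attribute-style access behaves like find() and returns only the first match.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<b class="boldest">Extremely bold</b><b>second</b>', 'html.parser'
)

first_b = soup.b          # attribute-style access returns the first <b> only
same_b = soup.find('b')   # find() also returns the first match

print(first_b.name)       # name of the tag
print(first_b is same_b)  # both point at the same Tag object
print(soup.i)             # a tag that is not in the document
```

A tag that does not exist comes back as None rather than raising an error, so it is worth checking before chaining further access.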

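Get the class attribute of the <b> tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
b_class = soup.b["class"]
print(b_class)

<!-- Result: the class attribute of the b tag is returned, as a list even for a single value -->
['boldest']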

A class attribute can also hold several values. What then?

soup = BeautifulSoup('<b class="boldest nana">Extremely bold</b>', 'html.parser')
b_class = soup.b["class"]
print(b_class)

<!-- Result: every value of the b tag's class attribute is returned -->
['boldest', 'nana']
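Attribute access works like a dictionary, so a misspelled key such as "clsss" fails loudly. The sketch below (variable names are my own) shows the attrs dict, the safer get(), and the KeyError raised by a wrong key.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest nana">Extremely bold</b>', 'html.parser')
b_tag = soup.b

print(b_tag.attrs)       # all attributes as a dict
print(b_tag.get("id"))   # get() returns None for a missing attribute

try:
    b_tag["clsss"]       # misspelled key, used here on purpose
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

print(missing_key_raised)
```

When an attribute may or may not be present, get() is probably the more comfortable choice.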

Attributes can be modified too!

Adding an attribute that does not exist yet

soup = BeautifulSoup('<b class="boldest nana">Extremely bold</b>', 'html.parser')
b_tag = soup.b
b_tag["id"] = "bId"
print(b_tag)

<!-- Result: the id attribute has been added. -->
<b class="boldest nana" id="bId">Extremely bold</b>

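Modifying an existing attribute (works the same way as adding one, so skipped)
What about a multi-valued attribute?

soup = BeautifulSoup('<b class="boldest nana">Extremely bold</b>', 'html.parser')
b_tag = soup.b
b_tag["class"] = ["s-class", "e-class"]
print(b_tag)

<!-- Result: the class attribute has been changed. -->
<b class="s-class e-class">Extremely bold</b>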
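Two related operations round this out; this is a rough sketch with hypothetical variable names. The class can also be assigned as a single space-separated string, and del removes an attribute altogether.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<b class="boldest nana" id="bId">Extremely bold</b>', 'html.parser'
)
b_tag = soup.b

# Assign the class as one space-separated string instead of a list
b_tag["class"] = "s-class e-class"
after_assign = str(b_tag)
print(after_assign)

# Remove attributes entirely with del
del b_tag["id"]
del b_tag["class"]
after_delete = str(b_tag)
print(after_delete)
```

Note that a string assigned this way stays a string when read back, whereas a parsed class attribute comes back as a list.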

The text inside a tag can be modified too

soup = BeautifulSoup('<b class="boldest nana">Extremely bold</b>', 'html.parser')
b_tag = soup.b
b_tag.string.replace_with("No corona")
print(b_tag.string)

<!-- Result: the text has been changed -->
No corona
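Assigning directly to .string is another way to get the same effect. It is shown here as a minimal sketch under the same sample markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest nana">Extremely bold</b>', 'html.parser')
b_tag = soup.b

# Assigning to .string replaces the tag's contents with the new text
b_tag.string = "No corona"

print(b_tag.string)
print(b_tag)
```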

I hate corona.

Comments work like this

A basic comment shows up as follows.

soup = BeautifulSoup('<b class="boldest nana"><!-- This is comment --></b>', 'html.parser')
b_tag = soup.b
print(type(b_tag.string))
print(b_tag.string)

<!-- Result: the type is Comment, and the text inside the comment is printed -->
<class 'bs4.element.Comment'>
This is comment
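A Comment is a special kind of NavigableString, which is why .string returns it. The sketch below (names are my own) checks that relationship and shows that the comment text keeps its surrounding spaces.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup(
    '<b class="boldest nana"><!-- This is comment --></b>', 'html.parser'
)
comment = soup.b.string

print(isinstance(comment, Comment))
print(isinstance(comment, NavigableString))  # Comment subclasses NavigableString
print(repr(str(comment)))                    # repr() makes the surrounding spaces visible
print(soup.b)                                # serializing puts the <!-- --> markers back
```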

What if we put a comment inside a tag and read it back?

soup = BeautifulSoup('<b class="boldest nana">This is String type</b>', 'html.parser')
b_tag = soup.b
print(b_tag.string)

b_tag.string.replace_with("<!-- This is comment -->")

print(b_tag.string)
print(type(b_tag.string))

<!-- Result: the text inside the tag was replaced, but it is read as a plain string, not a comment -->
This is String type
<!-- This is comment -->
<class 'bs4.element.NavigableString'>
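To get a real comment into the tree, one option is to build a Comment object instead of passing a plain string. This is a minimal sketch with hypothetical variable names. It also shows that the plain string version is escaped when the tag is serialized.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<b class="boldest nana">This is String type</b>', 'html.parser')
b_tag = soup.b

# A plain str becomes ordinary text, so it is escaped on output
b_tag.string.replace_with("<!-- This is comment -->")
as_text = str(b_tag)
print(as_text)

# A Comment object holds only the inner text; the markers are added on output
b_tag.string.replace_with(Comment(" This is comment "))
as_comment = str(b_tag)
print(as_comment)
print(type(b_tag.string))
```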

Would wrapping it in CDATA work?

from bs4 import BeautifulSoup, CData

soup = BeautifulSoup('<b class="boldest nana">This is String type</b>', 'html.parser')
b_tag = soup.b
print(b_tag.string)

b_tag.string.replace_with(CData("<!-- This is comment -->"))

print(b_tag.string)
print(type(b_tag.string))

<!-- Result: the type comes out as CData... -->
This is String type
<!-- This is comment -->
<class 'bs4.element.CData'>
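Printing .string hides what CData changes, so this sketch serializes the whole tag instead (the variable names are my own). The text should come out wrapped in CDATA markers and left unescaped, which means it is still not a comment.

```python
from bs4 import BeautifulSoup, CData

soup = BeautifulSoup('<b class="boldest nana">This is String type</b>', 'html.parser')
b_tag = soup.b

b_tag.string.replace_with(CData("<!-- This is comment -->"))

# On output the CData text is wrapped in CDATA markers and left unescaped
serialized = str(b_tag)
print(serialized)
```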