[Big Leader AI] Crawler Day 4

KrTeaparty · July 9, 2022

Collecting images and text from multiple blog posts after a keyword search on Naver

The hard part of crawling Naver blog search is filtering the results by post date.

element = driver.find_element(By.ID,"query")
element.send_keys(query_txt)
element.submit()

driver.find_element(By.LINK_TEXT,'VIEW').click()
time.sleep(1)
driver.find_element(By.LINK_TEXT, '블로그').click()
time.sleep(1)
driver.find_element(By.LINK_TEXT, '옵션').click()
time.sleep(1)
driver.find_element(By.LINK_TEXT, '최신순').click()
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="snb"]/div[2]/ul/li[3]/div/div[1]/a[9]').click()
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[1]/span[1]/a').click()

driver.find_element(By.LINK_TEXT, start_year).click()

driver.find_element(By.XPATH, '//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[2]/div[2]/div/div/div/ul/li[{}]/a'.format(start_mon)).click()
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[2]/div[3]/div/div/div/ul/li[{}]/a'.format(start_day)).click()
time.sleep(1)
driver.find_element(By.XPATH,'//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[1]/span[3]/a').click()

driver.find_element(By.LINK_TEXT,end_year).click()
time.sleep(1)  
driver.find_element(By.XPATH,'//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[2]/div[2]/div/div/div/ul/li[{}]/a'.format(end_mon)).click()
time.sleep(1)
driver.find_element(By.XPATH,'//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[2]/div[3]/div/div/div/ul/li[{}]/a'.format(end_day)).click()
time.sleep(1)
driver.find_element(By.XPATH, '//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[3]/button').click()
time.sleep(2)

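Process
1. Search by keyword
2. Click VIEW
3. Click 블로그 (Blog)
4. Click 옵션 (Options)
5. Click 최신순 (sort by newest)
6. Click the start date
7. Click the year, month, and day
8. Click the end date
9. Click the year, month, and day
10. Click Apply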

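The month and day entries can be selected from user input by formatting the value into the XPath, and the year by its link text, so the code was written that way.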
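
As a minimal sketch of that idea, the two near-identical XPath templates in the code above can be produced by one helper. The function name `date_picker_xpath` and its `part` parameter are hypothetical; the XPath itself is copied from the code above and will stop matching whenever Naver changes its page layout.

```python
# XPath template copied from the code above; only the column and the item index vary.
_TEMPLATE = ('//*[@id="snb"]/div[2]/ul/li[3]/div/div[2]/div[2]'
             '/div[{column}]/div/div/div/ul/li[{index}]/a')

# Column positions as they appear in the XPaths above: month is div[2], day is div[3].
_COLUMNS = {"month": 2, "day": 3}


def date_picker_xpath(part, value):
    """Return the XPath of the month or day entry for a 1-based value."""
    index = int(value)  # accepts "7", "07" or 7
    if index < 1:
        raise ValueError("value must be 1 or greater")
    return _TEMPLATE.format(column=_COLUMNS[part], index=index)


print(date_picker_xpath("month", "7"))
print(date_picker_xpath("day", 9))
```

With this helper, the click for the start month would read `driver.find_element(By.XPATH, date_picker_xpath("month", start_mon)).click()`.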

Naver blog search shows 30 results at first; more are loaded only when the page is scrolled.

def scroll_down(driver):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    time.sleep(3)

# Scroll a few more times than the number of pages needed
for i in range(1, page_cnt + 3):
    scroll_down(driver)
    print('Extracting page %s' % i)   # print the page that was just scrolled
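
The loop above scrolls a fixed number of times. An alternative, shown here only as a sketch and not as the method used in this post, is to keep scrolling until the page height stops growing. The name `scroll_until_stable` and the parameters `pause` and `max_rounds` are hypothetical; `driver` only needs an `execute_script` method, as a Selenium WebDriver has.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded results time to arrive
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return rounds
```

This loads every available result, so when only `cnt` posts are needed the fixed-count loop may finish sooner.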

url_all_list=[]    # list that stores the links found in each search result

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
url_list_1 = soup.find('ul','lst_total').find_all('li')

for a in url_list_1 :
    url_all_list.append( a.find('div','total_area').find_all('a') )

# The sixth link of each result is taken as the post URL
url_detail=[]
for b in range(0,len(url_all_list)) :
    url_detail.append( url_all_list[b][5]['href'] )

url_final_list=[]
no = 1
for c in url_detail :    
    if c.split('/')[2] == 'blog.naver.com' : # collect Naver blogs only
        url_final_list.append(c)
        no += 1

        if no > cnt :   # stop once cnt URLs have been collected
            break

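The information to extract is the blog URL, the author's nickname, the post date, the post body, and the images.
On Naver, clicking a blog in the search results opens it in a new tab. To avoid that hassle, all target URLs are collected first and then opened with driver.get(). For this to work, the results page is scrolled down as far as needed beforehand.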
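
The URL filtering step described here can be sketched as a small function. This is an illustration rather than the post's code: the name `pick_naver_blog_urls` is hypothetical, and it uses `urllib.parse.urlparse` instead of `split('/')` to read the host name.

```python
from urllib.parse import urlparse


def pick_naver_blog_urls(urls, limit):
    """Keep at most `limit` URLs whose host is blog.naver.com, in order."""
    picked = []
    for url in urls:
        if len(picked) >= limit:
            break
        if urlparse(url).netloc == "blog.naver.com":
            picked.append(url)
    return picked


sample = [
    "https://blog.naver.com/a/1",
    "https://cafe.naver.com/b/2",
    "https://blog.naver.com/c/3",
]
print(pick_naver_blog_urls(sample, 2))
```

Each URL in the returned list can then be opened in turn with `driver.get(url)`.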

Note that Naver blogs use an iframe. There are also several types of Naver blog, and some tags differ depending on the type. This can be handled by writing separate collection code for each type in its own if branch.
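
One way to deal with the iframe, sketched here under assumptions, is to read the iframe's `src` from the outer page and open that URL directly. The frame id `mainFrame` is an assumption about Naver's markup, and `iframe_url` is a hypothetical helper name. With Selenium, the alternative is to switch into the frame with `driver.switch_to.frame(...)` before reading `page_source`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _IframeFinder(HTMLParser):
    """Remember the src of the first <iframe> with the given id."""

    def __init__(self, frame_id):
        super().__init__()
        self.frame_id = frame_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe" and self.src is None:
            attributes = dict(attrs)
            if attributes.get("id") == self.frame_id:
                self.src = attributes.get("src")


def iframe_url(page_url, html, frame_id="mainFrame"):
    """Return the absolute URL of the iframe content, or None if not found."""
    finder = _IframeFinder(frame_id)
    finder.feed(html)
    if finder.src is None:
        return None
    return urljoin(page_url, finder.src)  # resolve a relative src against the page URL
```

The returned URL can be passed to `driver.get()` so that the post body appears in `driver.page_source` without switching frames.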

        # Save images
        img_dir = f_dir+'NAVER_BLOG'+'-'+s+'-'+query_txt+'/'+wname+'/'
        os.makedirs(img_dir, exist_ok=True)   # do not fail if the folder already exists
        
        img_src1 = soup.find('div','post-view pcol2 _param(1) _postViewArea221286810022').find_all('img')
        for img in img_src1 :
            img_src2 = img['src']
            print(img_src2)
            try:
                # Percent-encode any Korean characters in the URL before downloading
                urllib.request.urlretrieve(urllib.parse.quote(img_src2.encode('utf8'), ':/'),img_dir+str(img_no)+'.jpg')
            except Exception:
                pass   # skip images that fail to download
            img_no += 1

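This is the part that saves the images. The important point is to pass the URL through parse.quote() inside urlretrieve() so that any Korean characters it may contain are percent-encoded.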
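
The effect of `quote()` can be seen in isolation with a small sketch. The helper name `encode_image_url` and the example URL are made up for illustration. The post's code passes `':/'` as the safe characters; the wider set used here (`?`, `&`, `=`, `%` added) is an assumption meant to leave query strings and already-encoded URLs untouched.

```python
from urllib.parse import quote, unquote


def encode_image_url(url, safe=":/?&=%"):
    """Percent-encode non-ASCII characters and spaces, keeping URL delimiters."""
    return quote(url, safe=safe)


original = "https://example.com/이미지/사진 1.jpg"
encoded = encode_image_url(original)
print(encoded)                        # ASCII-only URL that urlretrieve can request
print(unquote(encoded) == original)   # decoding restores the original text
```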

Fetching Amazon Best Sellers

# Get user input
print("=" *80)
print(" Extract Best Seller product information by Amazon category ")
print("=" *80)

department = ['Amazon Devices & Accessories','Amazon Launchpad','Appliances',
              'Apps & Games','Arts, Crafts & Sewing','Audible Books & Originals',
              'Automotive','Baby','Beauty & Personal Care','Books',
              'Camera & Photo Products','CDs & Vinyl','Cell Phones & Accessories',
              'Clothing, Shoes & Jewelry','Collectible Coins',
              'Computers & Accessories','Digital Educational Resources',
              'Digital Music','Electronics','Entertainment Collectibles',
              'Gift Cards','Grocery & Gourmet Food','Handmade Products',
              'Health & Household','Home & Kitchen','Industrial & Scientific',
              'Kindle Store','Kitchen & Dining','Magazine Subscriptions',
              'Movies & TV','Musical Instruments','Office Products',
              'Patio, Lawn & Garden','Pet Supplies','Software',
              'Sports & Outdoors','Sports Collectibles','Tools & Home Improvement',
              'Toys & Games','Video Games']
for i, w in enumerate(department):
    print('{}.{}'.format(i+1, w), end=' ')
    if (i+1) % 3 == 0:   # line break after every third item
        print()
    
print('\n')
dep_no = int(input('1. Enter the number of the category to collect from the list above: '))
cnt = int(input('2. How many items should be crawled?: '))
page_cnt = math.ceil(cnt/50)   # 50 items per page

f_dir = input("3. Enter only the folder to save files in (default path: c:\\py_temp\\):")
if f_dir == '' :
    f_dir = "C:/Users/HJK/Desktop/data_kyungnam/crawler/소스코드/result/"
    
print("\n")

The category to collect, the number of items, and so on are taken from the user. The entered number is later used as an index to click the category with find_element(By.LINK_TEXT, department[dep_no-1]).

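Amazon shows 30 results at first; the rest load over several scrolls until the page holds 50. For that reason, at the start of each loop iteration, the code scrolls down enough times whenever more than 30 items still have to be read from that page.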

for x in range(1, page_cnt+1):    
    if cnt - count > 30:
        for _ in range(0, 6):
            scroll_down(driver)
            time.sleep(1)
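
The scrolling decision above can be checked without a browser. The following sketch assumes the numbers stated in this post (50 items per page, 30 visible before scrolling); the function name `pages_needing_scroll` is hypothetical.

```python
import math


def pages_needing_scroll(cnt, per_page=50, initial=30):
    """Return (page_number, items_wanted, needs_scroll) for each page."""
    plan = []
    remaining = cnt
    for page in range(1, math.ceil(cnt / per_page) + 1):
        wanted = min(remaining, per_page)   # items to read from this page
        # Scrolling is needed only when more items are wanted than are initially visible
        plan.append((page, wanted, wanted > initial))
        remaining -= wanted
    return plan


for page, wanted, needs_scroll in pages_needing_scroll(70):
    print(page, wanted, needs_scroll)
```

For 70 items, the first page needs all 50 items and therefore scrolling, while the second page needs only 20 and can be read as loaded.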

Collecting Gmarket Best Seller information

The Gmarket Best Seller list has no pagination and nothing that loads on scroll, so it can be crawled in the ordinary way.

        # Download the product image
        try :
            photo = item.find('div','thumb').find('img')['data-original']
        except (AttributeError, TypeError, KeyError) :
            continue   # no thumbnail found for this item
        photo = 'https:'+photo
        file_no += 1
        try:
            urllib.request.urlretrieve(photo,str(file_no)+'.jpg')
        except Exception:
            print('Image download failed')
        time.sleep(0.5)

One thing to watch out for: the image URL on the img tag does not include "https:", so it has to be prepended.
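
A slightly more defensive version of that step, shown only as a sketch, prepends the scheme only when the URL is protocol-relative (starts with `//`). The helper name `with_scheme` and the example host are hypothetical.

```python
def with_scheme(url, scheme="https"):
    """Prefix protocol-relative URLs ("//host/path") with a scheme."""
    if url.startswith("//"):
        return f"{scheme}:{url}"
    return url  # already absolute, leave unchanged


print(with_scheme("//image.example.com/goods/1.jpg"))
print(with_scheme("https://image.example.com/goods/1.jpg"))
```

This avoids producing `https:https://...` if the site ever starts returning full URLs.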

        try :
            # Remove the part that gets in the way before reading the discount rate
            item.find('div','s-price').find('strong').decompose()
            discount = item.find('em').get_text().strip()
        except AttributeError :
            discount = '0'   # no discount rate found for this item
        print('5.Discount rate:', discount)
        f.write('5.Discount rate:'+ discount + "\n")

This is the part that reads the discount rate. Another element got in the way of reading it, so that element is removed with decompose() first and the discount rate is then read reliably.
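
The following sketch shows why removing an element first can matter. The HTML is a made-up structure, not Gmarket's actual markup: it assumes the price block contains its own `<em>`, which `find('em')` would otherwise return before the discount. The function name `extract_discount` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <strong> inside s-price also contains an <em>.
SAMPLE_HTML = """
<div class="item">
  <div class="s-price"><strong><span>12,000</span><em>won</em></strong></div>
  <em>15%</em>
</div>
"""


def extract_discount(item):
    """Return the discount text, or '0' when no discount is shown."""
    price = item.find('div', 's-price')
    strong = price.find('strong') if price else None
    if strong:
        strong.decompose()  # drop the price block so its <em> is not matched
    em = item.find('em')
    return em.get_text().strip() if em else '0'


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(extract_discount(soup.find('div', 'item')))
```

Note that decompose() modifies the parsed tree in place, so any data inside the removed element has to be read before it is called.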

import win32com.client as win32   # works after installing pywin32 / pypiwin32
import win32api  # run the Python prompt as administrator to avoid errors

excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(fx_name)
sheet = wb.ActiveSheet
sheet.Columns(2).ColumnWidth = 30   # adjust the column width to fit the image width
row_cnt = cnt+1
sheet.Rows("2:%s" %row_cnt).RowHeight = 120  # adjust the row height to fit the image height

ws = wb.Sheets("Sheet1")
col_name2=[]
file_name2=[]

for a in range(2,cnt+2) :
    col_name='B'+str(a)
    col_name2.append(col_name)

for b in range(1,cnt+1) :
    file_name=img_dir+'/'+str(b)+'.jpg'
    file_name2.append(file_name)
      
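for i in range(0,cnt) :
    rng = ws.Range(col_name2[i])
    # AddPicture(file, LinkToFile, SaveWithDocument, Left, Top, Width, Height)
    image = ws.Shapes.AddPicture(file_name2[i], False, True, rng.Left, rng.Top, 130, 100)

excel.Visible=True
excel.ActiveWorkbook.Save()   # save once after all images have been inserted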
    
driver.close()

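This part inserts the saved images into the file that was saved as xls. The code sometimes worked and sometimes did not, which made it the most troublesome part of the exercise.
The only parts of the code above worth adjusting are the ones that set the rows and columns.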

Headless mode

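So far it was necessary to watch the browser work, for learning purposes. In real crawling, however, not being able to do anything else while the crawler runs is a burden. Headless mode solves this by running the browser in the background without a visible window.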

# Configure headless mode
options = webdriver.ChromeOptions()
options.add_argument('headless')
#options.add_argument('window-size=1920x1080')
#options.add_argument("disable-gpu")
args = ["hide_console", ]
#options.add_argument("--disable-extensions")
options.add_argument("disable-infobars")

path = r'c:\py_temp\chromedriver.exe'   # raw string so the backslashes are kept literally
driver = webdriver.Chrome(path,options=options,service_args=args)

This code sets the headless option and starts the browser with it.

In addition, the pyinstaller module can build an executable from the script. To give the executable a GUI, a GUI toolkit such as PyQt5 is needed.
