🌥 BE TIL Day 9 0324

JB · March 24, 2022

Backend Crawling Database cheerio mongoose puppeteer scraping

CodeCamp BE 02


⬇️ Main Note
https://docs.google.com/document/d/1IZ5yYEtX92E7k2ijoAZZB3W_nBG9MpGPX6OKk_POxLQ/edit

🐚 Scraping vs. Crawling

🧃 Scraping

Fetching data from another site a single time.
Tool: Cheerio.

🍷 Crawling

Repeatedly fetching data from another website.
Tool: Puppeteer.

How scraping/crawling works:

Open the browser's developer tools (Inspect) with Command + Option + I on macOS.

The Elements tab shows the page's tags, for example an <em> tag.
--> Scraping means pulling the data out of those tags; what happens to the data afterwards is up to the developer.

XML

To understand scraping, it helps to know the data format that was common before JSON: XML.
XML: Extensible Markup Language
--> HTML, which also uses </> tags, stands for HyperText Markup Language.
--> Examples of XML tags: <Writer/>, <School/>, etc.
Before JSON, data was sent in a format like <Name>JB</Name>.
--> Inefficient: every value has to be wrapped in two tags, an opening tag and a closing tag.
An API usually responds with JSON, but a request to a web page returns the HTML itself, delivered as a string.
--> This can be checked by sending the request in Postman.
GET https://naver.com : the response body contains the page's HTML elements.
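To make the XML vs. JSON comparison concrete, here is a small sketch that writes the same record in both formats. The Profile wrapper tag and the "CodeCamp" value are made up for illustration; only <Name>JB</Name> comes from the notes above.

```javascript
// The same record written in XML and in JSON (illustrative values).
const xmlText = "<Profile><Name>JB</Name><School>CodeCamp</School></Profile>";
const jsonText = '{"name":"JB","school":"CodeCamp"}';

// JSON becomes a plain JavaScript object with one built-in call.
const profile = JSON.parse(jsonText);
console.log(profile.name);

// XML repeats every tag name twice (opening and closing), so the text is longer.
console.log("XML length:", xmlText.length, "JSON length:", jsonText.length);
```

The XML string is longer because each value is wrapped in an opening and a closing tag, which is the inefficiency mentioned above.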

🐚 Scraping

Cheerio
Cheerio loads HTML received as a string so that its tags can be selected and read. [tool]

When a link is posted on a service such as Discord, a preview box with the linked site's title and image appears.
This works because the linked site has meta tags with og properties in its head tag. Discord's developers read those tags to build the preview.
--> Creating a link preview
og stands for Open Graph. It was created by Facebook, which wanted link previews first.
If I build my own site at mysite.com, I should add these meta tags to its head tag from the start.
<meta property="og:title" content="..." />, <meta property="og:image" content="..." />
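As a rough illustration of what "reading the og tags" means, the sketch below pulls og values out of an HTML string with regular expressions. This is a simplification that needs no packages; the practice code in this post uses Cheerio for the same job, which is the safer choice for real pages. The function name extractOgTags, the sample HTML, and the preview.png path are all assumptions made for this example.

```javascript
// Hypothetical helper: collect og:* meta tags from an HTML string.
// Assumes attributes are written with double quotes, as in the sample below.
function extractOgTags(html) {
  const og = {};
  const metaTags = html.match(/<meta\s+[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    const property = /property\s*=\s*"og:([^"]+)"/i.exec(tag);
    const content = /content\s*=\s*"([^"]*)"/i.exec(tag);
    // Skip meta tags that have no og: property (charset, viewport, ...).
    if (property && content) og[property[1]] = content[1];
  }
  return og;
}

const sampleHtml = `
<head>
  <meta charset="utf-8" />
  <meta property="og:title" content="My Site" />
  <meta property="og:image" content="https://mysite.com/preview.png" />
</head>`;

console.log(extractOgTags(sampleHtml));
```

A service that builds a link preview would use the title and image values from the returned object.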

Process

The user uploads a post that contains --> title: "Hi there, this is my title", contents: "The weather's nice today. I want you guys to visit this site: aaa.com"
--> The goal is to show users a link preview (the title and image of the linked site).
To achieve this, the frontend sends the title and contents to the backend through the API.
--> POST '/boards' => sent as JSON
The backend picks out the link that starts with http from the contents. This is where scraping is needed: the backend requests that page (axios.get).
--> The response is stored in a variable.
Then find the og meta tags in the response, the same tags that are visible in the developer tools' Elements tab.
--> After picking out the needed data, the title, contents, and og values are saved to the database.
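The data-handling part of this process can be sketched without any network call. The helper names extractFirstUrl and buildBoardRecord are hypothetical, and the og object is passed in by hand here; in the real flow it would come from scraping the link.

```javascript
// Hypothetical helper: find the first word in the post that starts with "http".
function extractFirstUrl(contents) {
  return contents.split(/\s+/).find((word) => word.startsWith("http")) ?? null;
}

// Hypothetical helper: combine the post and the scraped og values
// into the single record that would be saved to the database.
function buildBoardRecord(post, og) {
  return { title: post.title, contents: post.contents, og };
}

const post = {
  title: "Hi there, this is my title",
  contents: "The weather's nice today. I want you guys to visit this site: https://aaa.com",
};

const targetUrl = extractFirstUrl(post.contents);
// Placeholder og values; a real backend would scrape them from targetUrl.
const record = buildBoardRecord(post, { title: "aaa", image: "https://aaa.com/og.png" });
console.log(targetUrl, record);
```

Returning null when no link is found lets the backend skip scraping for posts that contain no URL.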

Practice

import axios from "axios"
import * as cheerio from "cheerio"   // namespace import; recent cheerio versions have no default export

async function createBoardAPI(mydata){      // mydata <== the frontendData passed in below

    const targetUrl = mydata.contents.split(" ").filter((el) => el.startsWith("http"))[0]
    // Splitting on spaces gives an array of single words; keep only the words that start with http.
    // The filtered array holds just the link(s), so index 0 is the bare address.

    const aaa = await axios.get(targetUrl)      // request the page; aaa.data is the HTML as a string
    const $ = cheerio.load(aaa.data)            // $ is what selects and controls specific tags
    $("meta").each((_, el) => {    // select every meta tag; .each works like a for loop over all of them
        // _ : index of the meta tag // el : the element itself, e.g. the 3rd meta tag on the 3rd pass
        // We only need the meta tags whose property contains og:

        const property = $(el).attr('property')
        if (property && property.startsWith("og:")){       // .each visits every meta tag, so skip the ones without an og: property
            const key = property.split(":")[1]
            // split(":") separates og from the name: "og:title" becomes ['og', 'title'], and title is at index 1
            // title --> key, "네이버" --> value
            const value = $(el).attr('content') // e.g. "네이버" (Naver) for og:title
            console.log(key, value)
        }

    })
}

const frontendData = {      // what the frontend sends when a post is registered:
    title: "Hi there, this is my title 😚 ",
    contents: "The weather's nice today. I want you guys to visit this site: https://naver.com 입니다~"
}
createBoardAPI(frontendData)

onclick is an attribute of a tag.
property is also an attribute.
<meta property="og:title" />

When scraping happens repeatedly, it becomes crawling.
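Following the definition used in these notes, the relationship can be sketched as a loop around a single scrape. The names scrapeRepeatedly and scrapeOnce are hypothetical, and scrapeOnce stands in for any function that fetches a page once.

```javascript
// Hypothetical sketch: "crawling" as scraping repeated on an interval.
// scrapeOnce is any async function that fetches and returns data one time.
async function scrapeRepeatedly(scrapeOnce, times, intervalMs) {
  const results = [];
  for (let i = 0; i < times; i++) {
    results.push(await scrapeOnce());
    // Pause between rounds so the target site is not flooded with requests.
    if (i < times - 1) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return results;
}
```

A real crawler would usually run on a schedule instead of a fixed number of rounds; the count is used here only to keep the sketch finite.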

🐚 Crawling

Puppeteer is used when something has to be done after opening a browser, such as visiting a page and reading what it renders. [tool]

// Legal case on crawling involving 여기어때 (Yeogi Eottae): https://biz.chosun.com/topics/law_firm/2021/09/29/OOBWHWT5ZBF7DESIRKNPYIODLA/
// Indiscriminate crawling requests add traffic to the target site, so it needs more memory => and eventually more servers

import puppeteer from 'puppeteer'

async function startCrawling(){  // every step has to be awaited in turn (open the browser, open a tab, ...)
    const browser = await puppeteer.launch({headless: false}) // headless: false makes the browser window visible
    const page = await browser.newPage()    // open a new page
    await page.setViewport({width: 1280, height: 720}) // the page size can be set too
    await page.goto("https://www.goodchoice.kr/product/search/2")   // navigates in the Chromium browser  // Chrome is a browser built on top of Chromium (they are not the same thing)
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)) // pause helper, used instead of page.waitForTimeout, which recent Puppeteer versions no longer provide
    await sleep(1000) // give the page a moment after loading


    const star = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > div > span", (el) => el.textContent)
    // $eval selects a single element, $$eval selects several // '>' means a direct child tag, so here the span is a child of the div
    // Only the nth-child() number changes => the next hotel's rating: #poduct_list_area > li:nth-child(3) > a > div > div.name > div > span // => so a for loop can collect every item
    await sleep(1000)

    const location = (await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > p:nth-child(4)", (el)=> el.textContent)).trim()
    await sleep(1000)

    const price = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.price > p > b", (el) => el.textContent)
    await sleep(1000)

    console.log("⭐️ star:", star)
    console.log("📍 location:", location)
    console.log("💳 Price:", price)

    await browser.close()    // close the browser once crawling is done
}

startCrawling()
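Since only the nth-child() number changes between items, the selectors can be generated in a loop. Below is a minimal sketch of that idea; buildItemSelectors and readItems are hypothetical helper names, and the selectors are the ones used in the code above. readItems expects a Puppeteer page that has already navigated to the list.

```javascript
// Hypothetical helper: build the three selectors for the list item at `index`.
function buildItemSelectors(index) {
  const item = `#poduct_list_area > li:nth-child(${index}) > a > div`;
  return {
    star: `${item} > div.name > div > span`,
    location: `${item} > div.name > p:nth-child(4)`,
    price: `${item} > div.price > p > b`,
  };
}

// Hypothetical helper: read items `from`..`to` (inclusive) from an open page.
async function readItems(page, from, to) {
  const items = [];
  for (let i = from; i <= to; i++) {
    const selectors = buildItemSelectors(i);
    items.push({
      star: await page.$eval(selectors.star, (el) => el.textContent),
      location: (await page.$eval(selectors.location, (el) => el.textContent)).trim(),
      price: await page.$eval(selectors.price, (el) => el.textContent),
    });
  }
  return items;
}
```

Note that page.$eval throws when a selector matches nothing, so a list item with a different layout would stop the loop unless the call is wrapped in try/catch.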

iframe

An iframe is a separate page embedded inside the page shown in the browser. (a different "filling" inside the site)
--> The iframe is a completely different document.
--> The outer shell and the inside are different pages.
So even if the developer gets a selector with Copy selector, a selector copied from inside the iframe matches nothing when it is run against the outer page.
EX) If I copy the selector of a $30 product from a market page that is shown through an iframe, that selector points at data inside the iframe, not at the site I opened.
--> The site being accessed is Naver, but the data being pulled out belongs to the iframe's document.
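One way to handle this with Puppeteer is to look up the iframe's own frame and run the selector there instead of on the outer page. The sketch below assumes the frame can be recognized by part of its URL; readTextInsideIframe is a hypothetical helper, while page.frames(), frame.url(), and frame.$eval() are Puppeteer methods.

```javascript
// Hypothetical helper: run a selector inside the iframe whose URL contains frameUrlPart.
async function readTextInsideIframe(page, frameUrlPart, selector) {
  // page.frames() lists the main frame and every iframe attached to the page.
  const frame = page.frames().find((f) => f.url().includes(frameUrlPart));
  if (!frame) return null; // no matching iframe on this page
  // The selector is evaluated against the iframe's document, not the outer page.
  return frame.$eval(selector, (el) => el.textContent.trim());
}
```

The selector passed in should be the one copied from inside the iframe, since that is the document it is evaluated against.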