🌥 BE TIL Day 9 0324

j00b33 · March 24, 2022

CodeCamp BE 02

โฌ‡๏ธ Main Note
https://docs.google.com/document/d/1IZ5yYEtX92E7k2ijoAZZB3W_nBG9MpGPX6OKk_POxLQ/edit


๐Ÿš Scraping vs. Crawling

Scraping

  • Fetching another site's data a single time.
  • Cheerio is used as the tool.

๐Ÿท Crawling

  • Constantly getting the data from other web site.
  • use Puppeteer as a tool.

How scraping/crawling works:

Open the developer tools (Inspect) with Command + Option + I.

In the Elements tab there are tags such as <em>.
--> Fetching that data is scraping; what is done with the data afterwards is up to the developer.

XML

  • Before learning about scraping, it helps to understand the format that came before JSON. Before JSON, XML was used.
  • XML: Extensible Markup Language
    --> written with </> tags, like HTML (HyperText Markup Language)
    --> examples of XML tags: <Writer/>, <School/>, etc.
  • Before JSON, data was sent in the <Name>JB</Name> format.
    --> Inefficient (every value needs an opening and a closing tag around it)
  • A web page is not sent as JSON: the response is HTML, which arrives as one long string.
    --> It can be fetched in Postman.
  • GET https://naver.com : returns the HTML, so the data in its elements can be read.
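
To make the "two tags around every value" point concrete, here is a minimal sketch that writes the same record both ways. The `person` record and the `toXml` helper are made up for illustration and are not part of the original notes.

```javascript
// Hypothetical record used only for illustration.
const person = { name: "JB", school: "CodeCamp" };

// XML style: each value is wrapped in an opening and a closing tag.
function toXml(record) {
  return Object.entries(record)
    .map(([key, value]) => `<${key}>${value}</${key}>`)
    .join("");
}

const asXml = toXml(person);
const asJson = JSON.stringify(person);

console.log(asXml);  // <name>JB</name><school>CodeCamp</school>
console.log(asJson); // {"name":"JB","school":"CodeCamp"}
console.log(asXml.length, asJson.length); // the XML string is the longer one
```

Each key appears twice in the XML version and once in the JSON version, which is the inefficiency described above.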

๐Ÿš Scraping

Cheerio
Cheerio takes the HTML that arrives as a string and lets us pick out individual tags from it. [tool]

  • When a link is sent in another service, Discord for example, a preview image and title pop up in the link box.

  • When a site is created, its developers add meta tags with og properties inside the head tag. Services such as Discord read these tags.
    --> This is what creates the link preview.

  • og was created by Facebook, which first wanted to build link previews. og stands for Open Graph.

  • If I'm creating my own site and its address is mysite.com, the meta tags should be added to the head tag from the start.
    <meta property="og:title" content="..." />, <meta property="og:image" content="..." />

Process

  1. The user uploads a post that contains --> title: "Hi there, this is my title", contents: "The weather's nice today. I want you guys to visit this site: aaa.com"
    --> Here, the goal is to show users a link preview (the title and image of the site).
  2. To achieve this, the title and contents are sent to the backend via an API.
    --> POST '/boards' => sent as JSON
  3. The backend picks out the link that starts with http from the contents. THIS is where scraping is needed (axios.get).
    --> The result is stored in another variable.
  4. Then find the og meta tags in the fetched HTML, the same tags that show up in the Elements tab of the developer tools.
    --> After picking out the needed data, the title, contents, and og values are saved to the database.
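
Steps 3 and 4 can be sketched without any network access. The helper names `extractFirstUrl` and `buildBoardRecord` are hypothetical, the og values are placeholders, and the link is written as https://aaa.com on the assumption that real post contents carry the full address.

```javascript
// Step 3: pick the first word in the contents that starts with "http".
function extractFirstUrl(contents) {
  return contents.split(/\s+/).find((word) => word.startsWith("http"));
}

// Step 4: combine the post with the og values before it is saved.
// `og` is assumed to be an object such as { title, image } produced by scraping.
function buildBoardRecord(post, og) {
  return { title: post.title, contents: post.contents, og };
}

const post = {
  title: "Hi there, this is my title",
  contents: "The weather's nice today. I want you guys to visit this site: https://aaa.com",
};

const url = extractFirstUrl(post.contents);
console.log(url); // https://aaa.com

// Placeholder og values stand in for what scraping would return.
console.log(buildBoardRecord(post, { title: "placeholder title", image: "placeholder.png" }));
```

If the contents hold no link, `extractFirstUrl` returns undefined, so a real API would probably skip the scraping step in that case.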

Practice

import axios from "axios"
import * as cheerio from "cheerio"   // namespace import; recent cheerio releases have no default export

async function createBoardAPI(mydata){      // mydata <== receives frontendData

    const targetUrl = mydata.contents.split(" ").filter((el) => el.startsWith("http"))[0]
    // splitting on spaces cuts the contents into words and returns them as an array => then take the one that starts with http
    // the filter leaves only the words that start with http in the array
    // taking index 0 of that array gives just the address itself

    const aaa = await axios.get(targetUrl)  // fetch the HTML of the linked page
    const $ = cheerio.load(aaa.data)        // $ is what selects and controls a specific tag
    $("meta").each((_, el) => {    // selects only the meta tags => .each works like a for loop and runs on every meta tag
        // _ : the index of the meta tag // el = element => e.g. for the third one, el is the third meta tag
        // what we need are the meta tags whose property contains og:

        const property = $(el).attr('property')
        if (property && property.startsWith("og:")){   // .each visits every meta tag, so the if statement skips the ones without an og: property
            const key = property.split(":")[1]
            // ==> split(":") splits on ':' into ['og', 'title'], so title sits at index 1
            // title --> key, "네이버" --> value
            const value = $(el).attr('content') // e.g. the word "네이버" (NAVER)
            console.log(key, value)
        }

    })
}

const frontendData = {      // what the frontend sends when a post is registered:
    title: "Hi there, this is my title 😚 ",
    contents: "The weather's nice today. I want you guys to visit this site: https://naver.com 입니다~"
}
createBoardAPI(frontendData)

onclick is an attribute.
property is also an attribute.
<meta property="og:title" content="..." />
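
To show how the property and content attributes pair up as key and value, here is a rough sketch that needs no libraries. It swaps Cheerio for a simple regular expression, which only works when property comes directly before content, as in the made-up `sampleHtml` below. `collectOgTags` is a hypothetical helper, not a Cheerio function.

```javascript
// Hypothetical HTML, shortened to the part that matters.
const sampleHtml = `
<head>
  <meta charset="utf-8" />
  <meta property="og:title" content="Example Site" />
  <meta property="og:image" content="https://example.com/preview.png" />
</head>`;

// Collect every og:* meta tag into one object: { title: ..., image: ... }.
function collectOgTags(html) {
  const og = {};
  const pattern = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"/g;
  for (const [, key, value] of html.matchAll(pattern)) {
    og[key] = value; // the part after "og:" becomes the key, content becomes the value
  }
  return og;
}

console.log(collectOgTags(sampleHtml));
```

The charset tag is skipped because it has no og: property, which is the same filtering the if statement performs in the practice code.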

When scraping is repeated continuously, it becomes crawling.
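
As a small illustration of that idea, the sketch below repeats a single scrape several times with a pause in between. `crawl`, `sleep`, and `fakeScrape` are hypothetical names, and the fake scraper stands in for a real request so the example runs offline.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// scrapeOnce is any async function that fetches the data a single time.
async function crawl(scrapeOnce, times, delayMs) {
  const results = [];
  for (let i = 0; i < times; i++) {
    results.push(await scrapeOnce(i));
    if (i < times - 1) await sleep(delayMs); // pause so the target site is not flooded
  }
  return results;
}

// Stand-in scraper so the sketch runs without network access.
const fakeScrape = async (round) => `snapshot ${round}`;

crawl(fakeScrape, 3, 10).then((snapshots) => console.log(snapshots));
```

The delay between rounds is a design choice worth keeping in a real crawler, since sending requests without any pause puts extra load on the target site.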


๐Ÿš Crawling

Puppeteer is used when I want to open a browser and then do something in it. [tool]

// Example of crawling treated as illegal, the 여기어때 (Yeogi Eottae) case: https://biz.chosun.com/topics/law_firm/2021/09/29/OOBWHWT5ZBF7DESIRKNPYIODLA/
// Sending crawling requests indiscriminately adds visitors to the target site, so it needs more memory => and eventually more computers

import puppeteer from 'puppeteer'

async function startCrawling(){  // every step has to be awaited (opening the browser, opening the window)
    const browser = await puppeteer.launch({headless: false}) // the browser window appears
    const page = await browser.newPage()    // open a new page
    await page.setViewport({width: 1280, height: 720}) // the page size can be set as well
    await page.goto("https://www.goodchoice.kr/product/search/2")   // navigates in the Chromium browser // Chrome is a browser built on top of Chromium (the two are separate browsers)
    await new Promise((resolve) => setTimeout(resolve, 1000)) // wait a moment after connecting (replaces the un-awaited page.waitForTimeout, which newer Puppeteer versions no longer provide)


    // $eval works on a single element, $$eval is for selecting several
    // '>' points to a direct child tag, e.g. "div > span" means the span is a child of the div
    const star = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > div > span", (el) => el.textContent)
    // only the number in nth-child() differs => star rating of another hotel: #poduct_list_area > li:nth-child(3) > a > div > div.name > div > span //=> so a for loop can fetch all of the data
    await new Promise((resolve) => setTimeout(resolve, 1000))   // pause between reads

    const location = (await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > p:nth-child(4)", (el)=> el.textContent)).trim()
    await new Promise((resolve) => setTimeout(resolve, 1000))

    const price = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.price > p > b", (el) => el.textContent)
    await new Promise((resolve) => setTimeout(resolve, 1000))

    console.log("⭐️ star:", star)
    console.log("📍 location:", location)
    console.log("💳 Price:", price)

    await browser.close()    // close the browser once crawling is finished
}

startCrawling()
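
Since only the number inside nth-child() changes from one hotel to the next, the selectors for a loop can be generated. This is a sketch under that assumption. `starSelector` and `starSelectors` are hypothetical helpers, and the selector string is copied from the code above as-is.

```javascript
// Selector pattern copied from the code above; only the li index changes.
function starSelector(index) {
  return `#poduct_list_area > li:nth-child(${index}) > a > div > div.name > div > span`;
}

// Build the selectors for items from..to (inclusive).
function starSelectors(from, to) {
  const selectors = [];
  for (let i = from; i <= to; i++) {
    selectors.push(starSelector(i));
  }
  return selectors;
}

// With a Puppeteer page, the list could be used like this:
//   for (const selector of starSelectors(2, 4)) {
//     console.log(await page.$eval(selector, (el) => el.textContent));
//   }
console.log(starSelectors(2, 3));
```

When every matching item is wanted at once, $$eval with a selector that drops the nth-child part is the other option mentioned in the comments.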

iframe

  • An iframe is a separate page embedded inside the page shown in the browser. (The site holds separate contents of its own.)
    --> An iframe is a completely different page.
    --> The outer shell and the inside are different documents.
  • Even if the developer gets a selector with Copy selector, a selector for something inside the iframe doesn't work when it is run against the outer site.
  • EX) If I copy the selector of a $30 product in the market, I'm trying to get data that lives inside the iframe of the market site.
    --> The site being accessed is Naver, but the data has to be pulled out of the iframe.
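
One way around this with Puppeteer is to find the iframe first and run the selector inside it. The sketch below assumes `page` is a Puppeteer Page and uses its frames(), url(), and $eval methods. `readTextInsideIframe` and the `urlPart` argument are hypothetical, and how the right iframe is recognized depends on the actual site.

```javascript
// page is assumed to be a Puppeteer Page; urlPart is a hypothetical fragment
// of the iframe's own URL that is used to recognize it.
async function readTextInsideIframe(page, urlPart, selector) {
  // page.frames() lists the main frame and every iframe on the page.
  const frame = page.frames().find((f) => f.url().includes(urlPart));
  if (!frame) return null; // no matching iframe on this page

  // The selector is evaluated inside the iframe's document, not the outer page.
  return frame.$eval(selector, (el) => el.textContent);
}
```

A call would look like `await readTextInsideIframe(page, "some-part-of-the-iframe-url", "copied selector")`, with both arguments taken from the site being inspected.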
