[BigData] Data Collection & Exploration

Y_Y · October 4, 2022



Data Collection

Considerations on Data Collection

  • Does the dataset exist (or can we build it)?
    - If NOT, the DM problem you defined cannot be carried out

  • If it exists, specify the target data source, considering
    - Availability
    - Access type (API, download) -> check for usage limits or warnings

Spend plenty of time searching for previously collected or publicly available datasets

  1. Search for an API or an existing dataset (someone may have already crawled it)

Considerations on Data Collection (Cont'd)

  • Decide the scale of the dataset & what to store
    - Reasonable amount to solve your problem
    - Storage capacity
  • Decide the collection methodology
    - Crawling web pages or using APIs
    - Multiprocessing vs. single processing
  • TIP : Collect as much data as you can store
    - Re-collection is too expensive
    - Nobody (not even you) knows in advance what will be used
    - During analysis, you may need data that you ignored at first
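
The "multiprocessing vs. single processing" choice above can be sketched as follows. This is only an illustration: `fetch_page` is a hypothetical stand-in for a real crawl or API request, and threads are used here instead of processes, on the assumption that collection is mostly waiting on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page_id):
    """Hypothetical stand-in for one crawl or API request."""
    time.sleep(0.05)  # simulate network latency
    return {"page_id": page_id, "body": f"raw content of page {page_id}"}


def collect_sequential(page_ids):
    """Single processing: fetch one page after another."""
    return [fetch_page(p) for p in page_ids]


def collect_parallel(page_ids, workers=4):
    """Parallel collection: several requests in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as page_ids
        return list(pool.map(fetch_page, page_ids))


if __name__ == "__main__":
    ids = list(range(8))
    for name, collect in [("sequential", collect_sequential),
                          ("parallel", collect_parallel)]:
        start = time.perf_counter()
        pages = collect(ids)
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(pages)} pages in {elapsed:.2f}s")
```

Parallel collection finishes sooner, but it also hits the source's usage limits sooner, so the worker count should respect whatever limits the API or site states.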


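When collecting from Twitter, results can differ by region: Korea, US (West, East).
Cloud proxies can be used to collect the data.
Store the data as raw as possible -> this leaves other directions open later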
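
One way to keep data raw is to append every response unchanged to a JSON Lines file, together with when and from which region it was collected. A minimal sketch; the record fields (`collected_at`, `region`, `raw`) and the file name are assumptions made for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_raw(path, payload, region):
    """Append one untouched payload plus collection metadata as a JSON line."""
    record = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "region": region,   # e.g. "KR", "US-West", "US-East"
        "raw": payload,     # stored as received, no fields dropped
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_raw(path):
    """Read all stored records back, one per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "tweets_raw.jsonl"
        append_raw(store, {"id": 1, "text": "hello", "lang": "ko"}, "KR")
        append_raw(store, {"id": 2, "text": "hi", "lang": "en"}, "US-West")
        print(len(load_raw(store)), "raw records stored")
```

Any filtering or field selection then happens on a copy during preprocessing, so a later change of direction does not require re-collection.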

Data Preprocessing

Data Preprocessing : Overview

  • The first step to ensure "data quality"
    - Accuracy : Correct or wrong, accurate or not
    - Completeness : Not recorded, unavailable
    - Consistency : Some modified but some not, dangling
    - Timeliness : Timely update?
    - Believability : How much can the data be trusted to be correct?
    - Interpretability : How easily the data can be understood?

-> Preprocessing is carried out through stages of choices and assumptions.

  • Manipulating data for your intended use!
    - Note : preprocessing should be conducted REASONABLY

Major Tasks in Data Preprocessing

  • Data cleaning
    - Fill
  • Data integration
  • Data transformation and data discretization
  • Data reduction

Data Cleaning

  • Data in the Real World is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
    - Incomplete
    - Noisy
    - Inconsistent
    - Intentional

Incomplete (Missing) Data

  • Data is not always available
    - ex) Many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to

How to Handle Missing Data?

  • Ignore the tuple : usually done when class label is missing (when doing classification)
  • Fill in the missing value manually
  • Fill it in automatically with
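
Two of these options can be sketched in a few lines. The notes do not say which value to fill in automatically, so the attribute mean used below is an assumption (one common choice), and the `income` and `label` field names are hypothetical.

```python
from statistics import mean


def drop_unlabeled(rows, label_key):
    """Ignore tuples whose class label is missing."""
    return [r for r in rows if r.get(label_key) is not None]


def fill_with_mean(rows, attr):
    """Fill missing values of `attr` with the mean of the observed values.

    Assumption: the mean is used as the automatic fill value.
    """
    observed = [r[attr] for r in rows if r.get(attr) is not None]
    if not observed:
        return [dict(r) for r in rows]  # nothing to estimate from
    fill = mean(observed)
    return [{**r, attr: fill if r.get(attr) is None else r[attr]}
            for r in rows]


if __name__ == "__main__":
    sales = [
        {"income": 100, "label": "buyer"},
        {"income": None, "label": "non-buyer"},
        {"income": 300, "label": None},
    ]
    print(drop_unlabeled(sales, "label"))
    print(fill_with_mean(sales, "income"))
```

Both functions return new lists and leave the input untouched, which fits the earlier advice to keep the raw data as collected.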

Dealing with Noisy Data

  • Noise : Random error or variance in a measured variable
  • Incorrect attribute values

Dealing with Noisy Data (cont'd)

  • Binning (
  • Regression
    - smooth by fitting the data to regression functions
  • Clustering
    - detect and remove outliers
  • Combined computer and human inspection
    - detect suspicious values and check by human (deal with possible outliers)
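
As an illustration of binning, the sketch below smooths values by bin means. The notes do not specify a binning variant, so equal-frequency bins and mean replacement are assumptions here.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values into equal-frequency bins and replace each value
    with the mean of its bin. The original ordering is preserved."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    # Indices of the values in ascending order of value
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    size, extra = divmod(len(values), n_bins)
    start = 0
    for b in range(n_bins):
        # The first `extra` bins take one additional element
        end = start + size + (1 if b < extra else 0)
        members = order[start:end]
        if members:
            bin_mean = sum(values[i] for i in members) / len(members)
            for i in members:
                smoothed[i] = bin_mean
        start = end
    return smoothed


if __name__ == "__main__":
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    print(smooth_by_bin_means(prices, 3))
    # → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Fewer bins smooth more aggressively; with one bin every value collapses to the overall mean.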

Data Integration

Gathering data together around a single identifier (key)

  • Data integration : combines data from multiple sources into a coherent store
  • Schema integration : ex) cust-id, cust-#
  • Entity identification problem
  • Detecting and resolving data value conflicts
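
The three points above can be sketched together: map differently named keys (`cust-id`, `cust-#`) onto one identifier, merge records that refer to the same entity, and report attribute values that disagree. The unified name `customer_id`, the sample records, and the keep-the-first-value policy are assumptions for illustration.

```python
# Schema integration: both source-specific names map to one key.
KEY_ALIASES = {"cust-id": "customer_id", "cust-#": "customer_id"}


def normalize(record):
    """Rename known key aliases; leave other attributes unchanged."""
    return {KEY_ALIASES.get(k, k): v for k, v in record.items()}


def integrate(*sources):
    """Merge records by customer_id and collect value conflicts.

    Returns (merged, conflicts), where conflicts is a list of
    (customer_id, attribute, kept_value, conflicting_value).
    """
    merged, conflicts = {}, []
    for source in sources:
        for record in map(normalize, source):
            key = record["customer_id"]
            target = merged.setdefault(key, {})
            for attr, value in record.items():
                if attr in target and target[attr] != value:
                    # Keep the first value seen; flag for human review
                    conflicts.append((key, attr, target[attr], value))
                else:
                    target[attr] = value
    return merged, conflicts


if __name__ == "__main__":
    sales = [{"cust-id": 1, "city": "Seoul"}]
    crm = [{"cust-#": 1, "city": "Busan", "income": 50}]
    merged, conflicts = integrate(sales, crm)
    print(merged)
    print(conflicts)
```

Reporting conflicts instead of silently overwriting them leaves the resolution rule (most recent, most trusted source, manual check) as an explicit decision.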