[EDA test] 04. 국가별 인터넷 사용률 데이터

svenskpotatis·2023년 10월 11일

온라인 데이터를 가져와서 풀어야 하는 문제가 있어서, 문제 제출 당시의 데이터를 기반으로 한 정답과 차이가 있었다. 파일 앞부분에 과거 데이터 파일과 함께 명시되어 있었는데, 제출하고 나서 알았다.. 맞게 풀었는데도 채점오류가 나서 채점 프로그램 파이썬 파일을 직접 확인하고, 채점조건을 비교해서 달라진 데이터를 찾아보는 노가다 방식을 통해 바뀐 데이터를 발견해서 풀었다.

문제 소개 및 데이터 준비 단계

Data 원본 출처

Target Data(CSV): Global Internet Usage(국가별 인터넷 사용률)

Source: Kaggle

DownLoad: archive.zip

Reference Data01(HTML Link): 국가별 인구(Population) Data

Source: Wiki

Reference Data02(CSV): 국가별 ISO코드 / 지역 분류(Region) Data

Source: Kaggle

DownLoad: archive.zip

# 설치 필수
%pip install pycountry
%pip install lxml

1단계: Target Data 불러오기 & 전처리

문제 1-1) Target Data 전처리 01 (5점)

위에서 읽은 DataFrame에서 Null값을 처리하고자 합니다. 아래 조건에 맞게 Null값을 처리하세요.
- 조건1: 'incomeperperson', 'internetuserate', 'urbanrate' Column(열)에 하나라도 Null값이 있다면 그 row(행)를 삭제(drop)하세요.
- 조건2: Index와 순서(order)는 변경하지 마세요.

df_target = df_target.dropna()

문제 1-2) Target Data 전처리 02 (5점)

1-1의 DataFrame(df_target)과 아래의 df_target_change_list를 이용하여 아래 조건에 맞게 국가명(컬럼명: 'country')을 변경하세요.
- 참고: 국가명을 변경하는 이유는 추후(문제 2-4) pycountry Library를 사용하여 국가코드(ex: 대한민국-KR)를 얻기 위함입니다.
- 아래 df_target_change_list는 변경 대상인 df_target의 index와 그에 맞는 국가명이 쌍(tuple)들을 값으로 가지고 있습니다.

df_target_change_list = [
    (33, 'Cabo Verde'),
    (35, 'Central African Republic'), ...

for idx, changed in df_target_change_list:
    df_target.at[idx, 'country'] = changed

문제 1-3) Target Data 전처리 03 (10점)

1-2의 DataFrame(df_target)과 pycountry Library를 이용하여 아래 조건에 맞게 국가코드를 구하세요.
- 참고: pycountry.countries
  - ISO 3166-2(전 세계 나라 및 부속 영토의 주요 구성 단위의 명칭에 고유 부호(코드)를 부여하는 국제 표준) 기준 국가별 코드를 얻을 수 있는 Python Library
  - 국가명 표기 방식에 따라 잘못된 값 또는 값 검색이 안되는 경우가 많아 주의가 필요함
    - 예시: 대한민국의 경우 'south korea', 'republic of korea', 'korea', 'korea, republic of' 등으로 표기되는데, 이 중 'korea, republic of' 로만 정확한 국가 코드를 얻을 수 있음
    - 혼선을 줄이기 위하여 문제 1-2와 같이 변경할 국가명을 제공함
  - 사용법(상세 사용법은 2.Datas의 참고사항 내 링크 참조)
    - 국가 검색 방법
      - 일반 검색: pycountry.countries.get(name=country_name) -> 하나의 결과값을 return
      - fuzzy 검색: pycountry.countries.search_fuzzy(country_name) -> 하나 이상의 결과값을 list형태로 return

import pycountry

country_name = 'korea, republic of'
country = pycountry.countries.get(name=country_name)
country

>>> Country(alpha_2='KR', alpha_3='KOR', common_name='South Korea', flag='🇰🇷', name='Korea, Republic of', numeric='410')

for idx, row in df_target.iterrows():
    country_name = row['country']
    try: 
        country = pycountry.countries.get(name=country_name)
        df_target.loc[idx, 'code'] = country.alpha_2
    except:
        country = pycountry.countries.search_fuzzy(country_name)
        df_target.loc[idx, 'code'] = country[0].alpha_2

2단계: Reference Data01: 불러오기 & 전처리 & 합치기

문제 2-1) Web Data 가져오기 (10점)

from selenium import webdriver
from bs4 import BeautifulSoup

# 페이지 접근 
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
driver = webdriver.Chrome()
driver.get(url)

table = pd.read_html(url)
df_population = table[0]

columns = ['Rank', 'Country / Dependency', 'Population', '% of the world', 'Date', 'Source (official or from the United Nations)', 'Notes']
df_population.columns = columns

문제 2-2) Population Data 전처리 01 (5점)

df_population= df_population.drop(columns=['Rank', '% of the world', 'Date',
       'Source (official or from the United Nations)', 'Notes'])

df_population = df_population.drop(0)
df_population.columns = ['country', 'population']

문제 2-3) Population Data 전처리 02 (5점)

2-2의 DataFrame(df_population)과 아래의 df_population_change_dict를 이용하여 아래 조건에 맞게 국가명(컬럼명: 'country')을 변경하세요.

df_population_change_dict = {
    'Bermuda (UK)': 'Bermuda', ...

df_population['country'] = df_population['country'].replace(df_population_change_dict)

문제 2-4) Data 합치기(10점)

df = pd.merge(df_target, df_population, on='country', how='inner')
df.sort_values(by='code', ascending=True, inplace=True)
df = df.reset_index(drop=True)

3단계: Reference Data02: 불러오기 & 전처리 & 합치기

df_region = pd.read_csv('./datas/continents2.csv')

문제 3-1) Region Data 전처리 01 (10점)

문제 3-2) Data 합치기(10점)

# 조건1, 조건2, 조건3
df_merge = pd.merge(df, df_region, on='code', how='inner')

# 조건4
df_merge.sort_values(by='code', inplace=True)

# 조건5
df_merge = df_merge.rename(columns=df_rename_dict)

# 조건6
df_merge.columns = new_col_order

# 조건7
df_merge = df_merge[new_col_order]

# 조건8
df_merge.reset_index(drop=True, inplace=True)

4단계: Data 분석하기(가중 평균 & 분산)

문제 4-1) 지역대륙별 가준 평균 구하기 (15점)

import pandas as pd

def weighted_average(df):
    weights = df['population']
    values = df[['internet_use_rate', 'income_per_person']]
    return np.average(values, weights=weights, axis=0)

df2 = df.groupby(['region', 'sub_region']).apply(weighted_average)

df2 = df2.reset_index()

df2['weighted_ave_internet'] = [i[0] for i in df2[0]]
df2['weighted_ave_income'] = [i[1] for i in df2[0]]

del df2[0]

df_result = df2.pivot_table(
    index=['region', 'sub_region']
)
df_result = df_result[['weighted_ave_internet', 'weighted_ave_income']]

문제 4-2) 특정 조건의 가준 평균 구하기 (15점)

df3 = df_merge[~df_merge['code'].isin(['CN', 'IN'])]

df3 = df3[df3['sub_region'].isin(['Eastern Asia', 'Southern Asia'])]

df4 = df3.groupby(['region', 'sub_region']).apply(weighted_average)
df4 = df4.reset_index()

df4['weighted_ave_internet'] = [i[0] for i in df4[0]]
df4['weighted_ave_income'] = [i[1] for i in df4[0]]

del df4[0]

df_result = df4.pivot_table(
    index=['region', 'sub_region']
)

df_result = df_result[['weighted_ave_internet', 'weighted_ave_income']]

svenskpotatis

이전 포스트

[EDA test] 03. 올림픽 데이터

다음 포스트