Study Notes (EDA 2)

zoe · March 29, 2023

Reviewing by going back through the code I wrote!

02 Overview of the crime data for the three Gangnam districts and loading it

https://www.data.go.kr/data/15054738/fileData.do

thousands = ',' : a number containing ',' may be read as a string, so this strips the commas and reads the value as a number
info() : check the data summary

import pandas as pd

crime_raw_data = pd.read_csv("../data/02. crime_in_Seoul.csv", 
                             thousands = ",", encoding="euc-kr")
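To see what thousands="," changes, here is a small sketch. The two-row sample and its column names are made up for illustration (they are not the real file's contents), and it is encoded as euc-kr in memory so the same options can be tried without the CSV on disk.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real crime CSV.
csv_text = '관서명,살인,절도\n중부서,2,"1,395"\n종로서,3,"1,070"\n'
raw_bytes = csv_text.encode("euc-kr")

# With thousands=",", values such as "1,395" are parsed as integers.
sample = pd.read_csv(io.BytesIO(raw_bytes), thousands=",", encoding="euc-kr")

# Without it, the same column stays text.
plain = pd.read_csv(io.BytesIO(raw_bytes), encoding="euc-kr")

print(sample["절도"].tolist())
print(sample.dtypes)
print(plain.dtypes)
```

Comparing the two dtypes printouts shows why the option matters before doing any arithmetic on the columns.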

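03 - 04 Pandas pivot_table

Pandas pivot table

index, columns, values, aggfunc
pivot_table : a summary table. The summary can include sums, means, and other statistics, and the pivot table groups them together in a meaningful way.
pd.pivot_table(dataframe, index=[""]) is equivalent to dataframe.pivot_table(index=[""]) : groups by index; several index columns can be given; values= selects the columns to aggregate (the default aggregation is the mean)
aggfunc= : applies a different function, such as a sum
columns= : splits the values into categories as columns; combinations with no data become NaN
fill_value = 0 : replaces NaN with the given value; with 0, NaN → 0
margins=True : adds an "All" row and column with the totals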
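A minimal self-contained sketch of the options above. The sales records here are invented for illustration and are not the contents of sales-funnel.xlsx.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "Manager": ["Kim", "Kim", "Lee", "Lee"],
    "Product": ["CPU", "Monitor", "CPU", "CPU"],
    "Price": [100, 50, 120, 80],
})

# Group by Manager, spread Product across columns, sum the prices,
# fill missing combinations with 0, and add the "All" totals.
table = sales.pivot_table(
    index="Manager",
    columns="Product",
    values="Price",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)
print(table)
```

Lee sold no Monitor, so that cell would be NaN without fill_value=0.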

import numpy as np  # needed for np.sum / np.mean below

df = pd.read_excel("../data/02. sales-funnel.xlsx")
df.head()

# Use the Name column as the index

#pd.pivot_table(df, index="Name")
df.pivot_table(index="Name")

# Multi-level index (Name, Rep, Manager)

df.pivot_table(index=["Name", "Rep", "Manager"])

# Multi-level index (Manager, Rep)

df.pivot_table(index=["Manager", "Rep" ])

df.head()

df.pivot_table(index=["Manager","Rep"], values="Price")

# Apply sum to the Price column

df.pivot_table(index=["Manager","Rep"], values="Price", aggfunc=np.sum)

df.pivot_table(index=["Manager","Rep"], values="Price", aggfunc=[np.sum, len])

df.head()

# Use Product as the columns

df.pivot_table(index=["Manager","Rep"],
               values="Price",columns = "Product", aggfunc=np.sum)

# fill_value: the value used in place of NaN

df.pivot_table(index=["Manager","Rep"],
               values="Price",columns = "Product", aggfunc=np.sum, fill_value=0)

# Two or more index and values columns

df.pivot_table(index=["Manager", "Rep", "Product"], values=["Price", "Quantity"],
               aggfunc=np.sum, fill_value=0)

# Two or more aggfuncs

df.pivot_table(index=["Manager", "Rep", "Product"], values=["Price", "Quantity"],
               aggfunc=[np.sum, np.mean], fill_value=0, margins=True)

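07 pip and conda commands

pip commands :
- Python's official package manager
- pip list : lists the currently installed modules
- pip install module_name : installs a module
- pip uninstall module_name : removes an installed module
- !pip list : in a Jupyter notebook, prefix the command with !
- get_ipython().system() : the form used in place of !, for example when the notebook is exported to a plain code file
conda commands :
- Using pip inside a conda environment may not track dependencies accurately, so with Anaconda it is better to manage modules with conda where possible
- conda list : lists the installed modules
- conda install module_name : installs a module
- conda uninstall module_name : removes a module
- conda install -c channel_name module_name : installs a module from the specified distribution channel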

08 Getting ready to use the Google Maps API

conda install -c conda-forge googlemaps : run in Anaconda Prompt after switching to ds_study!
- A Google Maps API key is also needed at this point

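10 - 11 Python for loops

Loops in Python
Python delimits blocks by indentation (indent)

for n in [1, 2, 3, 4]:
  write the indented code here
  everything up to where the indentation ends
  belongs to the for loop
  and
once the indentation stops, the code is no longer part of the for loop

iterrows() : a looping method well suited to Pandas
- Pandas DataFrames are mostly two-dimensional
- A plain for loop here means repeatedly specifying the n-th position, which hurts readability
- When looping over a Pandas DataFrame, iterrows() is convenient
- Just note that each iteration yields two things: the index and the row contents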
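A short sketch of the iterrows() pattern described above. The DataFrame is a made-up two-row example with hypothetical column names (lat, lng).

```python
import pandas as pd

# Hypothetical table: one row per station, indexed by station name.
stations = pd.DataFrame(
    {"lat": [37.544779, 37.548634], "lng": [127.055966, 127.044222]},
    index=["Seongsu", "Ttukseom"],
)

visited = []
# Each iteration yields a pair: the index label and the row as a Series.
for idx, row in stations.iterrows():
    visited.append((idx, row["lat"], row["lng"]))
    print(idx, row["lat"], row["lng"])
```

Unpacking into idx and row in the for statement is the "index and contents" split mentioned in the notes.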

# Simple for loop example

for n in [1, 2, 3, 4]:
    print("Number is ", n)

# Slightly more complex for loop example
for n in range(0, 10):
   print(n ** 2)

# The previous loop in one line (a list comprehension that builds the list of squares)

[n ** 2 for n in range(0, 10)]

13 Getting district (gu) information from Google Maps and organizing the data

np.nan : creates a NaN value
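A small sketch of one common use of np.nan: creating an empty column to fill in later. The column names and values are hypothetical, and this is an assumption about how np.nan was used here, not the course code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"station": ["A", "B"]})

# Create a placeholder column where every value is NaN.
df["lat"] = np.nan

# Fill in one value later, e.g. after looking it up.
df.loc[0, "lat"] = 37.5

print(np.nan == np.nan)           # False: NaN never equals itself
print(df["lat"].isna().tolist())  # use isna() to test for NaN instead
```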

14 - 15 Converting to district-level data

index_col= : sets a specific column as the index
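A minimal sketch of index_col with an in-memory CSV. The district names and numbers are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical district-level data.
csv_text = "district,murder,theft\nGangnam,3,120\nSeocho,1,95\n"

# The district column becomes the index instead of a regular column.
by_district = pd.read_csv(io.StringIO(csv_text), index_col="district")

print(by_district.index.tolist())
print(by_district.loc["Seocho", "theft"])
```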

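16 - 17 Final cleanup of the Seoul crime data

Normalization : keep the original DataFrame and build the normalized data separately
- Scaled so the maximum is 1 and the minimum is 0
np.mean() : axis = 1 gives one mean per row, axis = 0 gives one mean per column (pandas' mean() follows the same convention)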
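One way to get a 0-to-1 range is min-max scaling, sketched below. The notes do not show which formula was actually used, so treat this as an assumption; the data is made up.

```python
import pandas as pd

# Hypothetical crime counts per district.
crime = pd.DataFrame(
    {"murder": [2, 10, 6], "theft": [100, 300, 200]},
    index=["A", "B", "C"],
)

# Build a separate normalized DataFrame; the original stays untouched.
crime_norm = (crime - crime.min()) / (crime.max() - crime.min())

print(crime_norm)
print(crime)
```

Each column is scaled independently, so columns with very different magnitudes become comparable.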

np.mean(np.array([0.357143, 1.000000, 1.000000, 0.977118, 0.733773]))

np.array([[0.357143, 1.000000, 1.000000, 0.977118, 0.733773],
        [0.285714, 0.358974, 0.310078, 0.477799, 0.463880]])

np.mean(np.array([[0.357143, 1.000000, 1.000000, 0.977118, 0.733773],
        [0.285714, 0.358974, 0.310078, 0.477799, 0.463880]]))

np.mean(np.array([[0.357143, 1.000000, 1.000000, 0.977118, 0.733773],
        [0.285714, 0.358974, 0.310078, 0.477799, 0.463880]]), axis = 1)
# axis = 1 : one mean per row, axis = 0 : one mean per column (same convention as pandas' mean())
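A quick check of what each axis value returns, using a small made-up array, with pandas' DataFrame.mean included for comparison.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

col_means = np.mean(arr, axis=0).tolist()  # one value per column
row_means = np.mean(arr, axis=1).tolist()  # one value per row
print(col_means)  # [2.5, 3.5, 4.5]
print(row_means)  # [2.0, 5.0]

# The same array as a DataFrame: axis=1 also gives one value per row.
df_row_means = pd.DataFrame(arr).mean(axis=1).tolist()
print(df_row_means)  # [2.0, 5.0]
```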

18 - 21 seaborn

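plt.rcParams["axes.unicode_minus"] = False : keeps the minus sign from rendering as a broken glyph when a Korean font is set
rc("font", family="Malgun Gothic") # mac : Arial Unicode MS, applies a Korean font
sns.set_style() : sets the plot background style
- accepts "white", "dark", "whitegrid", "darkgrid", "ticks"
sns.despine(offset = 10) : removes the top and right spines and moves the remaining axes slightly away from the plot
boxplot
- sns.boxplot(x=tips["total_bill"]) : boxplot of a single column
- sns.boxplot(x="day", y="total_bill", data = tips) : sets the x-axis, y-axis, and data source
- hue= : splits by a category (column)
- palette= : color palette, e.g. Set1, Set2, Set3
swarmplot : a categorical scatter plot whose points are spread out so they do not overlap
- color="" : a grayscale value between 0 and 1, passed as a string
lmplot : shows the relationship between total_bill and tip; a scatter plot with a fitted trend line
- height = : height of each facet in inches, i.e. the plot size (renamed from size to height)
- ci = None : hides the confidence interval band
- scatter_kws={"s":100} : size of the scatter markers
- order = 2 : fits a second-order polynomial
- robust = True : fits the trend line with a robust regression that down-weights outliers; if it fails, run (conda install -c anaconda statsmodels) and try again
pivot : reshapes a table; takes index, columns, values
heatmap
- annot = True : whether to print the value in each cell
- fmt = "d" : number format, "d" : integer, "f" : float
- cmap ="" : changes the colormap
- linewidths = 0.5 : width of the lines between cells
pairplot : compares several columns at once so the pairwise relationships are visible right away; can be limited to selected columns
- kind="" : accepts {'scatter', 'kde', 'hist', 'reg'}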

# Install seaborn

# !conda install -y seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc

plt.rcParams["axes.unicode_minus"] = False 
# Keeps the minus sign from rendering as a broken glyph when a Korean font is set

rc("font", family="Malgun Gothic") # mac : Arial Unicode MS

# Setup for drawing plots in a Jupyter notebook
# %matplotlib inline is equivalent to the line below
get_ipython().run_line_magic("matplotlib","inline")

# Example 1: plotting basics

x = np.linspace(0, 14, 100) # 100 evenly spaced values from 0 to 14
x

y1 = np.sin(x)
y2 = 2 * np.sin(x + 0.5)
y3 = 3 * np.sin(x + 1.0)
y4 = 4 * np.sin(x + 1.5)

plt.figure(figsize=(10, 6))
plt.plot(x, y1, x, y2, x, y3, x, y4) # pairs of x data, y data
plt.show()

# sns.set_style() : sets the plot background style
# accepts "white", "dark", "whitegrid", "darkgrid", "ticks"

sns.set_style("white")  
plt.figure(figsize=(10, 6))
plt.plot(x, y1, x, y2, x, y3, x, y4)
plt.show()

sns.set_style("dark")  
plt.figure(figsize=(10, 6))
plt.plot(x, y1, x, y2, x, y3, x, y4)
plt.show()

sns.set_style("whitegrid")  
plt.figure(figsize=(10, 6))
plt.plot(x, y1, x, y2, x, y3, x, y4)
sns.despine(offset=10) # moves the axes slightly away from the plot
plt.show()

# Example 2: seaborn tips data
tips = sns.load_dataset("tips")  # built-in practice dataset
tips

tips.info()

# boxplot
# sns.boxplot(x=tips["total_bill"])  : boxplot of a single column
# x = dataframe[""] : values for the x-axis

plt.figure(figsize = (8, 6))
sns.boxplot(x=tips["total_bill"])


plt.show()

tips["day"].unique()

# boxplot
# sns.boxplot(x="day", y="total_bill", data = tips)  : sets the x-axis, y-axis, and data source

plt.figure(figsize = (8, 6))
sns.boxplot(x="day", y="total_bill", data=tips)


plt.show()

# boxplot hue, palette option
# hue= : splits by a category (column)
# palette = : color palette, e.g. Set1, Set2, Set3

plt.figure(figsize = (8, 6))
sns.boxplot(x="day",y="total_bill",data=tips, hue="smoker", palette="Set2")
plt.show()

# swarmplot : categorical scatter plot with non-overlapping points
# color="" : a grayscale value between 0 and 1, passed as a string

plt.figure(figsize=(8, 6))
sns.swarmplot(x="day", y="total_bill", data=tips, color="0.2")
plt.show()

# boxplot and swarmplot together
# the two can be drawn on top of each other

plt.figure(figsize = (8,6))
sns.boxplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill", data=tips, color = "0.25")
plt.show()

tips

# lmplot : shows the relationship between total_bill and tip
# lmplot : scatter plot with a fitted trend line
# height = : height of each facet in inches (renamed from size to height)

sns.set_style("darkgrid")
sns.lmplot(x="total_bill", y="tip", data=tips, height=10) 
plt.show()

# hue option

sns.set_style("darkgrid")
sns.lmplot(x="total_bill", y="tip", data = tips, height=7, hue="smoker")
plt.show()

# Example 3: flights data
flights = sns.load_dataset("flights")
flights

flights.info()

# pivot : reshapes the table; takes index, columns, values


flights = flights.pivot(index="month", columns="year",values="passengers")
flights

# heatmap
# annot = True : whether to print the value in each cell
# fmt = "d" : number format, "d" : integer, "f" : float

plt.figure(figsize = (10, 8))
sns.heatmap(data=flights, annot = True, fmt="d")
plt.show()

# colormap

plt.figure(figsize = (10, 8))
sns.heatmap(flights, annot=True, fmt="d", cmap = "YlGnBu")
plt.show()

# Example 4: iris
iris = sns.load_dataset("iris")
iris.head()

# pairplot : compares several columns at once so pairwise relationships are visible right away

sns.set(style = "ticks")
sns.pairplot(iris)
plt.show()

iris

iris["species"].unique()

# hue option

sns.pairplot(iris, hue = "species")
plt.show()

# pairplot on selected columns only

sns.pairplot(iris, x_vars=["sepal_width", "sepal_length"], 
             y_vars=["petal_width", "petal_length"])
plt.show()

# Example 5: anscombe
# lmplot

anscombe = sns.load_dataset("anscombe")
anscombe.head()

anscombe["dataset"].unique()

# ci = None : hides the confidence interval band

sns.set_style("darkgrid")
sns.lmplot(x="x", y="y", data = anscombe.query("dataset == 'I'"),
           ci = None, height=7)
plt.show()

sns.set_style("darkgrid")
sns.lmplot(x="x", y="y", data = anscombe.query("dataset == 'I'"),
           ci = None,scatter_kws={"s":100}, height=7)
plt.show()

# order option
# order = 2 : fits a second-order polynomial

sns.set_style("darkgrid")
sns.lmplot(x="x", y="y", data = anscombe.query("dataset == 'II'"), order = 2,
           ci = None, scatter_kws={"s":100},  height=7)
plt.show()

# outlier
# robust = True : fits the trend line with a robust regression that down-weights outliers; if it fails, run (conda install -c anaconda statsmodels) and try again

sns.set_style("darkgrid")
sns.lmplot(x="x", y="y", data = anscombe.query("dataset == 'III'"),
           robust = True, 
           ci = None, scatter_kws={"s":80}, height=7)
plt.show()

25 - 28 Folium map visualization

folium.Map() : the map
- location=[] : tuple or list, default None; latitude and longitude
- zoom_start= : initial zoom level, 0 ~ 18
- save("path") : saves the folium map as an HTML file
tiles option
- tiles = : changes the map style (which names are available depends on the folium version)
- "OpenStreetMap"
- "Mapbox Bright" (Limited levels of zoom for free tiles)
- "Mapbox Control Room" (Limited levels of zoom for free tiles)
- "Stamen" (Terrain, Toner, and Watercolor)
- "Cloudmade" (Must pass API key)
- "Mapbox" (Must pass API key)
- "CartoDB" (positron and dark_matter)
folium.Marker()
- creates a marker on the map
- location= : latitude and longitude
- popup= : message shown when the marker is clicked; HTML is allowed, e.g. bold text
- tooltip = : message shown on mouse hover; HTML is allowed, e.g. bold text
folium.Icon()
- icon reference : https://getbootstrap.com/docs/3.3/components/
- icon reference : https://fontawesome.com/search?m=free&o=r
folium.ClickForMarker() : creates a marker where the map is clicked
- popup= : if set, that text is shown when the new marker is clicked; if not, the latitude and longitude are shown by default
folium.LatLngPopup() : shows the latitude and longitude of the point clicked on the map
folium.Circle(), folium.CircleMarker() : draw circles
- Circle() and CircleMarker() take nearly the same options; the difference is that Circle's radius is in meters while CircleMarker's radius is in pixels
- color="" : outline color of the circle
- fill_color = "" : color inside the circle
- fill = : whether to fill the inside of the circle, True / False
folium.Choropleth
- data = : accepts a Series or DataFrame
- fill_opacity=0.5 : a value between 0 and 1
- line_opacity=1 : a value between 0 and 1
Apartment type map visualization
- Public Data Portal (https://www.data.go.kr/data/15066101/fileData.do)
Korean encodings for read_csv : encoding = "cp949", encoding="euc-kr", or encoding = "utf-8", depending on the file
df = df.reset_index(drop=True)
- drop= : whether the old index is kept as a separate column
- True : do not keep it
- False : keep it as a column (default)
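A small sketch of why reset_index(drop=True) is used after dropna(). The column name households and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"households": [120, np.nan, 300]})

# dropna() removes the NaN row but keeps the old labels, leaving a gap.
cleaned = df.dropna()
print(cleaned.index.tolist())  # [0, 2]

# drop=True renumbers the index and discards the old labels.
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]

# drop=False (the default) keeps the old labels in a new "index" column.
kept = cleaned.reset_index()
print(kept.columns.tolist())  # ['index', 'households']
```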

# Windows, mac (Intel, M1)

#!pip install folium

# If that fails on Windows

# !pip install charset
# !pip install charset-normalizer

import folium
import pandas as pd
import json

# folium.Map()

m = folium.Map(location=[37.544779, 127.055966], zoom_start=14) 
m

# save("path") : saves the folium map as an HTML file

m.save("./folium.html")

# tiles = : changes the map style
m = folium.Map(location=[37.544779, 127.055966],
               zoom_start=14,
               tiles = "OpenStreetMap"
              ) 
m

m = folium.Map(location=[37.544779, 127.055966], # Seongsu Station
               zoom_start=14,
               tiles = "OpenStreetMap"
              ) 
m

# Ttukseom Station
folium.Marker((37.548634, 127.044222)).add_to(m)
# Seongsu Station
folium.Marker(location=[37.544779, 127.055966], popup="<b>Subway</b>" # <b></b> : bold
             ).add_to(m)

# tooltip
folium.Marker(location=[37.544779, 127.055966], 
              popup="<b>Subway</b>", # <b></b> : bold
              tooltip = "성수역"
             ).add_to(m)


# html
folium.Marker(location=[37.548660, 127.058221], 
              popup="<a href='https://zero-base.co.kr/' target='_blank'>제로베이스</a>", # <a> : link that opens in a new tab
              tooltip = "모르는 길"
             ).add_to(m)


m

# folium.Icon()
# 아이콘 참고 사이트 : https://getbootstrap.com/docs/3.3/components/
# 아이콘 참고 사이트 : https://fontawesome.com/search?m=free&o=r


m = folium.Map(location=[37.544779, 127.055966], # Seongsu Station
               zoom_start=14,
               tiles = "OpenStreetMap"
              ) 
m

# icon basic
folium.Marker((37.548634, 127.044222), 
              icon=folium.Icon(color="black", icon='info-sign')
             ).add_to(m)
# icon color
folium.Marker(location=[37.544779, 127.055966],
              popup="<b>Subway</b>", # <b></b> : bold
              tooltip="Icon color",
              icon = folium.Icon(color="red", icon_color="blue", icon="cloud")
             ).add_to(m)


# Icon custom
folium.Marker(location=[37.540619, 127.069201], # Konkuk University Station
              popup="건대입구역",
              tooltip="Icon custom",
              icon = folium.Icon(color = "purple", 
                                 icon_color="white",icon="glyphicon glyphicon-signal", angle=50,
                                 prefix="glyphicon")
             ).add_to(m)


m

# folium.ClickForMarker()

m = folium.Map(location=[37.544779, 127.055966], # Seongsu Station
               zoom_start=14,
               tiles = "OpenStreetMap"
              ) 
m.add_child(folium.ClickForMarker(popup="ClickForMarker")) 
# if popup= is not set, the latitude and longitude are shown by default

# folium.LatLngPopup()
m = folium.Map(location=[37.544779, 127.055966], # Seongsu Station
               zoom_start=14,
               tiles = "OpenStreetMap"
              ) 
m.add_child(folium.LatLngPopup())

# folium.Circle(), folium.CircleMarker()

m = folium.Map(location=[37.544779, 127.055966], # Seongsu Station
               zoom_start=14,
               tiles = "OpenStreetMap"
              ) 

# Circle
folium.Circle(
    location=[37.558740, 127.045299], # Hanyang University
    radius = 100, # radius in meters
    fill = True, # fill = : whether to fill the inside of the circle, True / False
    color = "green",
    fill_color = "red",
    popup="Circle",
    tooltip="Circle"
).add_to(m)

# CircleMarker
folium.CircleMarker(
    location=[37.545879, 127.037402], # Seoul Forest
    radius = 100, # radius in pixels
    fill = True, # fill = : whether to fill the inside of the circle, True / False
    color = "blue",
    fill_color = "pink",
    popup="CircleMarker",
    tooltip="CircleMarker"
).add_to(m)


m

import json

state_date = pd.read_csv("../data/02. US_Unemployment_Oct2012.csv")
state_date.tail(2)

m = folium.Map([43, -102], zoom_start=3)

folium.Choropleth(
    geo_data="../data/02. us-states.json", # data holding the boundary coordinates
    data = state_date, # data = : accepts a Series or DataFrame
    columns=["State","Unemployment"],
    key_on="feature.id",
    fill_color="BuPu",
    fill_opacity=0.5, # a value between 0 and 1
    line_opacity=1, # a value between 0 and 1
    legend_name="Unemployment rate (%)"
).add_to(m)

m

import pandas as pd

df = pd.read_csv("../data/02. 서울특별시 동작구_주택유형별 위치 정보 및 세대수 현황_20220818.csv"
                 ,encoding="cp949") # encoding="euc-kr" also works

df

df.info()

# Drop rows that contain NaN

df = df.dropna()
df.info()

df = df.reset_index(drop=True) 
# drop= : whether the old index is kept as a separate column; True : do not keep it, False : keep it (default)

df.tail()

# Fix the 연번 and 분류 column names (remove the trailing space)
df.columns

df = df.rename(columns={"연번 ":"연번", "분류 ":"분류"})
df.columns

# Drop the 연번 column
del df["연번"]

df.tail()

df.위도

df.describe()

# folium

m = folium.Map(location=[37.50589466533131, 126.93450729567374], zoom_start=13)

for idx, rows in df.iterrows():
    
    #location
    lat, lng = rows.위도, rows.경도
    
    # Marker
    folium.Marker(
        location=[lat, lng],
        popup=rows.주소,
        tooltip=rows.분류,
        icon=folium.Icon(
            icon="home", color="lightred" if rows.세대수 >= 199 else "lightblue",
            icon_color="darkred" if rows.세대수 >= 199 else "darkblue")
        
    ).add_to(m)

    # Circle
    folium.Circle(
        location=[lat, lng],
        radius = rows.세대수 * 0.2,
        fill=True,
        color="pink" if rows.세대수 >= 518 else "green",
        fill_color= "pink" if rows.세대수 >= 518 else "green"
    ).add_to(m)

m