9. How Should We Represent Data?_LMS Node Breaker

허남철 · December 31, 2021


Arrays Are Close at Hand: Shall We Compute Some Basic Statistics?

Computing the mean

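total = 0  # initialize the running sum
count = 0  # initialize the count
numbers = input("Enter a number :  (<Enter Key> to quit)")  # read a number
while numbers != "":  # stop when an empty string is entered
    try:
        x = float(numbers)  # convert to float
        count += 1  # increase the count by 1
        total = total + x  # add to the running sum
    except ValueError:  # handle non-numeric input
        print('NOT a number! Ignored..')
    numbers = input("Enter a number :  (<Enter Key> to quit)")  # read the next number
if count > 0:
    avg = total / count  # compute the mean
    print("\n average is", avg)  # print it
else:
    print("\n no numbers were entered")  # avoid dividing by zero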
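The same running-sum logic can be tried without typing at a prompt. The sketch below is only an illustration: `mean_of_entries` is a hypothetical helper (not part of the original post) that receives the raw strings as a list instead of calling `input()`.

```python
def mean_of_entries(entries):
    """Average the entries that parse as floats; skip the rest.

    Hypothetical helper for illustration: `entries` stands in for the
    strings a user would have typed at the prompt.
    """
    total = 0.0
    count = 0
    for text in entries:
        try:
            total += float(text)
            count += 1
        except ValueError:
            # Same policy as the interactive loop: ignore non-numbers.
            continue
    # Return None instead of dividing by zero when nothing was valid.
    return total / count if count else None


print(mean_of_entries(["1", "5", "abc", "7", "8"]))  # 5.25
print(mean_of_entries(["abc"]))                      # None
```

Returning `None` for empty input is one possible design choice; raising an exception would work just as well.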

Computing the mean, standard deviation, and median with an array

Standard deviation (std)

![Untitled](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1555f5d9-ffd8-46b3-9e59-3bb7e6b279ff/Untitled.png)

x_i stands for each of the entered numbers, so every number must be stored.
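For reference, the population standard deviation computed by the `std_dev` function later in this post can be written as follows, where N is the number of entered values and x̄ is their mean. This is reconstructed from that code, not copied from the original figure.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)^2}
```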

Turning the entered numbers into an array (using a list and append)

def numbers():
    X=[]    # assign an empty list to X
    while True:
        number = input("Enter a number (<Enter key> to quit)") 
        while number !="":
            try:
                x = float(number)
                X.append(x)    # append the input, converted to float, to the list
            except ValueError:
                print('>>> NOT a number! Ignored..')
            number = input("Enter a number (<Enter key> to quit)")
        if len(X) > 1:  # return only when two or more numbers have been stored
            return X

X=numbers()

print('X :', X)

>>>
Enter a number (<Enter key> to quit) 1
Enter a number (<Enter key> to quit) 5
Enter a number (<Enter key> to quit) 7
Enter a number (<Enter key> to quit) 8
Enter a number (<Enter key> to quit) 
X : [1.0, 5.0, 7.0, 8.0]

Python's list: a mutable sequence type, implemented as a dynamic array
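A small sketch of what "mutable" and "dynamic" mean in practice; the variable name `items` is made up for this illustration.

```python
items = []                 # start with an empty list
for value in (1, 2.5, "three"):
    items.append(value)    # the list grows one element at a time

print(items)               # [1, 2.5, 'three']
print(len(items))          # 3

items[0] = 100             # mutable: elements can be replaced in place
del items[1]               # ...and removed, which shrinks the list
print(items)               # [100, 'three']
```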

Note) Lists and arrays

Data structures: List and Array

Array

import array as arr

# This is an array. You have to import the array module to use it.
myarray = arr.array('i', [1, 2, 3])   
print(type(myarray))

# Uncommenting and running the line below raises an error.
#myarray.append('4')    
# It tries to append '4' to the end of myarray, but '4' is a string, so the type does not match.

print(myarray)

myarray.insert(1, 5)    # insert 5 at the second position of myarray
print(myarray)

>>>
<class 'array.array'>
array('i', [1, 2, 3])
array('i', [1, 5, 2, 3])

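Requires import array as arr (the module ships with Python, but array is not a built-in type).
The element type is specified at creation, and elements of another type cannot be added (NumPy arrays are homogeneous in the same way).
Elements are laid out in contiguous memory, and all elements must have the same size and type.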
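The difference from a list can be observed directly. The following is a minimal sketch using only the standard `array` module; the variable names are illustrative.

```python
import array as arr

numbers = arr.array('i', [1, 2, 3])   # 'i' = signed int elements

try:
    numbers.append('4')               # a str does not fit an int array
except TypeError as exc:
    print("rejected:", exc)

numbers.append(4)                     # an int is accepted
print(numbers)                        # array('i', [1, 2, 3, 4])
print(numbers.itemsize)               # bytes per element (platform dependent)

mixed = [1, 2, 3]
mixed.append('4')                     # a list accepts any type
print(mixed)                          # [1, 2, 3, '4']
```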

Median: the value in the middle once the values are sorted by size (when the count is even, the mean of the two middle values)

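def median(nums):        # nums: parameter that receives a list
    nums.sort()          # sort the list in ascending order (this modifies the caller's list in place)
    size = len(nums)
    p = size // 2
    if size % 2 == 0:    # the list has an even number of elements
        pr = p           # index of the upper middle value
        pl = p-1         # index of the lower middle value
        mid = float((nums[pl]+nums[pr])/2)
    else:                # the list has an odd number of elements
        mid = nums[p]
    return mid

print('X :', X)
median(X)                # X is passed as the argument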
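Because `nums.sort()` reorders the list that was passed in, a variant that sorts a copy may be preferable when the original order matters. The sketch below is an illustrative alternative (the name `median_sorted_copy` is made up) and cross-checks the result against the standard library's `statistics.median`.

```python
import statistics

def median_sorted_copy(nums):
    """Median that sorts a copy, leaving the caller's list untouched."""
    ordered = sorted(nums)            # sorted() returns a new list
    size = len(ordered)
    mid = size // 2
    if size % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

data = [8.0, 1.0, 7.0, 5.0]
print(median_sorted_copy(data))       # 6.0
print(statistics.median(data))        # 6.0
print(data)                           # [8.0, 1.0, 7.0, 5.0] (order preserved)
```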

Standard deviation and mean

def means(nums):
    total = 0.0
    for i in range(len(nums)):
        total = total + nums[i]
    return total / len(nums)

means(X)

avg = means(X)

def std_dev(nums, avg):
   texp = 0.0
   for i in range(len(nums)):
       texp = texp + (nums[i] - avg)**2    # keep adding the squared difference between each number and the mean, then
   return (texp/len(nums)) ** 0.5    # return the square root of that sum divided by the count

std_dev(X,avg)
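Dividing by `len(nums)` gives the population standard deviation. As a sanity check, the sketch below recomputes it for the sample input `[1.0, 5.0, 7.0, 8.0]` and compares it with the standard library's `statistics.pstdev`; `statistics.stdev` (which divides by N - 1) is shown only for contrast.

```python
import math
import statistics

values = [1.0, 5.0, 7.0, 8.0]
mean = sum(values) / len(values)

# Population standard deviation: divide the squared deviations by N.
pop_std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

print(mean)                                              # 5.25
print(math.isclose(pop_std, statistics.pstdev(values)))  # True
# The sample standard deviation divides by N - 1, so it is larger.
print(statistics.stdev(values) > pop_std)                # True
```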

Full code: the main() function

```python
med = median(X)
avg = means(X)
std = std_dev(X, avg)
print("For the numbers you entered, {}: ".format(X))
print("the median is {}, the mean is {}, and the standard deviation is {}.".format(med, avg, std))
```

Full code

def numbers():
    X=[]
    while True:
        number = input("Enter a number (<Enter key> to quit)") 
        while number !="":
            try:
                x = float(number)
                X.append(x)
            except ValueError:
                print('>>> NOT a number! Ignored..')
            number = input("Enter a number (<Enter key> to quit)")
        if len(X) > 1:
            return X

def median(nums): 
    nums.sort()
    size = len(nums)
    p = size // 2
    if size % 2 == 0:
        pr = p
        pl = p-1
        mid = float((nums[pl]+nums[pr])/2)
    else:
        mid = nums[p]
    return mid

def means(nums):
    total = 0.0
    for i in range(len(nums)):
        total = total + nums[i]
    return total / len(nums)

def std_dev(nums, avg):
   texp = 0.0
   for i in range(len(nums)):
       texp = texp + (nums[i] - avg) ** 2
   return (texp/len(nums)) ** 0.5

def main():
    X = numbers()
    med = median(X)
    avg = means(X)
    std = std_dev(X, avg)
    print("For the numbers you entered, {}: ".format(X))
    print("the median is {}, the mean is {}, and the standard deviation is {}.".format(med, avg, std))

if __name__ == '__main__':
    main()

Enter the final boss! Do all of this in one shot with NumPy!

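Introducing NumPy

NumPy (Numerical Python): high-performance computing for scientific calculation and data analysis

pip install numpy (pip: package installer for Python, the tool for installing Python packages)

책읽는 나 : 네이버 블로그

Advantages
- It is fast.
- It uses memory efficiently and supports vector arithmetic and broadcasting through the ndarray data type.
- It provides a wide range of standard math functions that operate quickly on whole arrays without loops.
- Array data can be written to and read from disk.
- It offers linear algebra, random number generation, and Fourier transforms (integrating code written in C/C++ and Fortran).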

Key NumPy features

Creating an ndarray

import numpy as np

# A and B below end up as equivalent ndarray objects.
A = np.arange(5)
B = np.array([0,1,2,3,4])  # convert a Python list into a NumPy ndarray

# C, however, will be a bit different.
C = np.array([0,1,2,3,'4'])

# D gives the same result as A and B, but the approach used for B is recommended.
D = np.ndarray((5,), np.int64, np.array([0,1,2,3,4]))

print(A)
print(type(A))
print("--------------------------")
print(B)
print(type(B))
print("--------------------------")
print(C)
print(type(C))
print("--------------------------")
print(D)
print(type(D))

>>>
[0 1 2 3 4]
<class 'numpy.ndarray'>
--------------------------
[0 1 2 3 4]
<class 'numpy.ndarray'>
--------------------------
['0' '1' '2' '3' '4'] # only one '4' was a string, yet 0, 1, 2, 3 were all converted to strings
<class 'numpy.ndarray'> # the element types are made uniform.
--------------------------
[0 1 2 3 4]
<class 'numpy.ndarray'>

Size (size, shape, ndim)

size, shape, ndim: the number of elements, the shape of the array, and the number of axes
reshape: changes the shape of the array; the total number of elements must stay the same.
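A short sketch of these attributes, assuming NumPy is installed; it also shows what happens when `reshape` is asked for a shape whose element count does not match.

```python
import numpy as np

A = np.arange(6)                  # [0 1 2 3 4 5]
print(A.size, A.shape, A.ndim)    # 6 (6,) 1

B = A.reshape(2, 3)               # same 6 elements, now 2 rows x 3 columns
print(B.size, B.shape, B.ndim)    # 6 (2, 3) 2

try:
    A.reshape(4, 2)               # 8 slots for 6 elements: not allowed
except ValueError as exc:
    print("reshape failed:", exc)
```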

type

NumPy: ndarray.dtype returns the data type of the elements
Python: type() returns the type of the object itself

A= np.arange(6).reshape(2, 3)
print(A)
print(A.dtype)
print(type(A))
print("-------------------------")

B = np.array([0, 1, 2, 3, 4, 5])  
print(B)
print(B.dtype)
print(type(B))
print("-------------------------")

C = np.array([0, 1, 2, 3, '4', 5])
print(C)
print(C.dtype)
print(type(C))
print("-------------------------")

D = np.array([0, 1, 2, 3, [4, 5], 6])  # can an ndarray like this be created?
print(D)
print(D.dtype)
print(type(D))

>>>
[[0 1 2]
 [3 4 5]]
int64
<class 'numpy.ndarray'>
-------------------------
[0 1 2 3 4 5]
int64
<class 'numpy.ndarray'>
-------------------------
['0' '1' '2' '3' '4' '5']
<U21
<class 'numpy.ndarray'>
-------------------------
[0 1 2 3 list([4, 5]) 6] # the dtype is unified to object, the most general type
object
<class 'numpy.ndarray'>
/tmp/ipykernel_14/130055213.py:19: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  D = np.array([0, 1, 2, 3, [4, 5], 6])  # can an ndarray like this be created?

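As the warning says ('... you must specify 'dtype=object' when creating the ndarray.'), pass dtype explicitly to avoid the problem. Newer NumPy releases (1.24 and later) raise a ValueError here instead of a warning.

# Does NumPy actually change the type of the elements inside the array?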

C = np.array([0,1,2,3,'4',5])
print(C[0])
print(type(C[0]))
print(C[4])
print(type(C[4]))
print("------------------------------")

D = np.array([0,1,2,3,[4,5],6], dtype=object)
print(D[0])
print(type(D[0]))
print(D[4])
print(type(D[4]))

>>>
0
<class 'numpy.str_'> # the int was converted to str here, but
4
<class 'numpy.str_'>
------------------------------
0
<class 'int'> # here it was not converted to a list.
[4, 5]
<class 'list'>

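NumPy is smart. It always picks a single dtype for the whole array so that it can still operate like an array. When a common native type exists (here, a string type), the elements are converted to it; when none exists, the array falls back to dtype=object and keeps the original Python objects as they are.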
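The dtype NumPy settles on can be inspected directly. A minimal sketch, assuming NumPy is installed; the variable names are illustrative.

```python
import numpy as np

ints = np.array([0, 1, 2])
mixed_float = np.array([0, 1, 2.5])               # ints are promoted to float
as_objects = np.array([0, [4, 5]], dtype=object)  # no common native type

print(ints.dtype.kind)          # 'i' (integer)
print(mixed_float.dtype)        # float64
print(as_objects.dtype)         # object
print(type(as_objects[1]))      # <class 'list'>: stored as-is
```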

Special matrices

# identity matrix
np.eye(3)
>>>
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

# zero matrix
np.zeros([2,3])
>>>
array([[0., 0., 0.],
       [0., 0., 0.]])

# matrix of ones
np.ones([3,3])
>>>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

Broadcast operations

Arithmetic between an ndarray and a scalar, or between ndarrays of different shapes.

Broadcasting - NumPy v1.23.dev0 Manual

# usage example
from numpy import array, argmin, sqrt, sum
observation = array([111.0, 188.0]) # weight and height of the athlete being observed
codes = array([[102.0, 203.0], # average weight and height of each type of athlete
               [132.0, 193.0],
               [45.0, 155.0],
               [57.0, 173.0]])
diff = codes - observation    # the broadcast happens here #
dist = sqrt(sum(diff**2,axis=-1))
argmin(dist)

This operation finds the shortest distance, i.e. the row of codes that is closest to observation.
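The basic broadcasting cases (scalar, row vector, column vector) can be seen on a tiny matrix. This is a sketch with made-up values, assuming NumPy is installed.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[100], [200]])     # shape (2, 1)

print(M * 2)      # the scalar is applied to every element
print(M + row)    # row is added to each row of M
print(M + col)    # col is added to each column of M
```

In each case the smaller operand is stretched along the missing or length-1 axis without being copied by hand.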

Slicing and indexing

It works the same as with lists, except that rows and columns are sliced and indexed separately: arr[row, column]

A = np.arange(9).reshape(3,3)
print("A:", A)
>>>
A: [[0 1 2]
 [3 4 5]
 [6 7 8]]

print(A[0, 1])
>>>
1

print(A[:,2:])
>>>
[[2]
 [5]
 [8]]
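A few more indexing patterns on the same 3x3 array, as a sketch. Note the difference between indexing a column with an integer (which drops that axis) and slicing it (which keeps it).

```python
import numpy as np

A = np.arange(9).reshape(3, 3)

print(A[1])          # second row: [3 4 5]
print(A[:, 0])       # first column: [0 3 6]
print(A[:2, 1:])     # rows 0-1, columns 1-2: [[1 2] [4 5]]
print(A[-1, -1])     # last row, last column: 8
print(A[:, 2].shape, A[:, 2:].shape)   # (3,) (3, 1)
```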

random

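# Various kinds of random numbers are supported

# These examples generate pseudo-random numbers. Try running them several times.

print(np.random.random())   
# generates one random float between 0 and 1

print(np.random.randint(0,10))  
 # generates one random integer between 0 and 9

print(np.random.choice([0,1,2,3,4,5,6,7,8,9]))   
# picks one of the values in the given list at random

# These create a randomly shuffled array.
# The two calls below are functionally identical.

print(np.random.permutation(10))   
print(np.random.permutation([0,1,2,3,4,5,6,7,8,9]))

# The functions below draw random samples from a given distribution.

# This one follows a normal distribution.
print(np.random.normal(loc=0, scale=1, size=5))    
# try adjusting the mean (loc), standard deviation (scale), and sample count (size)

# This one follows a uniform distribution.
print(np.random.uniform(low=-1, high=1, size=5))  
# try adjusting the minimum (low), maximum (high), and sample count (size)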
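Since these are pseudo-random numbers, fixing the seed makes a run reproducible. The sketch below uses NumPy's `default_rng` generator interface rather than the `np.random.*` functions above; the seed value 42 is arbitrary.

```python
import numpy as np

rng_a = np.random.default_rng(42)   # 42 is an arbitrary seed
rng_b = np.random.default_rng(42)

draw_a = rng_a.integers(0, 10, size=5)   # integers in [0, 10)
draw_b = rng_b.integers(0, 10, size=5)

print(draw_a)
print(np.array_equal(draw_a, draw_b))    # True: same seed, same sequence
```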

Transpose

arr.T: swaps rows and columns (reverses the order of the axes)
np.transpose: reorders the axes as specified

A = np.arange(24).reshape(2,3,4)
print("A:", A)               # A is an array with shape (2,3,4).
print("Transpose of A:", A.T)            
print("Shape of the transpose of A:", A.T.shape) 
# The transpose of A is an array with shape (4,3,2).
>>>
A: [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Transpose of A: [[[ 0 12]
  [ 4 16]
  [ 8 20]]

 [[ 1 13]
  [ 5 17]
  [ 9 21]]

 [[ 2 14]
  [ 6 18]
  [10 22]]

 [[ 3 15]
  [ 7 19]
  [11 23]]]
Shape of the transpose of A: (4, 3, 2)

# np.transpose is the general transpose function: it lets you specify how the axes of the array should be rearranged.
# np.transpose(A, (2,1,0)) is exactly the same as A.T.

B = np.transpose(A, (2,0,1))
print("A:", A)             
# A is an array with shape (2,3,4).
print("B:", B)             
# B takes A's 3rd, 1st, and 2nd axes as its own 1st, 2nd, and 3rd axes.
print("B.shape:", B.shape)  
# B is an array with shape (4,2,3).
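The axis permutation can be checked element by element: with axes `(2, 0, 1)`, the value at `A[i, j, k]` moves to `B[k, i, j]`. A small verification sketch:

```python
import numpy as np

A = np.arange(24).reshape(2, 3, 4)
B = np.transpose(A, (2, 0, 1))

print(B.shape)                               # (4, 2, 3)
# Element A[i, j, k] ends up at B[k, i, j].
print(A[1, 2, 3], B[3, 1, 2])                # 23 23
print(np.array_equal(A.T, np.transpose(A, (2, 1, 0))))   # True
```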

Computing basic statistics with NumPy

Getting the sum, mean, standard deviation, and median with NumPy

import numpy as np

def numbers():
    X = []
    number = input("Enter a number (<Enter key> to quit)") 
    # This time, however, the constraint of requiring two or more numbers has been removed.
    while number != "":
        try:
            x = float(number)
            X.append(x)
        except ValueError:
            print('>>> NOT a number! Ignored..')
        number = input("Enter a number (<Enter key> to quit)")
    return X

def main():
    nums = numbers()       # this is a Python list
    num = np.array(nums)   # convert the list to a NumPy ndarray
    print("sum", num.sum())
    print("mean",num.mean())
    print("standard deviation",num.std())
    print("median",np.median(num))   # note that it is not num.median()

main()

>>>
Enter a number (<Enter key> to quit) 1
Enter a number (<Enter key> to quit) 2
Enter a number (<Enter key> to quit) 3
Enter a number (<Enter key> to quit) 4
Enter a number (<Enter key> to quit) 5
Enter a number (<Enter key> to quit) 6
Enter a number (<Enter key> to quit) 7
Enter a number (<Enter key> to quit) 8
Enter a number (<Enter key> to quit) 9
Enter a number (<Enter key> to quit) 10
Enter a number (<Enter key> to quit) 
sum 55.0
mean 5.5
standard deviation 2.8722813232690143
median 5.5
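The value 2.872... is the population standard deviation, NumPy's default (`ddof=0`), which matches the hand-written `std_dev` earlier. The sketch below reproduces it for 1 through 10 without the interactive prompt and shows the sample version (`ddof=1`) for comparison.

```python
import numpy as np

num = np.arange(1, 11, dtype=float)   # 1.0 ... 10.0

print(num.sum(), num.mean(), np.median(num))   # 55.0 5.5 5.5
print(num.std())          # population std (ddof=0)
print(num.std(ddof=1))    # sample std (divides by N - 1)
```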

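Converting data to matrices

A Visual Intro to NumPy and Data Representation
Libraries for converting images to matrices
- matplotlib, PIL: used to open, crop, and copy image files, read RGB color values, and carry out other image-file tasks.
  
  Simple image manipulation
- open : Image.open() : returns an object of type PIL.JpegImagePlugin.JpegImageFile when a JPEG file is opened.
- size : Image.size : the width and height, returned as a tuple
- filename : Image.filename
- crop : Image.crop((x0, y0, xt, yt)) : the start point (x0, y0) and the end point (xt, yt) of the region
- resize : Image.resize((w,h))
- save : Image.save()
  
  img_arr = np.array(img) works as expected.
  
  The PIL.Image.Image class does not inherit from list, but it defines an attribute called __array_interface__. Thanks to this, an image opened with the Pillow library can easily be converted into a NumPy ndarray.
  
  JpegImageFile Class - PIL documentation
  
  The Array Interface - NumPy v1.21 Manual
  
  Image manipulation is widely used for data augmentation.
  
  Data augmentation | TensorFlow Core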

What is structured data?

- Numbers and text: stored as number and string types
- Sequences: arrays (list, tuple, ndarray)
- Data accessed by key: called a hash, mapping, associative array, or dictionary (dict)

Designing simple fantasy-game logic with a dictionary

treasure_box = {'rope': {'coin': 1, 'pcs': 2},
                'apple': {'coin': 2, 'pcs': 10},
                'torch': {'coin': 2, 'pcs': 6},
                'gold coin': {'coin': 5, 'pcs': 50},
                'knife': {'coin': 30, 'pcs': 1},
                'arrow': {'coin': 1, 'pcs': 30}
               }
treasure_box['rope']

Data that carries its own sub-structure inside it is called structured data.
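Nested dictionaries like this are easy to walk. The sketch below adds a hypothetical helper, `total_coins`, that assumes `'coin'` is the value of one piece and `'pcs'` is the quantity; the original post does not define the fields, so that reading is an assumption.

```python
treasure_box = {'rope': {'coin': 1, 'pcs': 2},
                'apple': {'coin': 2, 'pcs': 10},
                'torch': {'coin': 2, 'pcs': 6},
                'gold coin': {'coin': 5, 'pcs': 50},
                'knife': {'coin': 30, 'pcs': 1},
                'arrow': {'coin': 1, 'pcs': 30}}

def total_coins(box):
    """Sum coin * pcs over every item (assumes 'coin' is a per-piece value)."""
    return sum(item['coin'] * item['pcs'] for item in box.values())

print(treasure_box['rope']['pcs'])   # 2
print(total_coins(treasure_box))     # 344
```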

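Structured data and Pandas

Pandas features
- Built on top of NumPy, so it fits easily into applications that already use NumPy
- Aligns data according to axis labels
- Supports many ways of indexing
- Integrated time-series functionality: one data structure that handles both time-series and non-time-series data
- Handling of missing data
- Database-style merging of data and relational operations

Series

A data structure similar to a one-dimensional array that can hold a sequence of objects. It can be created from array-like inputs (list, tuple, dict, NumPy types such as integers and floats).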

import pandas as pd
ser = pd.Series(['a','b','c',3])
ser
>>>
0    a
1    b
2    c
3    3
dtype: object

ser.values
>>>
array(['a', 'b', 'c', 3], dtype=object)

ser.index
>>>
RangeIndex(start=0, stop=4, step=1)

# The index can be given other values.
ser2 = pd.Series(['a', 'b', 'c', 3], index=['i','j','k','h'])

ser2.index = ['Jhon', 'Steve', 'Jack', 'Bob']

# The index is now shown as an Index object rather than a RangeIndex.
ser2.index
>>>
Index(['Jhon', 'Steve', 'Jack', 'Bob'], dtype='object')

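# By default, the index of a Series is a range of integers.
# It can be reassigned: the index behaves like a list while also working like the keys of a dictionary.
# That is why a Python dict can also be converted into a Series object.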
Country_PhoneNumber = {'Korea': 82, 'America': 1, 'Swiss': 41, 'Italy': 39, 'Japan': 81, 'China': 86, 'Rusia': 7}
ser3 = pd.Series(Country_PhoneNumber)
ser3
>>>
Korea      82
America     1
Swiss      41
Italy      39
Japan      81
China      86
Rusia       7
dtype: int64
# The dictionary keys become the index.

# Slicing is also possible.
ser3['Italy':]
>>>
Italy    39
Japan    81
China    86
Rusia     7
dtype: int64
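One detail worth knowing about label-based slicing: unlike positional slicing, both end labels are included. A small sketch with a shortened version of the dictionary above (the name `codes` is made up):

```python
import pandas as pd

codes = pd.Series({'Korea': 82, 'America': 1, 'Swiss': 41, 'Italy': 39})

by_label = codes['America':'Swiss']   # label slice: 'Swiss' is included
by_position = codes.iloc[1:2]         # position slice: stops before position 2

print(list(by_label.index))       # ['America', 'Swiss']
print(list(by_position.index))    # ['America']
print(codes['Swiss'])             # 41
```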

The name of a Series

Both the Series object and the Series index have a name attribute.

ser3.name = 'Country_PhoneNumber'
ser3.index.name = 'Country_Name'
ser3
>>>
Country_Name  # the name attribute of the index
Korea      82
America     1
Swiss      41
Italy      39
Japan      81
China      86
Rusia       7
Name: Country_PhoneNumber, dtype: int64 # the name attribute of the Series

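In fact, a pandas DataFrame is a collection of Series.

DataFrame

A table-like structure.
It can hold multiple columns.

→ CSV and Excel files can be converted into a DataFrame.

cf) Series: one index column and one value column / dictionary: one key column and one value column

# Convert to a Series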
data = {'Region' : ['Korea', 'America', 'Chaina', 'Canada', 'Italy'],
        'Sales' : [300, 200, 500, 150, 50],
        'Amount' : [90, 80, 100, 30, 10],
        'Employee' : [20, 10, 30, 5, 3]
        }
s = pd.Series(data)
s
>>>
# Besides the index, it has only one column.
Region      [Korea, America, Chaina, Canada, Italy]
Sales                      [300, 200, 500, 150, 50]
Amount                        [90, 80, 100, 30, 10]
Employee                         [20, 10, 30, 5, 3]
dtype: object

# Convert to a DataFrame
d = pd.DataFrame(data)
d
>>>
# It has multiple columns.

# The name of each Series is a column name of the DataFrame.
d.index=['one','two','three','four','five']
d.columns = ['a','b','c','d']
d
>>>

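The relationship between a DataFrame column and a Series can be checked directly. The sketch below uses a two-row subset of the data above to keep the output short.

```python
import pandas as pd

data = {'Region': ['Korea', 'America'],
        'Sales': [300, 200]}
d = pd.DataFrame(data)

col = d['Sales']                 # selecting one column yields a Series
print(type(col).__name__)        # Series
print(col.name)                  # Sales

d.columns = ['a', 'b']           # renaming the columns renames the Series too
print(d['b'].name)               # b
print(d.shape)                   # (2, 2)
```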

Representing structured data

Getting started with EDA using Pandas

EDA: Exploratory Data Analysis

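data  # assumed to be a DataFrame loaded beforehand (the loading step is not shown in this post)
data.head()
data.tail()
data.columns # check the column names
data.info() # check the null counts and dtype of each column

# count, mean, standard deviation (std), minimum (min), quartiles (25%, 50%, 75%), maximum (max)
data.describe()

data.isnull().sum() # number of missing values in each column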

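EDA - statistics

data['RegionName'].value_counts() # how many rows fall into each category

data['RegionName'].value_counts().sum() # sum of the per-category counts

print("Total positive cases", data['TotalPositiveCases'].sum()) # sum of a column's values

data.sum() # sum of each column across the whole DataFrame

# .corr() computes a correlation, so it needs two columns

print(data['TestsPerformed'].corr(data['TotalPositiveCases']))
print(data['TestsPerformed'].corr(data['Deaths']))
print(data['TotalPositiveCases'].corr(data['Deaths']))

data.corr() # correlation matrix; recent pandas versions may need numeric_only=True when non-numeric columns are present

# drop columns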
data.drop(['Latitude','Longitude','Country','Date','HospitalizedPatients',  'IntensiveCarePatients', 'TotalHospitalizedPatients','HomeConfinement','RegionCode','SNo'], axis=1, inplace=True)
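Because the dataset itself is not included in this post, the calls above can be tried on a tiny stand-in. The sketch below borrows three of the column names but uses made-up values purely for illustration; they are not real case counts.

```python
import pandas as pd

# Made-up values for illustration only; not real case counts.
data = pd.DataFrame({
    'RegionName': ['A', 'B', 'A', 'C'],
    'TestsPerformed': [100, 200, 300, None],
    'TotalPositiveCases': [10, 20, 30, 40],
})

print(data.isnull().sum())                    # missing values per column
print(data['RegionName'].value_counts())      # rows per region

# Rows with a missing value are left out of the pairwise correlation.
corr = data['TestsPerformed'].corr(data['TotalPositiveCases'])
print(corr)

# Restrict to numeric columns before computing the correlation matrix.
print(data.select_dtypes('number').corr())
```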

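# Statistics-related methods

count(): returns the number of values, excluding NA.
describe(): computes summary statistics.
min(), max(): compute the minimum and maximum.
sum(): computes the sum.
mean(): computes the mean.
median(): computes the median.
var(): computes the variance.
std(): computes the standard deviation.
argmin(), argmax(): return the integer position of the minimum and maximum value.
idxmin(), idxmax(): return the index label of the minimum and maximum value.
cumsum(): computes the cumulative sum.
pct_change(): computes the percentage change.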
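A few of these methods on a small Series, as a sketch with made-up values. It also shows the difference between `argmax()` (integer position) and `idxmax()` (index label).

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], index=list('abcde'))

print(s.count(), s.sum(), s.mean(), s.median())   # 5 14 2.8 3.0
print(s.argmax(), s.idxmax())                     # 4 e
print(s.cumsum().tolist())                        # [3, 4, 8, 9, 14]
```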

10 minutes to pandas - pandas 1.3.5 documentation
