4-4. Importing JSON Data and Working with APIs

Hey:D · March 5, 2022

I am a member of the second DSF cohort at 가짜연구소 (PseudoLab). These are my notes from Data Engineer course 4: Importing JSON Data and Working with APIs.

🌱 Source: DataCamp data engineering track
I may keep adding to the parts I did not know or found confusing!

Importing Data from JSON Data and Working with APIs
1. Introduction to JSON
2. Introduction to APIs
3. Working with nested JSONs
4. Combining multiple datasets
5. Wrap-up

1. Introduction to JSON

Let's see how to load JSON data into pandas, given a path or URL to a JSON file.

The focus here is on JavaScript Object Notation (JSON) data.

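JSON is a common format for transferring data over the web.
It is not tabular.
Data is organized as collections of objects, which do not all need to have the same set of attributes.
Objects are similar to Python dictionaries.
JSON can be nested: a value can itself be an object or a list of objects.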

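Because JSON data is not tabular, pandas has to do some work to load it into a dataframe. pandas automatically detects the record and column orientations, which are the ones you will encounter most often.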
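As a minimal sketch of that auto-detection, the snippet below loads the same kind of data once in record orientation and once in column orientation with pd.read_json(). The sample strings are made up for illustration and do not come from the course dataset.

```python
from io import StringIO

import pandas as pd

# Record orientation: a list of objects, one object per row.
# The second object has an extra "price" attribute (objects need not match).
records_json = (
    '[{"name": "A", "rating": 4.5},'
    ' {"name": "B", "rating": 4.0, "price": "$$"}]'
)

# Column orientation: one object per column, keyed by row label.
columns_json = (
    '{"name": {"0": "A", "1": "B"},'
    ' "rating": {"0": 4.5, "1": 4.0}}'
)

# No orient argument is needed for either layout.
from_records = pd.read_json(StringIO(records_json))
from_columns = pd.read_json(StringIO(columns_json))

print(from_records)
print(from_columns)
```

In the record-oriented frame, the missing "price" of the first object shows up as a missing value rather than causing an error.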

read_json()
dtype argument: lets you specify data types
orient keyword argument: flags uncommon JSON data layouts

To reduce file size, JSON can be column-oriented. Less common layouts, such as the split orientation, need the orient argument:

import pandas as pd

# Load split-oriented JSON: column names, index, and data are stored separately
death_causes = pd.read_json("nyc_death_causes.json",
                            orient="split")
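To make the split layout more concrete, here is a small sketch with hypothetical data (not the contents of the NYC file). A split-oriented document stores the column names, the index, and the values under three separate keys, so attribute names are not repeated for every row.

```python
from io import StringIO

import pandas as pd

# Hypothetical split-oriented JSON: "columns", "index", and "data" keys
split_json = """
{
  "columns": ["cause", "deaths"],
  "index": [0, 1],
  "data": [["Heart disease", 120], ["Influenza", 15]]
}
"""

# orient="split" tells pandas how to interpret the three keys
sample = pd.read_json(StringIO(split_json), orient="split")
print(sample)
```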

2. Introduction to APIs

Let's see how to get data from JSON files and from APIs.

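API (Application Programming Interface)
APIs are shared resources and often limit how much data you can retrieve in a given period.
Use the Requests library
requests.get(url_string): fetches data from a URL
params: takes a dictionary of parameter names and values
headers: if the API requires a user authentication key, pass it in the headers
response.json(): use this method to get just the data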

The response.json() method returns a dictionary.
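To see what params and headers actually do, the sketch below builds the request without sending it. It uses requests.Request(...).prepare(), which is not part of the course material, and the API key is a placeholder string, so nothing here calls the Yelp API.

```python
import requests

api_url = "https://api.yelp.com/v3/businesses/search"
params = {"term": "bookstore", "location": "San Francisco"}

# Placeholder key for illustration only; a real key comes from the API provider
headers = {"Authorization": "Bearer {}".format("hypothetical-api-key")}

# Build the request that requests.get() would send, without sending it
prepared = requests.Request(
    "GET", api_url, params=params, headers=headers
).prepare()

# params are URL-encoded into the query string
print(prepared.url)

# headers travel separately from the URL
print(prepared.headers["Authorization"])
```

Keeping the key in a header rather than in the URL is what the headers argument is for: the query string only carries the search parameters.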

import requests
import pandas as pd

api_url = "https://api.yelp.com/v3/businesses/search"

# Set up parameter dictionary according to documentation
params = {"term": "bookstore",
          "location": "San Francisco"}

# Set up header dictionary w/ API key according to documentation
# (api_key must already hold your own API key)
headers = {"Authorization": "Bearer {}".format(api_key)}

# Call the API
response = requests.get(api_url,
                        params=params,
                        headers=headers)

# Isolate the JSON data from the response object
data = response.json()

# Load businesses data to dataframe
bookstores = pd.DataFrame(data["businesses"])
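The last step can be tried offline with a hand-written dictionary shaped roughly like such a response. The structure and values below are assumptions for illustration, not real API output. Note that pd.DataFrame() leaves nested values as they are.

```python
import pandas as pd

# Hypothetical stand-in for response.json()
data = {
    "total": 2,
    "businesses": [
        {"name": "Shop A", "rating": 4.5,
         "coordinates": {"latitude": 37.77, "longitude": -122.42}},
        {"name": "Shop B", "rating": 4.0,
         "coordinates": {"latitude": 37.80, "longitude": -122.41}},
    ],
}

# Each dictionary in the list becomes one row
bookstores = pd.DataFrame(data["businesses"])
print(bookstores.columns.tolist())

# The nested object is kept as a dictionary inside a single column
print(type(bookstores.loc[0, "coordinates"]))
```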

3. Working with nested JSONs

Let's see how to restructure nested JSON data.

JSON contains objects made up of attribute-value pairs.
When a value is itself an object, the JSON is nested.

The blue boxes below mark the parts nested under businesses.

The nested parts can be identified this way.

pandas has functionality for flattening nested JSON.

pandas.io.json (since pandas 1.0 the function is also available as pandas.json_normalize)
json_normalize(): flattens nested data.

JSON that you would otherwise load with pd.DataFrame() can be loaded with json_normalize() instead, which returns a flattened dataframe.

import pandas as pd
import requests
# pandas.io.json.json_normalize is deprecated since pandas 1.0
from pandas import json_normalize

# Set up headers, parameters, and API endpoint
# (api_key must already hold your own API key)
api_url = "https://api.yelp.com/v3/businesses/search"
headers = {"Authorization": "Bearer {}".format(api_key)}
params = {"term": "bookstore",
          "location": "San Francisco"}

# Make the API call and extract the JSON data
response = requests.get(api_url,
                        headers=headers,
                        params=params)
data = response.json()

# Flatten data and load to dataframe, with _ separators
bookstores = json_normalize(data["businesses"], sep="_")
print(list(bookstores))
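The same flattening can be sketched without an API key by using a small hand-written list of businesses. The records below are hypothetical, and pd.json_normalize is the top-level name for the function in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical records shaped like entries of data["businesses"]
businesses = [
    {"name": "Shop A",
     "coordinates": {"latitude": 37.77, "longitude": -122.42},
     "categories": [{"alias": "bookstores", "title": "Bookstores"}]},
    {"name": "Shop B",
     "coordinates": {"latitude": 37.80, "longitude": -122.41},
     "categories": [{"alias": "bookstores", "title": "Bookstores"},
                    {"alias": "coffee", "title": "Coffee & Tea"}]},
]

# Nested objects become columns named parent_child because of sep="_"
flat = pd.json_normalize(businesses, sep="_")
print(sorted(flat.columns))

# Lists of objects are not expanded; they stay in a single column
print(type(flat.loc[0, "categories"]))
```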

However, categories is still nested.

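There are a few ways to deal with data whose values are still nested. You can write a custom flattening function, or decide that the data is not relevant to your analysis and leave it as is.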

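Or use json_normalize() arguments:
record_path: takes a string or list of strings giving the path to the nested data.
meta: takes a list of other attributes to include in the flattened dataframe.
meta_prefix: adds a prefix to the meta columns, to make their origin clear and avoid duplicate column names.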

df = json_normalize(data["businesses"],
                    sep="_",
                    record_path="categories",
                    meta=["name",
                          "alias",
                          "rating",
                          ["coordinates", "latitude"],
                          ["coordinates", "longitude"]],
                    meta_prefix="biz_")

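The data is now flat, but a business with several categories is repeated once per category. Depending on the use case, this may be fine or may need further handling. As you can see, working with JSON can get fairly lengthy.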
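The repetition is easy to see on a small hypothetical sample. In the sketch below, the business records are invented for illustration; both the business and its categories have an "alias" attribute, which is the kind of name clash meta_prefix is meant to avoid.

```python
import pandas as pd

# Hypothetical businesses, each with a list of category objects
businesses = [
    {"name": "Shop A", "alias": "shop-a",
     "coordinates": {"latitude": 37.77, "longitude": -122.42},
     "categories": [{"alias": "bookstores", "title": "Bookstores"},
                    {"alias": "coffee", "title": "Coffee & Tea"}]},
    {"name": "Shop B", "alias": "shop-b",
     "coordinates": {"latitude": 37.80, "longitude": -122.41},
     "categories": [{"alias": "bookstores", "title": "Bookstores"}]},
]

# One output row per category; business-level attributes come from meta
df = pd.json_normalize(businesses,
                       sep="_",
                       record_path="categories",
                       meta=["name", "alias", ["coordinates", "latitude"]],
                       meta_prefix="biz_")

print(sorted(df.columns))
print(df[["title", "biz_name"]])
```

Shop A has two categories, so it appears on two rows, while Shop B appears once.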

4. Combining multiple datasets

Let's look at the pandas methods for bringing together data from several places.

Appending data:

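append(): df1.append(df2) (DataFrame.append() was removed in pandas 2.0; pd.concat([df1, df2]) is the replacement)
ignore_index=True: set this when you use pandas' default row-number index, so that the rows are relabeled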

# Get first 20 bookstore results
params = {"term": "bookstore",
          "location": "San Francisco"}
first_results = requests.get(api_url,
                             headers=headers,
                             params=params).json()

first_20_bookstores = json_normalize(first_results["businesses"], sep="_")

print(first_20_bookstores.shape)  # (20, 24)

# Get the next 20 bookstores
params["offset"] = 20
next_results = requests.get(api_url,
                            headers=headers,
                            params=params).json()
next_20_bookstores = json_normalize(next_results["businesses"], sep="_")

# Put bookstore datasets together, renumber rows
# (DataFrame.append() was removed in pandas 2.0; pd.concat() is the equivalent)
bookstores = pd.concat([first_20_bookstores, next_20_bookstores],
                       ignore_index=True)
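The effect of ignore_index is easier to see on two tiny frames. This is a sketch with made-up data that uses pd.concat(), since DataFrame.append() is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Two hypothetical result pages, each with its own 0-based index
first_page = pd.DataFrame({"name": ["Shop A", "Shop B"]})
next_page = pd.DataFrame({"name": ["Shop C", "Shop D"]})

# Without ignore_index, the original row labels are kept and repeat
kept = pd.concat([first_page, next_page])
print(kept.index.tolist())

# With ignore_index=True, rows are renumbered from 0
renumbered = pd.concat([first_page, next_page], ignore_index=True)
print(renumbered.index.tolist())
```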

Another way to combine data is merging.

merge(): the pandas equivalent of a SQL join

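merge() is a dataframe method: call it as df.merge() and use the on keyword argument to specify the column(s) to join on. When the key columns have different names in the two dataframes, use left_on and right_on instead.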

merged = call_counts.merge(weather,
                           left_on="created_date",
                           right_on="date")
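Here is a small self-contained sketch of the same call. The two dataframes and their values are invented for illustration; only the column names created_date and date follow the snippet above.

```python
import pandas as pd

# Hypothetical daily call counts
call_counts = pd.DataFrame({
    "created_date": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "call_count": [4597, 4362, 3045],
})

# Hypothetical weather data, with no row for 2018-01-03
weather = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-02"],
    "tmax": [19, 26],
})

# Default merge is an inner join: only dates present in both frames survive
merged = call_counts.merge(weather,
                           left_on="created_date",
                           right_on="date")
print(merged)
```

Because the key columns have different names, both created_date and date are kept in the result.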

5. Wrap-up

For more on data wrangling with pandas, I recommend looking up the Data Manipulation with Python skill track on DataCamp.
