Robust Python - Day 3: Collection Types

놀고 싶은데, 왜 다들 공부하는거야 · June 5, 2024

robust_python

Collection Types

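A collection type stores a group of values rather than a single value; in Python the main ones are list, dict, and set. A string is also a collection, of characters. Each collection behaves differently and works on different principles, so each must be used differently. The problem is that, from a user's point of view, it is hard to know what data a collection actually holds without debugging through the code. So let's use type annotations to make it clear which collection type we are dealing with.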

1. Annotating Collections

Annotating a collection makes it easy to see both which collection type it is and what type its elements are.

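from collections import defaultdict

# Cookbook is assumed to be a class defined elsewhere with an `author` attribute.
AuthorToCountMapping = dict[str, int]

def create_author_count_mapping(
    cookbooks: list[Cookbook]
) -> AuthorToCountMapping:
    counter = defaultdict(lambda: 0)
    for book in cookbooks:
        counter[book.author] += 1
    return counter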

create_author_count_mapping takes a parameter cookbooks of type list[Cookbook]. This tells us that cookbooks is a list whose elements are Cookbook objects. The return type is AuthorToCountMapping, which is an alias for dict[str, int], so we know the function returns a dict with str keys and int values.

The alias AuthorToCountMapping is used because dict[str, int] on its own says nothing about what the collection means. As the alias name suggests, the collection maps each author to the number of books they have written.
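To try the function out end to end, here is a minimal runnable sketch. The Cookbook dataclass and the sample titles and authors are assumptions made up for illustration; the original example does not show how Cookbook is defined.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical stand-in for Cookbook; only the `author` field matters here.
@dataclass
class Cookbook:
    title: str
    author: str


# The alias names the meaning of the dict, not just its shape.
AuthorToCountMapping = dict[str, int]


def create_author_count_mapping(cookbooks: list[Cookbook]) -> AuthorToCountMapping:
    counter: AuthorToCountMapping = defaultdict(int)
    for book in cookbooks:
        counter[book.author] += 1
    return counter


# Made-up sample data.
books = [
    Cookbook("Book A", "Author X"),
    Cookbook("Book B", "Author X"),
    Cookbook("Book C", "Author Y"),
]
counts = create_author_count_mapping(books)
print(dict(counts))  # {'Author X': 2, 'Author Y': 1}
```

A reader who only sees the signature already knows what goes in and what comes out, without opening the function body.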

2. Homogeneous vs Heterogeneous collections

One characteristic of Python collections, unlike collections in many statically typed languages, is that a single collection can hold elements of different types. A collection that mixes types like this is called a heterogeneous collection, while an ordinary collection made up of a single type is called a homogeneous collection.

The problem is that a user cannot be sure whether a given collection is heterogeneous or homogeneous. Finding out means tracing through the code line by line, and if the collection turns out to be heterogeneous, each type needs its own special handling.

def adjust_recipe(recipe, servings):
    """
    Take a meal recipe and change the number of servings
    :param recipe: A list, where the first element is the number of servings,
                   and the remainder of elements follow the (name, amount, unit)
                   format, such as ("flour", 1.5, "cup")
    :param servings: the number of servings
    :return list: a new list of ingredients, where the first element is the
                  number of servings
    """
    new_recipe = [servings]
    old_servings = recipe[0]
    factor = servings / old_servings
    recipe.pop(0)
    while recipe:
        ingredient, amount, unit = recipe.pop(0)
        # please only use numbers that will be easily measurable
        new_recipe.append((ingredient, amount * factor, unit))
    return new_recipe

The code above is about as bad as it gets. The input recipe and the returned new_recipe share the same shape: both are lists whose first element is the number of servings (an int) and whose remaining elements are (name, amount, unit) tuples. In other words, they look like [2, ("flour", 1.5, "cup")].

To learn this, a client has to read the whole function or read the docstring, and unfortunately nothing checks a docstring. A docstring also cannot be fully trusted, because it may not match the code exactly.

The issue with this heterogeneous collection is that two types live in the same list: one is int and the other is tuple. We can use Union to state that the list's elements may be either of the two types.

from typing import Union

Ingredient = tuple[str, int, str]  # (name, quantity, units)
Recipe = list[Union[int, Ingredient]]  # each element is either servings or an ingredient
def adjust_recipe(recipe: Recipe, servings) -> Recipe:
    # ...

In this code, Recipe is a list whose elements are either int or Ingredient. Ingredient is a tuple of the form tuple[str, int, str].

Once the types are spelled out like this, a client can understand the code more easily than by reading a docstring, and the typechecker can help handle the special cases for each type in the heterogeneous collection. In other words, we can deal with the different behavior of each type, with the IDE and the typechecker helping along the way.
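As a rough illustration of that per-type handling, the sketch below walks a Recipe and branches with isinstance. Ingredient and Recipe repeat the aliases above; describe is a hypothetical helper that is not part of the original code.

```python
from typing import Union

Ingredient = tuple[str, int, str]  # (name, quantity, units)
Recipe = list[Union[int, Ingredient]]


def describe(recipe: Recipe) -> list[str]:
    """Hypothetical helper: turn each element into a readable line."""
    lines = []
    for item in recipe:
        if isinstance(item, int):
            # Inside this branch a typechecker treats item as int.
            lines.append(f"serves {item}")
        else:
            # Here item can only be an Ingredient tuple.
            name, amount, unit = item
            lines.append(f"{amount} {unit} of {name}")
    return lines


print(describe([2, ("flour", 2, "cup")]))
# ['serves 2', '2 cup of flour']
```

Without the Union annotation, nothing would remind the author of describe that two branches are needed.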

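Of course, a heterogeneous collection like this is still a very poor choice. Tuples are the exception. With a list you never know which type will come out next, but with a tuple each position is known to hold a specific kind of value, which makes tuples a good fit for heterogeneous data.

For example, if a single record should contain a name and a page count, it can be written like this.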

Cookbook = tuple[str, int]  # (name, page count)

A tuple is the one case where a heterogeneous collection works well: the type at each position is fixed, and there is no need to loop over it and apply the same operation to values of different types.

The tuple is accessed as follows.

food_lab: Cookbook = ("The Food Lab", 958)
odd_bits: Cookbook = ("Odd Bits", 248)

print(food_lab[0])  # The Food Lab

print(odd_bits[1])  # 248

Like tuples, dicts are also frequently used as heterogeneous collections.

food_lab = {
    "name": "The Food Lab",
    "page_count": 958
}

In this dict, name is a str and page_count is an int. Both keys are str, but the value types are str and int. This can also be expressed with Union, as follows.

def print_cookbook(cookbook: dict[str, Union[str, int]]):
    # ...

The problem is that a dict often has to hold many different value types, so the list of types inside the Union keeps growing, and maintaining it is tedious.

That is why TypedDict is used for dicts.

3. TypedDict

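TypedDict was introduced in Python 3.8. It is worth using when both of the following conditions hold.
1. A dict has to be used as a heterogeneous collection
2. The data comes in types that the developer cannot control

The first condition is clear by now, but the second is vaguer. Think of parsing data that arrives as JSON or YAML. Looking at the dict that receives it, there is no way to tell what type of value goes with a given key. In other words, the shape is outside our control. Cases like this are where TypedDict should be used.

When the developer does control the data, on the other hand, a dataclass is the better choice. If you know which type of value belongs to which key, a dataclass is more intuitive.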
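For the case where we do control the data, a dataclass version of the earlier cookbook dict might look like the sketch below. The field names are taken from the food_lab dict shown earlier; the class itself is an illustration, not code from the original.

```python
from dataclasses import dataclass


@dataclass
class Cookbook:
    name: str
    page_count: int


food_lab = Cookbook(name="The Food Lab", page_count=958)
print(food_lab.name)        # The Food Lab
print(food_lab.page_count)  # 958
```

Attribute access replaces string keys, so a misspelled field name is caught by the typechecker (and fails loudly at runtime) instead of silently becoming a new key.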

Suppose we are using another company's API. For example, to fetch nutrition information through an API called get_nutrition_from_spoonacular, we might write the following.

nutrition_information = get_nutrition_from_spoonacular(recipe_name)
# print grams of fat in recipe
print(nutrition_information["fat"]["value"])

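The problem is that we do not know what type get_nutrition_from_spoonacular returns. Since we do not know the shape of the dict, we have to dig through the docs. But there is no guarantee that the docs match the actual code, so we also have to run the code and inspect what comes back.

This is a real hassle, and it is dreadful news for whoever reviews the code, because they are left to read through the code and work out the shape of the dict on their own.

TypedDict solves this problem. Suppose get_nutrition_from_spoonacular above returns JSON data like the following.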

{
    "recipes_used": 1,
    "calories": {
        "value": 1,
        "unit": "2",
        "confidenceRange95Percent": {
            "min": 1.2,
            "max": 1.5
        },
        "standardDeviation": 0.5
    },
    "fat": {
        "value": 1,
        "unit": "1",
        "confidenceRange95Percent": {
            "min": 1.2,
            "max": 1.5
        },
        "standardDeviation": 0.5
    },
    "protein": {
        "value": 1,
        "unit": "1",
        "confidenceRange95Percent": {
            "min": 1.2,
            "max": 1.5
        },
        "standardDeviation": 0.5
    },
    "carbs": {
        "value": 1,
        "unit": "1",
        "confidenceRange95Percent": {
            "min": 1.2,
            "max": 1.5
        },
        "standardDeviation": 0.5
    }
}

To receive this JSON as a TypedDict, we can write the following.

from typing import TypedDict
class Range(TypedDict):
    min: float
    max: float

class NutritionInformation(TypedDict):
    value: int
    unit: str
    confidenceRange95Percent: Range
    standardDeviation: float

class RecipeNutritionInformation(TypedDict):
    recipes_used: int
    calories: NutritionInformation
    fat: NutritionInformation
    protein: NutritionInformation
    carbs: NutritionInformation

nutrition_information: RecipeNutritionInformation = (
    get_nutrition_from_spoonacular(recipe_name)
)

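With the TypedDict in place, the typechecker knows the expected shape of the API's JSON result, so it can flag code that deviates from that shape and prevent operations on the wrong type.

Note, however, that TypedDict only helps through the typechecker; it performs no validation at runtime.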
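The runtime behavior can be seen in a short sketch. It reuses the Range definition from above; the deliberately wrong value is made up for the demonstration.

```python
from typing import TypedDict


class Range(TypedDict):
    min: float
    max: float


# A typechecker would report this assignment, because "min" is not a float.
# At runtime nothing is checked, so no exception is raised.
bad: Range = {"min": "not a number", "max": 1.5}

print(type(bad) is dict)  # True: a TypedDict value is a plain dict
print(bad["min"])         # not a number
```

If data from an external source has to be verified while the program runs, that validation needs to be written separately; the annotation alone will not do it.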

4. Generics

Generics are also available. They are used when creating a collection that does not care which type it holds. For example, consider the following function, which returns the reverse of the list it receives.

def reverse(coll: list) -> list:
    return coll[::-1]

Because the list's element type is left blank, a reader may wonder what kind of element belongs in it. To say that any type is fine, let's use a generic. Generics require TypeVar from typing.

from typing import TypeVar
T = TypeVar('T')
def reverse(coll: list[T]) -> list[T]:
    return coll[::-1]

This says that values of type T go into the list, and T can be any type. Once T is bound to a specific type, however, only that type can be an element of the list. For example, if T is int, a str cannot go into the list; only int values can.
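A small sketch of how T gets bound per call. The variable names and sample values are made up; the commented-out line shows the kind of mismatch a typechecker is expected to reject.

```python
from typing import TypeVar

T = TypeVar("T")


def reverse(coll: list[T]) -> list[T]:
    return coll[::-1]


numbers: list[int] = reverse([1, 2, 3])  # T is int for this call
words: list[str] = reverse(["a", "b"])   # T is str for this call
print(numbers)  # [3, 2, 1]
print(words)    # ['b', 'a']

# A typechecker would reject the next line: the argument binds T to int,
# so the result is list[int], not list[str].
# wrong: list[str] = reverse([1, 2, 3])
```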

As the next example, suppose we build a Graph class. A Graph has two types, Node and Edge, and a single Node can be connected to several Edges. It can be written as follows.

from collections import defaultdict
from typing import Generic, TypeVar

Node = TypeVar("Node")
Edge = TypeVar("Edge")

# directed graph
class Graph(Generic[Node, Edge]):
    def __init__(self):
        self.edges: dict[Node, list[Edge]] = defaultdict(list)

    def add_relation(self, node: Node, to: Edge):
        self.edges[node].append(to)

    def get_relations(self, node: Node) -> list[Edge]:
        return self.edges[node]

When a class takes TypeVar types through Generic, it can be subscripted as a type, such as class[T, Y], just like Optional or Union. With Node and Edge wrapped in Generic, the class can be used as follows.

cookbooks: Graph[Cookbook, Cookbook] = Graph()
recipes: Graph[Recipe, Recipe] = Graph()

cookbook_recipes: Graph[Cookbook, Recipe] = Graph()

recipes.add_relation(Recipe("pasta1"), Recipe("pasta2"))

cookbook_recipes.add_relation(Cookbook("The Food Lab"), Recipe("pasta1"))

Graph is used like an ordinary type, with the inner types specified as [T, Y]. Here the types in use are Graph[Cookbook, Cookbook], Graph[Recipe, Recipe], and Graph[Cookbook, Recipe].

The advantage of using Generic like this is that the typechecker can verify usage. For example, cookbooks has the type Graph[Cookbook, Cookbook], so passing a different type makes the typechecker report an error.

cookbooks.add_relation(Recipe('Cheeseburger'), Recipe('Hamburger'))

Running the typechecker produces the following result.

code_examples/chapter5/invalid/graph.py:25:
    error: Argument 1 to "add_relation" of "Graph" has
           incompatible type "Recipe"; expected "Cookbook"
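To experiment with this locally, here is a self-contained sketch. The Graph class is the one shown above; the frozen Cookbook and Recipe dataclasses are assumptions made for illustration, since the original does not show how those classes are defined.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Generic, TypeVar

Node = TypeVar("Node")
Edge = TypeVar("Edge")


class Graph(Generic[Node, Edge]):
    def __init__(self) -> None:
        self.edges: dict[Node, list[Edge]] = defaultdict(list)

    def add_relation(self, node: Node, to: Edge) -> None:
        self.edges[node].append(to)

    def get_relations(self, node: Node) -> list[Edge]:
        return self.edges[node]


# Hypothetical node types; frozen=True makes them hashable, so they can be dict keys.
@dataclass(frozen=True)
class Cookbook:
    name: str


@dataclass(frozen=True)
class Recipe:
    name: str


cookbook_recipes: Graph[Cookbook, Recipe] = Graph()
cookbook_recipes.add_relation(Cookbook("The Food Lab"), Recipe("pasta1"))
print(cookbook_recipes.get_relations(Cookbook("The Food Lab")))
# [Recipe(name='pasta1')]

# Type parameters are not enforced at runtime: this call succeeds here,
# and only a typechecker reports the mismatch.
cookbook_recipes.add_relation(Recipe("pasta2"), Recipe("pasta1"))  # type: ignore
```

The last call is the point of the exercise: the error shown above comes from the typechecker, so the check only happens if the typechecker is actually run.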

Generic is often used like this to build general-purpose collection types, but it is also useful for cutting down repeated code. Consider the following example.

def get_nutrition_info(recipe: str) -> Union[NutritionInfo, APIError]:
    # ...

def get_ingredients(recipe: str) -> Union[list[Ingredient], APIError]:
    #...

def get_restaurants_serving(recipe: str) -> Union[list[Restaurant], APIError]:
    # ...

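In the example above, all three APIs return APIError on failure and their own return value on success. As more APIs of this kind are added, the Union keeps being repeated, and complicated structures with a Union inside a Union start to pile up. Generic is one good way to reduce this repetition.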

T = TypeVar("T")
APIResponse = Union[T, APIError]

def get_nutrition_info(recipe: str) -> APIResponse[NutritionInfo]:
    # ...

def get_ingredients(recipe: str) -> APIResponse[list[Ingredient]]:
    #...

def get_restaurants_serving(recipe: str) -> APIResponse[list[Restaurant]]:
    # ...

APIResponse is an alias for Union[T, APIError], where T is a type variable created with TypeVar. Because T is a placeholder, APIResponse accepts a single type argument, which is why it can be written as APIResponse[NutritionInfo].

Using APIResponse this way removes the need to repeat Union over and over, which keeps the code simple.
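On the calling side, a function that returns APIResponse is handled by checking for the error type first. The sketch below is a minimal runnable version; the APIError dataclass, the lookup table, and the recipe names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import TypeVar, Union


@dataclass
class APIError:
    """Hypothetical error type; the original does not show its definition."""
    message: str


T = TypeVar("T")
APIResponse = Union[T, APIError]


def get_ingredients(recipe: str) -> APIResponse[list[str]]:
    known = {"pasta": ["flour", "egg"]}  # made-up data standing in for a real API
    if recipe not in known:
        return APIError(f"unknown recipe: {recipe}")
    return known[recipe]


result = get_ingredients("pasta")
if isinstance(result, APIError):
    print("failed:", result.message)
else:
    # In this branch a typechecker narrows result to list[str].
    print(result)  # ['flour', 'egg']
```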

5. Modifying Existing types

Suppose we want a new dict that changes only part of the behavior of the existing dict. For example, arugula and rocket mean the same thing, so we want to store a value under arugula and get that same value back when we look up rocket.

>>> nutrition = NutritionalInformation()
>>> nutrition["arugula"] = get_nutrition_information("arugula")
>>> print(nutrition["rocket"]) # arugula is the same as rocket
{
    "name": "arugula",
    "calories_per_serving": 5,
    # ... snip ...
}

One way to do this is to build a custom dict, but implementing a dict from scratch is no small job. A better approach is to subclass the existing dict: inherit all of its behavior and override only part of it.

class NutritionalInformation(dict):
    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            pass
        for alias in get_aliases(key):
            try:
                return super().__getitem__(alias)
            except KeyError:
                pass
        raise KeyError(f"Could not find {key} or any of its aliases")

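It can be written as above. NutritionalInformation is a custom dict written by the user and a subclass of dict, so it inherits everything dict can do. What we want is for a key that means the same thing to return the value stored under the original key, so we override __getitem__(self, key), the method responsible for returning values.

All it does is call get_aliases(key) and, if an alias for the key is stored, fetch and return the value for that alias.

It may look like it works well, but there is one problem, shown below.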

# arugula is the same as rocket
>>> nutrition = NutritionalInformation()
>>> nutrition["arugula"] = get_nutrition_information("arugula")
>>> print(nutrition.get("rocket", "No Ingredient Found"))
"No Ingredient Found"

Our custom dict NutritionalInformation overrides only __getitem__, not get. So calling get still runs the original dict behavior.

To solve this, use the UserDict type from the collections module. Subclassing UserDict gives us the dict we want, and the problem above, where overriding __getitem__ is not reflected in get, goes away. In short, UserDict is a type set up to make it convenient for users to build their own dicts.

from collections import UserDict
class NutritionalInformation(UserDict):
    def __getitem__(self, key):
        try:
            return self.data[key]
        except KeyError:
            pass
        for alias in get_aliases(key):
            try:
                return self.data[alias]
            except KeyError:
                pass
        raise KeyError(f"Could not find {key} or any of its aliases")

When our dict inherits from collections.UserDict and overrides __getitem__, the get method picks up the same behavior. A nice detail is that there is no need for super: the data can be stored and accessed through self.data.

# arugula is the same as rocket
>>> print(nutrition.get("rocket", "No Ingredient Found"))
{
    "name": "arugula",
    "calories_per_serving": 5,
    # ... snip ...
}

The problem from before is resolved.
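The two approaches can be compared side by side in one runnable sketch. The alias table, the get_aliases helper, and the class names below are assumptions made for illustration. One caveat, stated with some caution: depending on the Python version, UserDict.get may check membership with __contains__ before calling __getitem__, so this sketch overrides both to behave the same either way.

```python
from collections import UserDict

# Hypothetical alias table and lookup helper.
ALIASES = {"rocket": ["arugula"], "arugula": ["rocket"]}


def get_aliases(key: str) -> list[str]:
    return ALIASES.get(key, [])


class DictBased(dict):
    def __getitem__(self, key):
        for candidate in [key, *get_aliases(key)]:
            if dict.__contains__(self, candidate):
                return dict.__getitem__(self, candidate)
        raise KeyError(key)


class UserDictBased(UserDict):
    def __getitem__(self, key):
        for candidate in [key, *get_aliases(key)]:
            if candidate in self.data:
                return self.data[candidate]
        raise KeyError(key)

    def __contains__(self, key):
        # Overridden as well, so that `in` and get() agree with __getitem__.
        return any(c in self.data for c in [key, *get_aliases(key)])


plain = DictBased()
plain["arugula"] = 5
user = UserDictBased()
user["arugula"] = 5

# dict.get bypasses the overridden __getitem__.
print(plain["rocket"], plain.get("rocket", "missing"))  # 5 missing
# UserDict.get goes through the overridden methods.
print(user["rocket"], user.get("rocket", "missing"))    # 5 5
```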

Besides UserDict, the collections module also provides UserString and UserList. If you want to customize how a collection behaves, use UserString, UserList, or UserDict. Just be aware that this comes at some cost in performance.

6. ABC

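ABC stands for abstract base class. The ABCs in the collections.abc module can be thought of as a summary of the methods that a new collection must implement. A nice feature is that once you implement the few methods an ABC requires, the rest are provided automatically.

There is no UserSet, so a custom set has to be built on collections.abc.Set. collections.abc.Set requires the following three methods, and once they are implemented, set operations (intersection, union, and so on) and equality comparison come for free.

__contains__: checks whether a value is in the set. In our implementation it also checks whether the value matches through an alias.
__iter__: used for iteration.
__len__: returns the length.

Let's inherit from collections.abc.Set and implement these three methods to build a custom set collection.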

import collections.abc
class AliasedIngredients(collections.abc.Set):
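    def __init__(self, ingredients: collections.abc.Iterable[str]):
        # Copy into a set: inherited operators such as | build their result
        # by calling the constructor with a plain iterable.
        self.ingredients = set(ingredients)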

    def __contains__(self, value: str):
        return value in self.ingredients or any(alias in self.ingredients
                                                for alias in get_aliases(value))

    def __iter__(self):
        return iter(self.ingredients)

    def __len__(self):
        return len(self.ingredients)

It can be written as above. The AliasedIngredients class inherits from collections.abc.Set and keeps its data in the self.ingredients set. The __contains__ method checks whether the value is in self.ingredients and, if not, whether one of its aliases is.

Let's check that it works.

def get_aliases(value: str) -> list[str]:
    # toy alias table: each name maps to the name it is an alias of
    aliases = {"rocket": "arugula"}
    return [aliases[value]] if value in aliases else []

ingredients = AliasedIngredients({'arugula', 'eggplant', 'pepper'})
for ingredient in ingredients:
    print(ingredient)  # 'arugula' 'eggplant' 'pepper' (set order may vary)

print(len(ingredients))  # 3

print('arugula' in ingredients)  # True

print('rocket' in ingredients)  # True

list(ingredients | AliasedIngredients({'garlic'}))  # ['pepper', 'arugula', 'eggplant', 'garlic'] (order may vary)

It works well. collections.abc offers many other ABCs too. They are not only for building new collections; they can also be used for generic type checks. For example, if a parameter should accept only types that implement __iter__, it can be written as follows.

def print_items(items: collections.abc.Iterable):
    for item in items:
        print(item)

collections.abc.Iterable is the Iterable ABC, so only types that implement the __iter__ method are accepted.
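A class does not have to inherit from the ABC to qualify; implementing __iter__ is enough. The following sketch shows this with a made-up Countdown class, which is not part of the original example.

```python
import collections.abc


def print_items(items: collections.abc.Iterable) -> None:
    for item in items:
        print(item)


class Countdown:
    """Hypothetical class that implements only __iter__ and inherits from no ABC."""

    def __init__(self, start: int) -> None:
        self.start = start

    def __iter__(self):
        return iter(range(self.start, 0, -1))


print(isinstance(Countdown(3), collections.abc.Iterable))  # True
print(isinstance(42, collections.abc.Iterable))            # False
print_items(Countdown(3))  # prints 3, 2, 1 on separate lines
```

The annotation itself is checked by the typechecker; the isinstance calls are only there to show the same rule at runtime.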

For reference, as of Python 3.9 there are 25 different ABCs available, so take a look. https://docs.python.org/3/library/collections.abc.html#module-collections.abc
