PyconKorea | Making 100% Use of Database Read Replicas

Jihun Kim · January 27, 2022

Aurora Database PyconKorea custom endpoint


This post summarizes the talk Han Jongwon gave at PyCon on how to make use of Aurora read replicas.

Why Aurora?

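Cloud infrastructure abstraction levels

SaaS
- Everything up to and including the software is rented
PaaS
- Everything up to the platform is rented
- RDS belongs here
- The cloud provider manages more for you than with IaaS
IaaS
- Only the infrastructure is rented
- Installing a DBMS yourself on EC2 belongs here
On-Premise
- No cloud is used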

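Aurora features

Up to 5x better performance than conventional RDS
Storage that scales out automatically with no downtime
Up to 15 read replicas are supported
- Attaching or detaching a replica causes no downtime
Faster read replica synchronization than RDS
- Replicas are synchronized within milliseconds
Four endpoint types
- cluster endpoint (read & write)
  - The address attached to the writer instance; both reads and writes are possible
- reader endpoint (read only)
  - The address that load-balances across the reader instances
- custom endpoint
- instance endpoint
  - The address used to connect directly to an individual instance rather than at the cluster level
  - In most cases the cluster-level endpoints are used instead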

Write and Read Replicas

Cluster (writer)/reader endpoint
- 1 writer instance, 0 reader instances
  - The cluster (writer) endpoint and the reader endpoint both point to the writer instance
- 1 writer instance, 1 reader instance
  - The cluster endpoint points to the writer
  - The reader endpoint points to the reader
- 1 writer instance, 2 reader instances
  - The cluster endpoint points to the writer
  - The reader endpoint points to the reader instances (however many there are)
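The cases above can be written down as a tiny model. This is only an illustrative sketch of the mapping described in the talk, not a description of how AWS implements it; the function name `endpoint_targets` and the instance names are made up for the example.

```python
def endpoint_targets(writer, readers):
    """Return which instances each cluster-level endpoint points to.

    Toy model of the cases listed above; all names are hypothetical.
    """
    return {
        "cluster": [writer],
        # With no reader instances, the reader endpoint falls back to the writer.
        "reader": list(readers) if readers else [writer],
    }


print(endpoint_targets("writer-1", []))
print(endpoint_targets("writer-1", ["reader-1", "reader-2"]))
```

The point to notice is the first case: with zero readers, an application that sends reads to the reader endpoint is still talking to the writer.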

(Note)

Two weeks before the talk (September 2019), AWS announced an extension that allows two or more writers to be attached to a single cluster (multi-writer).

Django DB Router

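The IP addresses behind the reader and writer endpoints change with the cluster configuration, yet a ping right after changing the configuration sometimes does not show the correct IP address.

This is because DNS results are cached at the OS level.

The speaker's method for checking how the actual IP addresses of the cluster/reader endpoints have changed (on macOS) is as follows.
- Flush the entire DNS table held locally and resolve the name again, which returns the IP address currently in use.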
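The address an endpoint currently resolves to can also be inspected from Python. The following is a minimal sketch using only the standard library, not the command shown in the talk; the helper name `resolve_ips` is hypothetical, and `localhost` stands in for a real cluster or reader endpoint hostname. The lookup still goes through the OS resolver, so a stale OS-level cache can affect the result in the same way.

```python
import socket


def resolve_ips(hostname):
    """Return the sorted, de-duplicated IP addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Replace "localhost" with the cluster or reader endpoint hostname to compare them.
print(resolve_ips("localhost"))
```

Running it against both endpoints before and after adding a reader instance would show whether the two names resolve to the same address or to different ones.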

Django has built-in DB router support for working with two or more databases.

import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.getenv("DB_NAME"),
        "USER": os.getenv("DB_USER"),
        "PASSWORD": os.getenv("DB_PASSWORD"),
        "HOST": os.getenv("DB_HOST"),  # cluster (writer) endpoint
        "PORT": os.getenv("DB_PORT"),
    },
    "replica": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.getenv("DB_NAME"),
        "USER": os.getenv("DB_USER"),
        "PASSWORD": os.getenv("DB_PASSWORD"),
        "HOST": os.getenv("DB_REPLICA_HOST"),  # reader endpoint (variable name is an assumption)
        "PORT": os.getenv("DB_PORT"),
    },
}

DATABASE_ROUTERS = ['my_app.router.Router']

Multiple connections can be added to DATABASES in the Django settings.
Note that 'replica' can be renamed freely, but it is best not to rename 'default'.
- This is because third-party libraries often have the name 'default' hard-coded.
DATABASE_ROUTERS takes a list of the routers you have written.

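Router configuration that distributes read requests across both the writer and reader instances

The last two functions (allow_relation and allow_migrate) can be left as they are. Only db_for_read and db_for_write need to be modified, and db_for_read is usually the one that changes.

The approach given as an example in the Django docs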

import random

class Router:
    @staticmethod
    def db_for_read(model, **hints):
        databases = ['default', 'replica']
        return random.choice(databases)

    @staticmethod
    def db_for_write(model, **hints):
        return 'default'
    
    @staticmethod
    def allow_relation(obj1, obj2, **hints):
        return True
    
    @staticmethod
    def allow_migrate(db, app_label, model_name=None, **hints):
        return True

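Sending every read request to 'replica' alone can overload it. Since default (the writer instance) can also serve reads, the db_for_read function above is set up to send half of the reads to 'default' and half to 'replica'.
However, this approach causes problems when DB locks are taken explicitly.
- If a lock is held on the reader while a write goes to the writer
  - the writer will try to synchronize the change to the reader.
  - But the reader is locked and unavailable, so the change cannot be synchronized.
- Conversely, if a lock is held on the writer, a write is made on the writer, and the change is then checked on the reader
  - the write has not been committed yet, so the reader cannot get the latest data from the writer.
  - Consistency is broken.
  - For example, you transfer money, check your balance, and see no change.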
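The second case, reading your own uncommitted write from a reader, can be reproduced with a toy model. This is a simplified sketch with made-up classes (`ToyPrimary`, `ToyReplica`), not real database behavior: it assumes the replica only ever sees data that the writer has committed.

```python
class ToyPrimary:
    """Toy writer: changes are visible to replicas only after commit()."""

    def __init__(self, data):
        self.committed = dict(data)
        self.pending = {}

    def write(self, key, value):
        self.pending[key] = value  # uncommitted change inside a transaction

    def read(self, key):
        # The writer's own transaction sees its pending change.
        return self.pending.get(key, self.committed[key])

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()


class ToyReplica:
    """Toy reader: only ever sees what the primary has committed."""

    def __init__(self, primary):
        self.primary = primary

    def read(self, key):
        return self.primary.committed[key]


primary = ToyPrimary({"balance": 100})
replica = ToyReplica(primary)

primary.write("balance", 70)     # transfer 30 inside an open transaction
stale = replica.read("balance")  # read routed to the reader: still the old value
fresh = primary.read("balance")  # read routed to the writer: sees the change
print(stale, fresh)
```

With random routing, whether the balance check returns `stale` or `fresh` depends on which connection the router happens to pick, which is why such bugs are intermittent.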

Fixing the transaction problem

import random
from django.db import transaction

class Router:
    @staticmethod
    def db_for_read(model, **hints):
        # Reads inside an open transaction must see its own writes, so use the writer.
        conn = transaction.get_connection('default')
        if conn.in_atomic_block:
            return 'default'

        databases = ['default', 'replica']
        return random.choice(databases)

    @staticmethod
    def db_for_write(model, **hints):
        return 'default'
    
    @staticmethod
    def allow_relation(obj1, obj2, **hints):
        return True
    
    @staticmethod
    def allow_migrate(db, app_label, model_name=None, **hints):
        return True

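This checks whether the current thread has an open transaction on the 'default' connection.
If even one transaction is open, there may be uncommitted changes, so the read must not go to a reader; the router is modified to send it to 'default' (the writer) in that case.
However, this setup runs into a problem once one more replica is added.
- Traffic is designed to be split 50% each between 'default' and 'replica', so with two reader instances the 'replica' half is split again and each reader receives only 25% of the read requests.
- In other words, even when readers are added, the writer keeps handling 50% and the readers share the remaining 50% among themselves.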
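The arithmetic above generalizes to any number of readers. A small sketch, assuming the router picks 'default' or 'replica' with equal probability and the reader endpoint balances evenly across readers; the function name `read_share` is made up for the example.

```python
from fractions import Fraction


def read_share(reader_count):
    """Return (writer share, share per reader) of read traffic.

    Assumes a 50/50 choice between 'default' and 'replica', with the
    'replica' half split evenly across reader_count reader instances.
    """
    writer = Fraction(1, 2)
    per_reader = Fraction(1, 2) / reader_count
    return writer, per_reader


for n in (1, 2, 4):
    writer, per_reader = read_share(n)
    print(n, writer, per_reader)
```

The writer's share stays at one half no matter how large `reader_count` becomes, so adding readers never relieves the writer.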

Solving the traffic distribution problem

import random
from django.db import transaction

class Router:
    @staticmethod
    def db_for_read(model, **hints):
        conn = transaction.get_connection('default')
        if conn.in_atomic_block:
            return 'default'

        # When a replica is added, add another 'replica' entry to databases.
        databases = ['default', 'replica', 'replica']

        return random.choice(databases)

    @staticmethod
    def db_for_write(model, **hints):
        return 'default'
    
    @staticmethod
    def allow_relation(obj1, obj2, **hints):
        return True
    
    @staticmethod
    def allow_migrate(db, app_label, model_name=None, **hints):
        return True

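Adding another 'replica' entry to the databases list when a replica is added makes the writer and the two readers each take one third of the traffic.
- The drawback is that an element has to be added to the list every time a replica is added.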
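One way to avoid editing the list by hand is to build it from a reader count. This is an assumption of this write-up rather than something from the talk: `READER_COUNT` is a hypothetical environment variable, and `read_candidates` and `pick_read_db` are illustrative helper names.

```python
import os
import random
from collections import Counter

# READER_COUNT is a hypothetical environment variable for this sketch.
READER_COUNT = int(os.getenv("READER_COUNT", "2"))


def read_candidates(reader_count):
    """One 'default' entry for the writer plus one 'replica' entry per reader."""
    return ["default"] + ["replica"] * reader_count


def pick_read_db(reader_count=READER_COUNT):
    """Pick a database alias so every instance gets an equal share of reads."""
    return random.choice(read_candidates(reader_count))


# With 2 readers, 'replica' should be chosen about two thirds of the time.
counts = Counter(pick_read_db(2) for _ in range(30_000))
print(counts)
```

This removes the manual list edit, but the count still has to be kept in sync with the cluster, so the underlying maintenance burden remains.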

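Set up a custom endpoint

This feature was added in November 2018.

A custom endpoint is one you create yourself, and you can freely choose which instances sit behind it.

Endpoints can be arranged in a variety of configurations to suit your needs.

The Router configuration then becomes simple, as shown below.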

from django.db import transaction

class Router:
    @staticmethod
    def db_for_read(model, **hints):
        conn = transaction.get_connection('default')
        if conn.in_atomic_block:  # Read queries issued inside a transaction are handled by the writer.
            return 'default'

        return 'custom'  # Just return 'custom', which is defined beforehand in DATABASES.

    @staticmethod
    def db_for_write(model, **hints):
        return 'default'
    
    @staticmethod
    def allow_relation(obj1, obj2, **hints):
        return True
    
    @staticmethod
    def allow_migrate(db, app_label, model_name=None, **hints):
        return True

With this, incoming read requests are distributed well across the reader instances.
Even when load would otherwise call for a more powerful writer, adding readers can spread the traffic.
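The router returns 'custom', so a matching entry has to exist in DATABASES. The source does not show that entry; the following is a sketch of what it might look like, where `DB_CUSTOM_HOST` and the helper `db_config` are hypothetical names and the custom endpoint hostname is assumed to come from the environment.

```python
import os


def db_config(host_env):
    """Build one connection entry; only the host differs between entries."""
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.getenv("DB_NAME"),
        "USER": os.getenv("DB_USER"),
        "PASSWORD": os.getenv("DB_PASSWORD"),
        "HOST": os.getenv(host_env),
        "PORT": os.getenv("DB_PORT"),
    }


DATABASES = {
    # Cluster (writer) endpoint.
    "default": db_config("DB_HOST"),
    # Custom endpoint; DB_CUSTOM_HOST is a hypothetical variable name.
    "custom": db_config("DB_CUSTOM_HOST"),
}
```

Because membership of the custom endpoint is managed on the Aurora side, adding or removing a reader would not require touching this setting or the router.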

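Conclusion

Drawbacks

Third-party library issues

oauth2_provider and django_cache are not prepared for multiple databases.

Therefore, the router has to check which app a read request comes from and always send requests from those apps to 'default'. (This may have changed since the talk.)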

from django.db import transaction

class Router:
    @staticmethod
    def db_for_read(model, **hints):
        # Route by third-party library: these apps always read from the writer.
        if model._meta.app_label == 'oauth2_provider':
            return 'default'
        if model._meta.app_label == 'django_cache':
            return 'default'

        conn = transaction.get_connection('default')
        if conn.in_atomic_block:
            return 'default'

        return 'custom'
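The routing decision can be exercised without a database by pulling it into a plain function. This is a sketch for illustration only: `choose_read_db`, `fake_model`, and the `shop` app label are made-up names, and the transaction state is passed in as an argument instead of being read from the connection.

```python
from types import SimpleNamespace

# Apps whose reads should always go to the writer (the two named above).
PRIMARY_ONLY_APPS = {"oauth2_provider", "django_cache"}


def choose_read_db(model, in_atomic_block):
    """Routing decision only; in a real router, in_atomic_block would come
    from the 'default' connection as in the code above."""
    if model._meta.app_label in PRIMARY_ONLY_APPS:
        return "default"
    if in_atomic_block:
        return "default"
    return "custom"


def fake_model(app_label):
    # Stand-in for a Django model class, exposing only _meta.app_label.
    return SimpleNamespace(_meta=SimpleNamespace(app_label=app_label))


print(choose_read_db(fake_model("oauth2_provider"), in_atomic_block=False))
print(choose_read_db(fake_model("shop"), in_atomic_block=True))
print(choose_read_db(fake_model("shop"), in_atomic_block=False))
```

Keeping the app labels in one set means another library that turns out not to handle multiple databases can be added in a single place.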

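Unexpected lock-related problems come up from time to time
- They are not easy to debug.
- Originally, read queries were sent to the writer as well as the readers; to resolve this, the router was changed so that read queries go to 'replica' as much as possible.
  - That is, it was changed to use 'replica' instead of 'custom'.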

Advantages

There is no longer any need to scale up.
- For SELECT queries, adding read replicas (scaling out) raises overall DB throughput.
- Scaling out involves no downtime, whereas scaling up causes downtime.
Idle read replicas are put to better use.

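Reference video

Django DB Router로 Database Read Replicas 100% 활용기 및 Troubleshooting 경험 공유 (Making 100% Use of Database Read Replicas with the Django DB Router, and Sharing Troubleshooting Experience) - 한종원 (Han Jongwon) - PyCon.KR 2019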