Generating Dummy Data with Python Faker (1/2)

dong-il · August 11, 2022


📍 Intro

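For our Hanghae99 project, my teammates and I planned to work with a large dataset and show how well the service performed and how much we improved it. The project's subject was clothing data, so I considered crawling Musin-O, but its robots.txt disallows it... Not wanting to cause trouble for a portfolio project, I dropped crawling for now! We considered changing the subject and dug through public datasets too, but none had the volume we wanted... So in the end we decided to generate dummy data with Python's Faker.
In this post I'll translate(?) and summarize the official Faker documentation! To be honest, Papago does the translating.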

Faker

Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.

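Summary
Faker is a Python package that generates fake data. It is useful for bootstrapping a database, creating good-looking XML documents, filling a persistence layer in order to stress test it, or anonymizing data taken from a production service.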

version : v13.15.1
Python Tests : passing
coverage : 99%
license : MIT

Compatibility

Starting from version 4.0.0, Faker dropped support for Python 2 and from version 5.0.0 only supports Python 3.6 and above. If you still need Python 2 compatibility, please install version 3.0.1 in the meantime, and please consider updating your codebase to support Python 3 so you can enjoy the latest features Faker has to offer. Please see the extended docs for more details, especially if you are upgrading from version 2.0.4 and below as there might be breaking changes.

This package was also previously called fake-factory which was already deprecated by the end of 2016, and much has changed since then, so please ensure that your project and its dependencies do not depend on the old package.

Summary

Faker dropped Python 2 support in version 4.0.0, and from version 5.0.0 it supports only Python 3.6 and above. If you still need Python 2, install version 3.0.1 for now and consider moving your codebase to Python 3. If you are upgrading from version 2.0.4 or below, check the extended docs, since there may be breaking changes.

The package used to be called fake-factory, which was deprecated by the end of 2016. Much has changed since then, so make sure your project and its dependencies do not depend on the old package.

Basic Usage

pip install Faker : installs the Faker package
Basic code

from faker import Faker
fake = Faker()

fake.name()
# 'Lucy Cechtelar'

fake.address()
# '426 Jordy Lodge
#  Cartwrightshire, SC 88120-6700'

fake.text()
# 'Sint velit eveniet. Rerum atque repellat voluptatem quia rerum. Numquam excepturi
#  beatae sint laudantium consequatur. Magni occaecati itaque sint et sit tempore. Nesciunt
...

Each call to method fake.name() yields a different (random) result. This is because faker forwards faker.Generator.method_name() calls to faker.Generator.format(method_name).

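Summary
Calling fake.name() gives a different random result each time. Under the hood, a call such as faker.Generator.method_name() is forwarded to faker.Generator.format(method_name), which looks up the formatter by name and runs it.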

Provider

Each of the generator properties (like name, address, and lorem) are called “fake”. A faker generator has many of them, packaged in “providers”.

Summary
Each generator property (such as name, address, and lorem) is called a "fake". A faker generator has many of them, packaged into "providers".

from faker import Faker
from faker.providers import internet

fake = Faker()
fake.add_provider(internet)

print(fake.ipv4_private())

Output

192.168.232.8

Localization

faker.Faker can take a locale as an argument, to return localized data. If no localized provider is found, the factory falls back to the default LCID string for US english, ie: en_US.

Summary
faker.Faker accepts a locale argument and returns localized data. If no localized provider is found, the factory falls back to the default locale, US English (en_US).

from faker import Faker

fake = Faker('ko_KR')

for _ in range(10):
    print(fake.name())

Output

이준서
이서영
배우진
김지우
박성호
강상호
윤영일
박건우
성영철
백경숙

faker.Faker also supports multiple locales. New in v3.0.0.

from faker import Faker
fake = Faker(['it_IT', 'en_US', 'ja_JP'])
for _ in range(10):
    print(fake.name())

Output

鈴木 陽一
Leslie Moreno
Emma Williams
渡辺 裕美子
Marcantonio Galuppi
Martha Davis
Kristen Turner
中津川 春香
Ashley Castillo
山田 桃子

Optimizations

The Faker constructor takes a performance-related argument called use_weighting. It specifies whether to attempt to have the frequency of values match real-world frequencies (e.g. the English name Gary would be much more frequent than the name Lorimer). If use_weighting is False, then all items have an equal chance of being selected, and the selection process is much faster. The default is True.

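Summary
The Faker constructor takes a performance-related argument, use_weighting. It controls whether Faker tries to make the frequency of values match real-world frequencies (for example, the English name Gary would come up far more often than Lorimer). With use_weighting set to False, every item is equally likely to be picked and selection is much faster. The default is True.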

How to create a Provider

from faker import Faker
from faker.providers import BaseProvider

fake = Faker()

class MyProvider(BaseProvider):
    def foo(self) -> str:
        return 'bar'

fake.add_provider(MyProvider)

print(fake.foo())

Output

bar

How to create a Dynamic Provider

Dynamic providers can read elements from an external source.

Summary
Dynamic providers can read their elements from an external source.

from faker import Faker
from faker.providers import DynamicProvider

medical_professions_provider = DynamicProvider(
    provider_name="medical_profession",
    elements=["dr.", "doctor", "nurse", "surgeon", "clerk"],
)

fake = Faker()

fake.add_provider(medical_professions_provider)

for i in range(10):
    print(fake.medical_profession())

Output

surgeon
dr.
doctor
clerk
clerk
surgeon
dr.
dr.
nurse
clerk

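The difference from BaseProvider seems to be that you can use it right away without defining a class.
Usage notes
1. from faker.providers import DynamicProvider : import DynamicProvider
2. Set the provider name with provider_name, assign a list of values to elements, and store the result in a variable
3. fake.add_provider(variable holding the dynamic provider) : register the provider
4. fake.<provider_name>() : when generating values, call the method named after the provider_name set on the DynamicProvider.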

How to customize the Lorem Provider

You can provide your own sets of words if you don’t want to use the default lorem ipsum one. The following example shows how to do it with a list of words picked from cakeipsum

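Summary
If you don't want the default lorem ipsum vocabulary, you can supply your own word list. The example below does so with a list of words picked from cakeipsum.

Lorem ipsum (lipsum for short) is the standard filler text used in publishing and graphic design to demonstrate graphic elements and visual presentation such as fonts, typography, and layout. It also serves as placeholder copy in mockups of visual design projects before the real content is written. Sometimes the term is used for anything that merely takes up space.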

from faker import Faker
fake = Faker()

my_word_list = [
'danish','cheesecake','sugar',
'Lollipop','wafer','Gummies',
'sesame','Jelly','beans',
'pie','bar','Ice','oat' ]

fake.sentence()

for i in range(10):
    print(fake.sentence(ext_word_list=my_word_list))

Output

Oat beans Lollipop Jelly.
Cheesecake Jelly pie.
Lollipop oat Jelly sesame cheesecake sesame wafer.
Beans sesame Ice sugar Jelly Lollipop cheesecake.
Gummies pie wafer bar Gummies Ice sugar.
Sesame cheesecake cheesecake oat.
Jelly Ice bar sugar Jelly.
Oat cheesecake Lollipop Lollipop Ice.
Ice Lollipop Gummies Gummies.
Pie wafer oat Ice Lollipop Lollipop bar Jelly.

How to use with Factory Boy

Factory Boy already ships with integration with Faker. Simply use the factory.Faker method of factory_boy:

Summary
Factory Boy already ships with Faker integration. Just use factory_boy's factory.Faker method.

import factory
from myapp.models import Book

class BookFactory(factory.Factory):
    class Meta:
        model = Book

    title = factory.Faker('sentence', nb_words=4)
    author_name = factory.Faker('name')

- I'm not sure how to use this one yet..

Accessing the random instance

The .random property on the generator returns the instance of random.Random used to generate the values:

Summary
The generator's .random property returns the random.Random instance used to generate values:

from faker import Faker
fake = Faker()
fake.random
fake.random.getstate()

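Output (nothing is printed when this runs as a script, because neither expression is wrapped in print())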

By default all generators share the same instance of random.Random, which can be accessed with from faker.generator import random. Using this may be useful for plugins that want to affect all faker instances.

Summary
By default all generators share one random.Random instance, which you can import with from faker.generator import random. This can be handy for plugins that want to affect every faker instance.

- I'm not sure how to use this one yet..

Unique values

Through use of the .unique property on the generator, you can guarantee that any generated values are unique for this specific instance.

Summary
Going through the generator's .unique property guarantees that the generated values are unique for that specific instance.

from faker import Faker
fake = Faker()
names = [fake.unique.first_name() for i in range(500)]
assert len(set(names)) == len(names)

With unique in the call chain, every value is unique, so there are no duplicates.

Calling fake.unique.clear() clears the already seen values. Note, to avoid infinite loops, after a number of attempts to find a unique value, Faker will throw a UniquenessException. Beware of the birthday paradox, collisions are more likely than you’d think.

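Summary
Calling fake.unique.clear() clears the values seen so far. To avoid infinite loops, Faker throws a UniquenessException after a number of failed attempts to find a unique value. Watch out for the birthday paradox: collisions are more likely than you'd think.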
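To get a feel for the birthday paradox mentioned above, the chance of at least one collision can be computed with plain Python. This is a generic calculation that assumes every value is equally likely; it does not use Faker, and the pool sizes are just the classic textbook numbers.

```python
def collision_probability(draws: int, pool_size: int) -> float:
    """Probability that at least two of `draws` uniform picks coincide."""
    if draws > pool_size:
        return 1.0  # more draws than distinct values: a collision is certain
    p_all_unique = 1.0
    for i in range(draws):
        p_all_unique *= (pool_size - i) / pool_size
    return 1.0 - p_all_unique


# Classic case: 23 people, 365 possible birthdays.
print(round(collision_probability(23, 365), 3))  # 0.507
```

The same effect applies to any generator with a limited pool of values: duplicates start showing up long before the number of draws approaches the pool size.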

from faker import Faker

fake = Faker()
for i in range(3):
    # Raises a UniquenessException
    fake.unique.boolean()

In addition, only hashable arguments and return values can be used with .unique.

Summary
Only hashable arguments and return values can be used with .unique.

A boolean has only two values, True and False, so asking unique for three of them raises a UniquenessException.

Seeding the Generator

When using Faker for unit testing, you will often want to generate the same data set. For convenience, the generator also provide a seed() method, which seeds the shared random number generator. Calling the same methods with the same version of faker and seed produces the same results.

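Summary
When you use Faker for unit testing, you will often want to generate the same data set every time. For convenience, the generator provides a seed() method that seeds the shared random number generator. Calling the same methods with the same faker version and the same seed produces the same results.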

from faker import Faker
fake = Faker()
Faker.seed(4321)

print(fake.name())
# 'Margaret Boehm'

Output

Jason Brown

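seed() seeds the shared random number generator, and sure enough, running the script again keeps printing the same value. The official docs show 'Margaret Boehm' in the comment, but my actual result was 'Jason Brown'. That is presumably because my Faker version differs from the one used in the docs, which fits the docs' note that results are not guaranteed to stay consistent across versions.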

Each generator can also be switched to its own instance of random.Random, separate to the shared one, by using the seed_instance() method, which acts the same way. For example:

Summary
Each generator can also switch to its own random.Random instance, separate from the shared one, by calling seed_instance(), which otherwise works the same way. For example:

from faker import Faker
fake = Faker()
fake.seed_instance(4321)

print(fake.name())
# 'Margaret Boehm'

Output

Jason Brown

Please note that as we keep updating datasets, results are not guaranteed to be consistent across patch versions. If you hardcode results in your test, make sure you pinned the version of Faker down to the patch number.

If you are using pytest, you can seed the faker fixture by defining a faker_seed fixture. Please check out the pytest fixture docs to learn more.

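Summary
Because the datasets keep being updated, results are not guaranteed to be consistent across patch versions. If you hardcode results in your tests, pin Faker down to the patch version.

If you use pytest, you can seed the faker fixture by defining a faker_seed fixture. See the pytest fixture docs for details.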

Standard Providers

faker.providers
faker.providers.address
faker.providers.automotive
faker.providers.bank
faker.providers.barcode
faker.providers.color
faker.providers.company
faker.providers.credit_card
faker.providers.currency
faker.providers.date_time
faker.providers.file
faker.providers.geo
faker.providers.internet
faker.providers.isbn
faker.providers.job
faker.providers.lorem
faker.providers.misc
faker.providers.person
faker.providers.phone_number
faker.providers.profile
faker.providers.python
faker.providers.ssn
faker.providers.user_agent
