OCR Data Annotation with PPOCRLabel (for Windows)

최가윤 · July 19, 2023

Let's draw BBoxes on our own data with the annotation program provided by PaddlePaddle!

PPOCRLabel, the tool used in this post, is an open-source annotation tool from PaddlePaddle that supports many languages, including Korean. It is widely recommended because PaddlePaddle is mature and PPOCRLabel is simple to use.

This post covers how to install and run PPOCRLabel locally on Windows.

1. Virtual Environments with Conda

PPOCRLabel installation tends to run into errors, so I recommend using a virtual environment if possible. Python is currently supported up to 3.10, but I installed a lower version to be safe.

# Create New Folder
mkdir Paddle
cd Paddle

# Create New Conda Environment
conda create -n paddle python=3.8
conda activate paddle

# Update pip
pip3 install --upgrade pip
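
After activating the environment, it can help to confirm that the interpreter you are about to install into is the one you expect. The following is a minimal sketch, not part of the original setup: the upper bound of 3.10 comes from the note above, while the lower bound of 3.8 is only the version used in this post and is an assumption, not an official minimum.

```python
import sys

# Assumed range: 3.10 is the upper limit mentioned in this post;
# 3.8 is simply the version installed here, not a documented minimum.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 10)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the assumed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    # The path should point inside the 'paddle' conda environment.
    print(sys.executable)
    print(is_supported_python())
```

If the printed path is not inside the `paddle` environment, the environment was probably not activated in the current shell.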

2. Install Paddle

Before installing PPOCRLabel, install PaddlePaddle. You can use the command in the PaddleOCR GitHub README.md, or find the version that matches your environment in the [Installation Document].

Install

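The catch is that Paddle often installs fine while PPOCRLabel still fails to run! The errors come from the priority between Paddle and PPOCR; they vary widely, and even the fixes suggested by Paddle did not resolve them for me. So if you hit an error, I recommend reinstalling with a different installation method rather than trying to fix it by adjusting versions.

🥲 Errors I ran into

1. opencv version error
Environment
- Linux, CUDA 11.7
- Occurred with both the GitHub README.md command and the Installation Document command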

## Installation Code
# from Github
python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple

# from Installation Document
conda install paddlepaddle-gpu==2.5.0 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge

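This happened a while ago, so I no longer have the error messages.
I first tried to install on a Linux server in order to use the GPU, but ran into several opencv version errors. I tried downgrading opencv, changing the path, and other fixes, but none of them worked. Since running the command opens a program window, my guess is that running it on an SSH server made the errors worse and harder to resolve.

Paddle suggests the following fixes; in short, try changing the opencv version.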
If paddleocr is installed with whl, it has a higher priority than calling PaddleOCR class with paddleocr.py, which may cause an exception if whl package is not updated.

For Linux users, if you get an error starting with objc[XXXXX] when opening the software, it proves that your opencv version is too high. It is recommended to install version 4.2:

pip install opencv-python==4.2.0.32

If you get an error starting with Missing string id, you need to recompile resources:

pyrcc5 -o libs/resources.py resources.qrc

If you get an error module 'cv2' has no attribute 'INTER_NEAREST', you need to delete all opencv related packages first, and then reinstall the 4.2.0.32 version of headless opencv

pip install opencv-contrib-python-headless==4.2.0.32

2. proto version error
Environment
- Windows 11
- Occurred with the Installation Document command

# Installation Code
conda install paddlepaddle==2.5.0 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
...
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Like error 1, this is a package version error, so I did not spend effort trying to fix it.
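
For readers who do want to try the first workaround named in the error message (protobuf 3.20.x or lower), the sketch below checks which protobuf version is installed in the current environment. It was not part of my troubleshooting and is untested against this exact setup; the helper names are made up for illustration, and only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata


def parse_major_minor(version):
    """Turn '3.20.3' into (3, 20); only the first two numeric parts are used."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def needs_protobuf_downgrade(version):
    """True if the version is newer than the 3.20.x line named in the error."""
    return parse_major_minor(version) > (3, 20)


def installed_protobuf_version():
    """Return the installed protobuf version string, or None if absent."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_protobuf_version()
    if current is None:
        print("protobuf is not installed in this environment")
    elif needs_protobuf_downgrade(current):
        print(f"protobuf {current} is newer than 3.20.x")
    else:
        print(f"protobuf {current} is 3.20.x or lower")
```

If the check reports a newer version, something like `pip install "protobuf==3.20.*"` would follow the message's first suggestion, though I have not verified that it makes PPOCRLabel run.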

😀 Final installation method

Environment
- Windows 11
- Command from the GitHub README.md
- Installing with python3 raised an error, so I installed with python

## From Github
# If you only have cpu on your machine, please run the following command to install
python -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
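
Because a successful install does not guarantee a working setup, a quick check before moving on can save time. The sketch below is an optional addition rather than a step I ran for this post: `is_installed` and `verify_paddle` are hypothetical helper names, and the only Paddle call used is `paddle.utils.run_check()`, Paddle's own installation self-check.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


def verify_paddle():
    """Run Paddle's built-in self-check if the package is importable."""
    if not is_installed("paddle"):
        print("paddle was not found; is the 'paddle' environment active?")
        return False
    import paddle

    # Prints whether PaddlePaddle is installed and working.
    paddle.utils.run_check()
    return True


# Usage (inside the activated environment):
#     verify_paddle()
```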

3. Install PPOCRLabel

Once PaddlePaddle is installed, PPOCRLabel can finally be installed.

# Install 
pip install PPOCRLabel

# Select label mode and run 
PPOCRLabel  # [Normal mode] for [detection + recognition] labeling

The first time PPOCRLabel runs, it downloads various language files, so it takes a little while before the program window below appears.

PPOCRLabel Program Page

4. Run PPOCRLabel

Korean Support Error

PPOCRLabel's Auto Recognition feature produces a first-pass annotation, so you don't have to start from scratch. To annotate a Korean dataset, you first need to change the OCR model setting.

Choose OCR model

Korean

The default model is Chinese & English. The model selection window does offer a Korean option, but, surprisingly, clicking it force-quits the program.

ppocr ERROR: lang korean is not support, we only support dict_keys(['en', 'ch']) for layout models

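The error occurs because Korean is supported only for detection and recognition, not for layout. I wondered whether I could enter a det/rec-only mode to get around it, but there is no such feature. Instead, the fix is to add a fake model to the layout section of the source code.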
according to this error, that's because the Structure model only support ['ch', 'en'], so if you only use the OCR model, you can add a fake model to table and layout params in here and then reinstall paddleocr by this way - Github ISSUES 10008

To add the fake model, edit the code in C:\Users\{Windows account name}\miniconda3\envs\paddle\Lib\site-packages\paddleocr_\paddleocr.py as follows.

# Original
'PP-StructureV2': {
            ...
            'layout': {
                'en': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt'
                },
                'ch': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_cdla_dict.txt'
                },
            }
        }

# Changed with Fake Model 
'PP-StructureV2': {
            ...
            'layout': {
                'en': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt'
                },
                'ch': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_cdla_dict.txt'
                },
                'korean': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_cdla_dict.txt'
                }
            }
        }

I pasted the ch entry as-is, thinking that Chinese, as another Asian language, would work at least somewhat better, but en works too. After the change, run PPOCRLabel again and the Korean model works normally!
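
The reason the edit works can be read off the error message: the language is checked against the keys of the layout table (`dict_keys(['en', 'ch'])`), so any entry registered under `korean` gets past the check. The sketch below illustrates the same change on a stand-in dictionary. `layout_models` and `add_fake_layout_model` are names made up for this example, and running it does not modify the installed package; editing paddleocr.py by hand is still what I actually did.

```python
import copy

# Stand-in for the 'layout' section shown above (same URLs and dict paths).
layout_models = {
    "en": {
        "url": "https://paddleocr.bj.bcebos.com/ppstructure/models/layout/"
               "picodet_lcnet_x1_0_fgd_layout_infer.tar",
        "dict_path": "ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt",
    },
    "ch": {
        "url": "https://paddleocr.bj.bcebos.com/ppstructure/models/layout/"
               "picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar",
        "dict_path": "ppocr/utils/dict/layout_dict/layout_cdla_dict.txt",
    },
}


def add_fake_layout_model(layout, new_lang, source_lang="ch"):
    """Register new_lang by copying an existing entry; others are untouched."""
    if new_lang not in layout:
        layout[new_lang] = copy.deepcopy(layout[source_lang])
    return layout


add_fake_layout_model(layout_models, "korean")
print(sorted(layout_models))  # ['ch', 'en', 'korean']
```

Passing `source_lang="en"` copies the English entry instead, which matches the observation that en works as well.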

The recognition results do contain typos, but this is still far more convenient than drawing every BBox and typing every transcription from scratch, which is what makes PPOCRLabel worth using.

Run Recognition

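AutoRecognition in PPOCRLabel is used as follows.

1. Top right: File > Open Dir > select the folder containing the dataset. All image files in that folder are loaded into the File List.
2. Bottom right: click Auto Recognition. Every image in the File List is annotated. The more data there is, the longer it takes, so be patient. The speed is better than expected even on CPU.
3. Save the results with Ctrl+S. They are saved as a Label.txt file in the folder selected in step 1.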

# Label.txt
{"transcription": "GPT-3는", "points": [[301, 13], [487, 13], [487, 58], [301, 58]], "difficult": false}, 
{"transcription": "6o논?", "points": [[503, 13], [670, 10], [671, 57], [504, 60]], "difficult": false}, 
{"transcription": "미래를", "points": [[688, 9], [820, 9], [820, 61], [688, 61]], "difficult": false}, 
{"transcription": "어떻게", "points": [[829, 5], [959, 9], [958, 62], [828, 58]], "difficult": false}, 
{"transcription": "바꿀까", "points": [[976, 12], [1102, 12], [1102, 58], [976, 58]], "difficult": false}
...
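
To use these annotations in your own code, the file has to be parsed. The sketch below assumes each line of Label.txt holds an image path, a tab, and then a JSON list of entries like the ones shown above; that line layout is my assumption and is not visible in the excerpt, so check your own file first. The file name in `sample_line` is made up, while the two annotations are taken from the sample.

```python
import json


def parse_label_line(line):
    """Split one Label.txt line into (image_path, list of annotation dicts)."""
    image_path, annotations_json = line.rstrip("\n").split("\t", 1)
    return image_path, json.loads(annotations_json)


def bounding_rect(points):
    """Axis-aligned (x_min, y_min, x_max, y_max) around a 4-point box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical line: the path is invented, the annotations are from the sample.
sample_line = (
    'images/sample.png\t'
    '[{"transcription": "GPT-3는", '
    '"points": [[301, 13], [487, 13], [487, 58], [301, 58]], '
    '"difficult": false}, '
    '{"transcription": "미래를", '
    '"points": [[688, 9], [820, 9], [820, 61], [688, 61]], '
    '"difficult": false}]'
)

path, annotations = parse_label_line(sample_line)
for ann in annotations:
    print(ann["transcription"], bounding_rect(ann["points"]))
```

When reading a real file, open it with `encoding="utf-8"` so that Korean transcriptions are decoded correctly on Windows.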

BBoxes can be edited as follows. Shortcuts make this much easier!

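Edit a recognition result: click the box whose text you want to change, then press Ctrl+E
Delete a BBox: click the box you want to remove, then press Backspace
Add a BBox
- Create RectBox (W): creates a rectangle. Click a starting point and drag
- Create PolygonBox (Q): creates a polygon. Click four points in order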
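
In the Label.txt sample, every box is stored as four corner points, and the axis-aligned ones run clockwise from the top-left corner. Assuming a RectBox is saved in that same order (an inference from the sample, not something I confirmed in the source), a rectangle given by two dragged corners maps to the four-point form as in this small sketch; `rect_to_points` is a hypothetical helper.

```python
def rect_to_points(x1, y1, x2, y2):
    """Expand two opposite corners into four points, clockwise from top-left."""
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return [[left, top], [right, top], [right, bottom], [left, bottom]]


# Same result whichever corner the drag starts from.
print(rect_to_points(301, 13, 487, 58))
print(rect_to_points(487, 58, 301, 13))
```

A PolygonBox, by contrast, keeps the four points exactly as clicked, which is why some sample entries are slightly skewed.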

PPOCRLabel offers many other features as well, so explore them when you have time!

2 comments

happy

July 19, 2023

Great post. Thank you for sharing.

jeong jihoon

July 27, 2023

On Ubuntu, adding korean to the layout section as described above still doesn't work. Do you happen to know why?