tesseract - extracting text from images

숲사람 · March 23, 2022

Installation

Both the Python module (pytesseract) and the tesseract executable itself must be installed on the machine.

 $ pip3 install pytesseract
 $ sudo apt install tesseract-ocr

On macOS:

 $ brew install tesseract
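
To confirm that both pieces are in place, a quick check along these lines may help. This is only a minimal sketch: `find_tesseract` is a hypothetical helper name, and it assumes the executable is called `tesseract` and is reachable through PATH.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract executable not found; install it with apt or brew")
    else:
        print("tesseract executable:", path)
    try:
        import pytesseract  # noqa: F401  (the Python wrapper installed with pip3)
        print("pytesseract module: OK")
    except ImportError:
        print("pytesseract module not found; run: pip3 install pytesseract")
```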

Code

# This Python file uses the following encoding: utf-8
from PIL import Image
from pytesseract import image_to_string

img = Image.open('sample.jpg')
text = image_to_string(img)
print(text)

Usage example in Python

https://developer.ibm.com/tutorials/document-scanner/
https://m.blog.naver.com/samsjang/220694855018

# This Python file uses the following encoding: utf-8
import sys

from PIL import Image
from pytesseract import image_to_string

# Path of the image to read, given as the first command-line argument
filename = sys.argv[1]
img = Image.open(filename)
# lang='kor' requires the Korean language data (kor.traineddata)
text = image_to_string(img, lang='kor')
# image_to_string already returns a str, so it can be printed directly
print(text)

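Conditions for good Korean recognition

Recognition accuracy drops on multi-line text, so cut the image up and feed it one line at a time.
The background must be free of noise.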
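
The two conditions above can be sketched with Pillow as follows. This is a rough sketch under stated assumptions, not a tuned pipeline: `binarize`, `split_rows` and `ocr_lines` are hypothetical helper names, the threshold of 160 is an arbitrary starting value, and the text lines are assumed to be evenly spaced so that fixed-height strips line up with them. `--psm 7` tells tesseract to treat each strip as a single text line.

```python
from PIL import Image


def binarize(img, threshold=160):
    """Convert to grayscale, then map every pixel to pure black or white.

    threshold is an assumed starting value; tune it per image.
    """
    gray = img.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)


def split_rows(img, row_height):
    """Cut the image into horizontal strips of row_height pixels.

    Assumes evenly spaced text lines; real scans usually need proper line detection.
    """
    width, height = img.size
    return [
        img.crop((0, top, width, min(top + row_height, height)))
        for top in range(0, height, row_height)
    ]


def ocr_lines(img, row_height, lang="kor"):
    """Run tesseract on each strip as a single text line (--psm 7)."""
    import pytesseract  # imported here so the helpers above work without it

    strips = split_rows(binarize(img), row_height)
    return [
        pytesseract.image_to_string(strip, lang=lang, config="--psm 7").strip()
        for strip in strips
    ]
```

For example, `ocr_lines(Image.open('sample.jpg'), row_height=40)` would return one string per strip; the value 40 is only a placeholder and depends on the font size in the image.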

Setup for using pytesseract

Download tesseract 4.0
https://github.com/tesseract-ocr/tesseract/wiki

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install tesseract-ocr-kor

The commands above are for Ubuntu 16.
Ubuntu 18 appears to work the same way.

Downloading the kor language data
Download kor.traineddata (https://github.com/tesseract-ocr/tesseract/wiki/Data-Files) <- this one is for version 3.04.
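
Whether the kor data is actually visible to tesseract can be checked with `tesseract --list-langs`. The sketch below wraps that call; `parse_list_langs` and `installed_languages` are hypothetical helper names, and the parsing assumes the usual output layout of one header line followed by one language code per line.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    The first non-empty line is a header ("List of available languages ...");
    the remaining lines are language codes.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_languages(executable="tesseract"):
    """Ask the tesseract executable which language data files it can see."""
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some tesseract builds may print the list on stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

If `'kor' in installed_languages()` is false, `lang='kor'` will fail in pytesseract as well.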

Terminal command

tesseract -c preserve_interword_spaces=1 ../../../eng_kor_test_img.PNG stdout -l kor+eng --psm 4

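kor+eng recognizes both languages at once, but recognition accuracy drops.
--psm sets the page segmentation mode, i.e. how tesseract assumes the text is laid out (4 = a single column of text of variable sizes). Picking a mode that matches the image affects accuracy.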
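
The same command line can also be assembled and run from Python with `subprocess`. This is a minimal sketch: `build_tesseract_cmd` and `run_tesseract` are hypothetical helper names, and the defaults simply mirror the options used in the command above.

```python
import subprocess


def build_tesseract_cmd(image_path, langs=("kor", "eng"), psm=4, keep_spaces=True):
    """Build the argument list for a tesseract call that writes the text to stdout."""
    cmd = ["tesseract"]
    if keep_spaces:
        cmd += ["-c", "preserve_interword_spaces=1"]
    # Several languages are joined with '+', e.g. kor+eng
    cmd += [image_path, "stdout", "-l", "+".join(langs), "--psm", str(psm)]
    return cmd


def run_tesseract(image_path, **options):
    """Run tesseract and return the recognized text (decoded as UTF-8)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, **options),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Passing `langs=("kor",)` drops the second language, which is one way to compare accuracy against the kor+eng run.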

https://niceman.tistory.com/155