Getting PyOCR to run on Windows

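Why I wrote this article

When I tried to use pyocr, I found that the library alone is not enough: tesseract, the part that actually performs the OCR, also has to be installed. So I am keeping a record of what I did. I assume a Windows environment in which Python is already set up and working.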

The point where I stumbled: simply installing tesseract is not enough. Unless its location is added to the Path environment variable so that the tesseract command can be run, pyocr cannot find tesseract.

Unlike on Windows, on Linux or Mac the tesseract binary may be placed in a directory that is already on the path at install time, so that no additional configuration is needed. I have not verified this yet.

Steps to make `pyocr` usable

Installing the `pyocr` library

pip install pyocr
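
pip can install into a different interpreter than the one that later runs the script, so it may be worth checking from Python itself. The following is a minimal sketch using only the standard library; the helper name `is_installed` is my own choice and is not part of pyocr.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True once `pip install pyocr` has succeeded for this interpreter.
    print("pyocr installed:", is_installed("pyocr"))
```

If this prints False right after a successful install, pip and python probably belong to different environments.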

Installing `tesseract`

Getting the installer

Download tesseract-ocr-w64-setup-XXXXXXXXXXXXXX.exe from Home · UB-Mannheim/tesseract Wiki.

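Selecting the Japanese OCR data

Run the installer and step through it. Do not just click through: partway through, you have to explicitly select the Japanese OCR data so that it is included.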

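The installation path shown partway through the installer is needed later to set the Path environment variable, so copy it somewhere. In my environment it was C:\Program Files\Tesseract-OCR, the directory that also appears in the tesseract help output later in this article.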

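Adding tesseract to Path

Add the location shown in the installer to the Path environment variable. The change only takes effect after the terminal (or whatever you run commands from) is restarted. If the setting still does not seem to take effect, restarting the whole PC may help.

Once the environment variable is set, running tesseract in the command prompt should display something like the following.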

tesseract
# Usage:
#   C:\Program Files\Tesseract-OCR\tesseract.exe --help | --help-extra | --version
#   C:\Program Files\Tesseract-OCR\tesseract.exe --list-langs
#   C:\Program Files\Tesseract-OCR\tesseract.exe imagename outputbase [options...] [configfile...]
# 
# OCR options:
#   -l LANG[+LANG]        Specify language(s) used for OCR.
# NOTE: These options must occur before any configfile.
# 
# Single options:
#   --help                Show this help message.
#   --help-extra          Show extra help for advanced users.
#   --version             Show version information.
#   --list-langs          List available languages for tesseract engine.
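
The `--list-langs` option in the usage text above can also be used to confirm that the Japanese data selected in the installer was actually installed. Below is a minimal sketch that runs it from Python. The function names are my own, and the parsing assumes the output is a header line starting with "List of available languages" followed by one language code per line; other versions may format it differently.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line that precedes the language codes.
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> list[str]:
    """Run `tesseract --list-langs` (tesseract must be on Path) and parse the result."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: depending on the version, the list may go to stdout or stderr,
    # so both streams are parsed. Warnings on stderr would be picked up too.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

With this in place, `"jpn" in installed_languages()` should be True if the Japanese data was included.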

Example of what happens when the environment variable is not set correctly (PowerShell)

tesseract
# tesseract : The term 'tesseract' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of 
# the name, or if a path was included, verify that the path is correct and try again.
# At line:1 char:1
# + tesseract
# + ~~~~~~~~~
#     + CategoryInfo          : ObjectNotFound: (tesseract:String) [], CommandNotFoundException
#     + FullyQualifiedErrorId : CommandNotFoundException
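
The same check can be done from Python with `shutil.which`, and if the system-wide Path cannot be changed, the directory can be added for the current process only. The following is a sketch under assumptions: the helper name `ensure_on_path` is hypothetical, and whether pyocr picks up a Path changed this way depends on it searching PATH at call time, which is not confirmed here.

```python
import os
import shutil


def ensure_on_path(command: str, install_dir: str) -> bool:
    """Return True if `command` can be found, adding `install_dir` to PATH if needed.

    The change only affects the current Python process; the system-wide
    Path setting is left untouched.
    """
    if shutil.which(command) is not None:
        return True  # already reachable, nothing to do
    # Prepend the directory so this process (and programs it starts) can find the command.
    os.environ["PATH"] = install_dir + os.pathsep + os.environ.get("PATH", "")
    return shutil.which(command) is not None
```

For example, `ensure_on_path("tesseract", r"C:\Program Files\Tesseract-OCR")` could be called before `pyocr.get_available_tools()`. The directory here is the one from my environment and may differ on yours.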

Trying out `pyocr`

Run the code below. If it does not print No OCR tool found, the setup has presumably succeeded. The code is quoted from the initialization code shown on pyocr · PyPI.

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

In my environment the output was as follows.

python main1.py
# Will use tool 'Tesseract (sh)'
# Available languages: eng, jpn, jpn_vert, osd, script\Japanese, script\Japanese_vert
# Will use lang 'eng'
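
As the output shows, `langs[0]` turned out to be eng even though jpn is installed, so for Japanese text the language probably needs to be chosen explicitly. Below is a minimal sketch of that. `pick_language` and `ocr_image` are hypothetical helpers of my own; the `image_to_string` call follows the usage shown on pyocr · PyPI.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred language that is installed, else the first available one."""
    for lang in preferred:
        if lang in available:
            return lang
    if not available:
        raise ValueError("no OCR languages are installed")
    return available[0]


def ocr_image(image_path, lang):
    """Recognise text in an image file with the first available pyocr tool."""
    # Imported here so that pick_language() can be used without pyocr installed.
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found")
    return tools[0].image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.TextBuilder(),
    )
```

With the language list from my environment, `pick_language(langs)` returns jpn, and `ocr_image("sample.png", "jpn")` would then run the OCR, where sample.png is a placeholder for an image of your own.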

Sites I referred to

Release 5.5.1 · tesseract-ocr/tesseract
https://github.com/tesseract-ocr/tesseract/releases/tag/5.5.1 (December 2, 2025)
tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)
https://github.com/tesseract-ocr/tesseract (December 2, 2025)
Introduction | tessdoc
https://tesseract-ocr.github.io/tessdoc/Installation.html (December 2, 2025)
Home · UB-Mannheim/tesseract Wiki
https://github.com/UB-Mannheim/tesseract/wiki (December 2, 2025)
python - No tools available from pyOCR - Stack Overflow
https://stackoverflow.com/questions/31892413/no-tools-available-from-pyocr (December 2, 2025)
pyocr · PyPI
https://pypi.org/project/pyocr/ (December 2, 2025)
