Download and installation tutorial: https://www.jianshu.com/p/f7cb0b3f337a

Official site: https://github.com/tesseract-ocr/tesseract
Official documentation: https://github.com/tesseract-ocr/tessdoc
Language packs: https://github.com/tesseract-ocr/tessdata
Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20211201.zip

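Installation: to recognize Chinese, the Chinese language pack must be downloaded separately
Environment variables: add the installation directories `C:\Program Files\Tesseract-OCR` and `C:\Program Files\Tesseract-OCR\tessdata`
Verify the configuration: open cmd, type `tesseract` and press Enter; if the usage information is printed, the setup works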
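
The same check can be scripted. A minimal sketch, assuming `tesseract` should be reachable through `PATH` and that language data is stored as `<lang>.traineddata` files in the `tessdata` folder; the helper names `find_tesseract` and `has_language` are illustrative and not part of any library:

```python
import os
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def has_language(tessdata_dir, lang):
    """Check whether <lang>.traineddata exists in the given tessdata directory."""
    return os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))


if __name__ == "__main__":
    exe = find_tesseract()
    print("tesseract executable:", exe or "not found on PATH")
    # Directory taken from the install path in this note; adjust if yours differs.
    tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
    # chi_sim is the name of the Simplified Chinese language pack.
    print("chi_sim installed:", has_language(tessdata, "chi_sim"))
```

If `find_tesseract()` returns `None`, the install directory is probably missing from `PATH`.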

Method 1: recognition from cmd

Open cmd
Run: `tesseract <image path> <output name>` (Tesseract writes the result to `<output name>.txt`)
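
The command-line call can also be driven from a script. A rough sketch, assuming `tesseract` is on `PATH`; `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and `test.png` / `result` are placeholders:

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None, psm=None):
    """Build the argument list for: tesseract <image> <output base> [-l lang] [--psm N]."""
    cmd = ["tesseract", image_path, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_tesseract(image_path, output_base, lang=None, psm=None):
    """Run tesseract; the recognized text is written to <output_base>.txt."""
    cmd = build_tesseract_command(image_path, output_base, lang, psm)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return output_base + ".txt"


if __name__ == "__main__":
    # prints: tesseract test.png result -l eng --psm 7
    print(" ".join(build_tesseract_command("test.png", "result", lang="eng", psm=7)))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as `C:\Program Files\...`.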

Method 2: recognition from Python

Install the libraries: pytesseract, pillow
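Example script:
```python
import pytesseract
from PIL import Image

img_path = 'test.png'
im = Image.open(img_path)

# Recognize the text.
# The lang and config arguments are needed when recognizing digits and letters;
# --psm 7 treats the image as a single line of text.
string = pytesseract.image_to_string(im, lang="eng", config="--psm 7")
print(string)
```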
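
For captcha-style images that contain only digits and letters, the `config` string can be assembled by a small helper. This is a sketch under assumptions: `build_config` is a hypothetical helper, and whether the `tessedit_char_whitelist` variable is honored depends on the Tesseract version and engine, so verify it on your installation:

```python
import string


def build_config(psm=7, whitelist=None):
    """Build a config string to pass to pytesseract.image_to_string."""
    parts = ["--psm {}".format(psm)]
    if whitelist:
        # tessedit_char_whitelist limits the characters Tesseract may output.
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


# Digits plus upper- and lower-case ASCII letters.
ALNUM = string.digits + string.ascii_letters

if __name__ == "__main__":
    config = build_config(psm=7, whitelist=ALNUM)
    print(config)
    # Usage with pytesseract (requires Tesseract, pytesseract and pillow):
    # import pytesseract
    # from PIL import Image
    # text = pytesseract.image_to_string(Image.open("test.png"), lang="eng", config=config)
```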

If recognition raises an error, do the following:<br />![image.png](https://cdn.nlark.com/yuque/0/2021/png/2981571/1640793631459-df6f6186-37f8-4533-870f-edbdbdc920d3.png)<br />Download the Chinese language pack (trained data)
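
The text does not describe what the screenshot shows, so the following is only a guess at one common cause of the error: pytesseract raising `TesseractNotFoundError` because it cannot find the executable. A minimal sketch, assuming the install directory used in this note; `resolve_tesseract_cmd` and `configure_pytesseract` are illustrative names:

```python
import shutil

# Assumed default location, based on the install directory used in this note.
DEFAULT_TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(default=DEFAULT_TESSERACT_EXE):
    """Prefer the executable found on PATH; otherwise fall back to the default path."""
    return shutil.which("tesseract") or default


def configure_pytesseract():
    """Point pytesseract at the resolved executable and return the path used."""
    import pytesseract  # imported lazily so the helper above works without it

    path = resolve_tesseract_cmd()
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Call `configure_pytesseract()` once before the first `image_to_string` call.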
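# Application 1: recognizing PDFs
**How it works:**
1. Use convert_from_path from pdf2image to convert the PDF pages into PPM images
1. Use numpy.array to convert each PPM image into a three-dimensional array
1. Use pytesseract.image_to_string to recognize the text in the image array
1. Output the text and proofread it; software such as Word can help with spell checking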
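```python
import numpy as np
import pytesseract
from pdf2image import convert_from_path


def pdf_ocr(fname, **kwargs):
    """OCR every page of a PDF and return the concatenated text."""
    # kwargs are passed through to convert_from_path (e.g. first_page, last_page)
    images = convert_from_path(fname, **kwargs)
    text = ''
    for img in images:
        # Convert the page image to a 3-D array (height x width x channels)
        img = np.array(img)
        text += pytesseract.image_to_string(img)
    return text


fname = 'example.pdf'
# text = pdf_ocr(fname, first_page=7, last_page=8)  # OCR only pages 7 to 8
text = pdf_ocr(fname)
print(text)
```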

