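Recognizing text in PDF files with pytesseract (OCR)

Third-party libraries: pdf2image, pytesseract, numpy. The Tesseract installation directory must be added to the PATH environment variable.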
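If editing PATH is not convenient, pytesseract also lets you point it at the executable through `pytesseract.pytesseract.tesseract_cmd`. Below is a minimal sketch of that idea. The helper names `resolve_executable` and `configure_pytesseract` are made up for illustration, and the fallback path in the usage note is only an example of where Tesseract might live on a Windows machine.

```python
import shutil


def resolve_executable(name, fallback=None):
    """Return the full path of `name` if it is on PATH, otherwise `fallback`."""
    found = shutil.which(name)
    return found if found is not None else fallback


def configure_pytesseract(fallback=None):
    """Tell pytesseract where the tesseract binary is (hypothetical helper)."""
    import pytesseract  # imported lazily so the lookup above works without it

    cmd = resolve_executable("tesseract", fallback)
    if cmd is None:
        raise RuntimeError("tesseract not found on PATH and no fallback given")
    # tesseract_cmd is the attribute pytesseract reads to locate the binary
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Usage would look something like `configure_pytesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")`, adjusted to wherever Tesseract is actually installed.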

First attempt

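import numpy as np
import pytesseract
from pdf2image import convert_from_path


def pdf_ocr(fname, **kwargs):
    # Render each PDF page to a PIL image; kwargs are passed to convert_from_path
    images = convert_from_path(fname, **kwargs)
    text = ''
    for img in images:
        # Convert the PIL image to a NumPy array before handing it to Tesseract
        img = np.array(img)
        text += pytesseract.image_to_string(img)
    return text


fname = 'example.pdf'
# To OCR only a range of pages:
# text = pdf_ocr(fname, first_page=7, last_page=8)
text = pdf_ocr(fname)
print(text)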
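The function above glues all pages into one string. A small variation, sketched here under a few assumptions, keeps the text per page and forwards extra keyword arguments such as `lang` to `pytesseract.image_to_string`. The names `ocr_pages` and `join_pages` are hypothetical, and the `ocr` parameter exists only so the OCR call can be swapped out (for example in a quick test).

```python
def ocr_pages(images, ocr=None, **ocr_kwargs):
    """Run OCR on each image and return a list with one string per page."""
    if ocr is None:
        import pytesseract  # imported lazily; only needed for the real OCR call
        ocr = pytesseract.image_to_string
    return [ocr(img, **ocr_kwargs) for img in images]


def join_pages(pages):
    """Join per-page text with a simple page marker in front of each page."""
    return "\n".join(
        f"--- page {i} ---\n{text.strip()}"
        for i, text in enumerate(pages, start=1)
    )
```

With pdf2image this could be used as `pages = ocr_pages(convert_from_path('example.pdf', dpi=300), lang='chi_sim')` followed by `print(join_pages(pages))`. The `chi_sim` language only works if the corresponding Tesseract language data is installed; leave `lang` out to use the default.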

Possible errors

Error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Fix: install poppler, for example with **conda install -c conda-forge poppler**
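To see whether poppler is visible before calling pdf2image, one option is to look for its command-line tools on PATH. The sketch below checks for `pdfinfo` and `pdftoppm`, which pdf2image calls under the hood. The helper name `missing_poppler_tools` is hypothetical, and the `which` parameter is there only so the lookup can be replaced when testing.

```python
import shutil

# Poppler command-line tools that pdf2image relies on
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(which=shutil.which):
    """Return the poppler tools that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("poppler tools found")
```

If poppler is installed but not on PATH, `convert_from_path` also accepts a `poppler_path` argument pointing at the directory that contains these binaries.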
