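Recognizing text in PDF files with pytesseract (OCR)
Third-party libraries: pdf2image, pytesseract, numpy
The Tesseract installation directory must be added to the PATH environment variable.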
First attempt
import numpy as np
import pytesseract
from pdf2image import convert_from_path

def pdf_ocr(fname, **kwargs):
    # Render each PDF page to an image; kwargs are passed on to convert_from_path
    images = convert_from_path(fname, **kwargs)
    text = ''
    for img in images:
        img = np.array(img)  # convert the PIL image to a NumPy array
        text += pytesseract.image_to_string(img)
    return text

fname = 'example.pdf'
# text = pdf_ocr(fname, first_page=7, last_page=8)  # OCR only pages 7 to 8
text = pdf_ocr(fname)
print(text)
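The function above concatenates all pages into one string, which makes it hard to tell which page a piece of text came from. A minimal sketch of one way to keep page numbers is shown below. The helper names `ocr_pages` and `format_pages` are hypothetical (not part of pytesseract or pdf2image), and the recognizer is passed in as an argument so the helpers do not depend on any particular OCR library.

```python
def ocr_pages(images, recognize, first_page=1):
    """Run `recognize` on each page image and return [(page_number, text), ...].

    `recognize` is any callable that maps one image to a string,
    for example pytesseract.image_to_string (assumed to be installed).
    """
    return [(first_page + i, recognize(img)) for i, img in enumerate(images)]


def format_pages(pages):
    """Join per-page text, putting a simple page header before each page."""
    return '\n'.join(f'--- page {n} ---\n{text.strip()}' for n, text in pages)


# Possible usage, assuming pdf2image and pytesseract are installed:
# from pdf2image import convert_from_path
# import pytesseract
# images = convert_from_path('example.pdf', first_page=7, last_page=8)
# print(format_pages(ocr_pages(images, pytesseract.image_to_string, first_page=7)))
```

Passing `first_page` to both `convert_from_path` and `ocr_pages` keeps the printed page numbers in line with the PDF when only a page range is converted.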
Possible errors
Error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Fix: **conda install -c conda-forge poppler**
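Both external dependencies (Tesseract and poppler) are found through PATH, so it can help to check for them before running the OCR. The sketch below uses only the standard library. The function name `find_missing_tools` is hypothetical, and the default tool list assumes the usual executable names: `tesseract` for Tesseract, and `pdfinfo` and `pdftoppm` for poppler.

```python
import shutil


def find_missing_tools(tools=("tesseract", "pdfinfo", "pdftoppm")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

If `pdfinfo` is reported missing, pdf2image will most likely raise the PDFInfoNotInstalledError shown above. If poppler is installed but cannot be added to PATH, `convert_from_path` also accepts a `poppler_path` argument pointing at the directory that contains the poppler executables.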