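Recognizing text in PDF files with pytesseract (OCR)
Third-party libraries: pdf2image, pytesseract, numpy
The Tesseract installation directory must be added to the PATH environment variable.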
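If editing PATH is inconvenient, pytesseract can also be pointed at the executable through its `pytesseract.pytesseract.tesseract_cmd` attribute. The following is a minimal sketch of that idea, not part of the original recipe: `find_tesseract`, `configure_pytesseract`, and the `fallback` argument are hypothetical names introduced here for illustration, and the fallback path is whatever location Tesseract was installed to on your machine.

```python
import shutil


def find_tesseract(fallback=None):
    """Return the path to the tesseract executable, or None if it is not found.

    `fallback` is a hypothetical, user-supplied path that is used only
    when no `tesseract` executable is visible on PATH.
    """
    return shutil.which("tesseract") or fallback


def configure_pytesseract(path):
    """Tell pytesseract which executable to run (alternative to editing PATH)."""
    import pytesseract  # imported lazily so the lookup above works without it

    pytesseract.pytesseract.tesseract_cmd = path
```

Calling `configure_pytesseract(find_tesseract(fallback=...))` once, before any OCR call, should have the same effect as adding the installation directory to PATH.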
First attempt

    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path


    def pdf_ocr(fname, **kwargs):
        # Render each PDF page to an image; kwargs are passed on to convert_from_path
        images = convert_from_path(fname, **kwargs)
        text = ''
        for img in images:
            img = np.array(img)
            # Run Tesseract on the page and append the recognized text
            text += pytesseract.image_to_string(img)
        return text


    fname = 'example.pdf'
    # text = pdf_ocr(fname, first_page=7, last_page=8)
    text = pdf_ocr(fname)
    print(text)

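The function above concatenates all pages into one string, which makes it hard to tell where a page ends. A possible variant keeps the text per page instead. This is only a sketch under stated assumptions: `ocr_pages` and its `ocr` and `first_page` parameters are hypothetical names, and the `ocr` argument exists so that the function can be tried without Tesseract installed. By default it falls back to `pytesseract.image_to_string`, which also accepts PIL images directly, so the `np.array` conversion is optional.

```python
def ocr_pages(images, ocr=None, first_page=1):
    """Return a dict mapping page number to the text recognized on that page.

    images     -- page images, e.g. the list returned by convert_from_path
    ocr        -- callable turning one image into text (hypothetical hook);
                  defaults to pytesseract.image_to_string
    first_page -- page number of the first image, matching the value
                  passed to convert_from_path
    """
    if ocr is None:
        import pytesseract  # imported lazily, only when real OCR is needed

        ocr = pytesseract.image_to_string
    return {first_page + i: ocr(img) for i, img in enumerate(images)}
```

With the original example this could be used as `ocr_pages(convert_from_path(fname, first_page=7, last_page=8), first_page=7)`, which would give a dict with the keys 7 and 8.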
Possible errors
Error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
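Fix: pdf2image relies on poppler's command-line tools, so poppler must be installed and on PATH. With conda: **conda install -c conda-forge poppler**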
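To confirm that the installation worked before running the OCR script, you can check whether poppler's tools are visible on PATH. The sketch below assumes that `pdfinfo` and `pdftoppm` are the executables pdf2image needs; `POPPLER_TOOLS` and `missing_poppler_tools` are hypothetical names used only for this illustration.

```python
import shutil

# Poppler command-line tools that pdf2image is assumed to call
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools():
    """Return the poppler tools that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("poppler tools found")
```

If poppler is installed but not on PATH, `convert_from_path` also accepts a `poppler_path` argument pointing at the directory that contains these executables.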
