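Recognizing text in PDF files with pytesseract (OCR)

Third-party libraries: pdf2image, pytesseract, numpy. The Tesseract installation directory must be added to the PATH environment variable.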

First attempt

    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path


    def pdf_ocr(fname, **kwargs):
        # Render each PDF page to a PIL image; kwargs go straight to convert_from_path
        images = convert_from_path(fname, **kwargs)
        text = ''
        for img in images:
            img = np.array(img)  # pytesseract accepts NumPy arrays as well as PIL images
            text += pytesseract.image_to_string(img)
        return text


    fname = 'example.pdf'
    # text = pdf_ocr(fname, first_page=7, last_page=8)  # OCR only pages 7 and 8
    text = pdf_ocr(fname)
    print(text)
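Because `convert_from_path` renders every requested page into memory before OCR starts, a long PDF may be easier to handle in page batches via the `first_page` and `last_page` arguments already used in the commented-out call. The following is a minimal sketch of that idea; the helper name `page_batches`, the page count of 20 and the batch size of 5 are assumptions made for illustration, not part of the original script.

```python
def page_batches(n_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering n_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, n_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, n_pages)


# Hypothetical usage with the pdf_ocr function from this post
# (20 is an assumed page count, 5 an assumed batch size):
# text = ''.join(
#     pdf_ocr('example.pdf', first_page=first, last_page=last)
#     for first, last in page_batches(20, 5)
# )
```

If the page count is not known in advance, pdf2image's `pdfinfo_from_path` should be able to report it (it also relies on Poppler being installed).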

Possible errors

Error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Fix: pdf2image relies on the Poppler command-line tools, so install Poppler, for example with **conda install -c conda-forge poppler**
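Both external requirements (Tesseract on PATH, Poppler installed) can be checked before running the OCR script. The sketch below uses only the standard library; the helper name `missing_tools` is hypothetical, and the executable names `tesseract`, `pdfinfo` and `pdftoppm` are the usual ones for Tesseract and Poppler, though your installation may differ.

```python
import shutil


def missing_tools(tools=("tesseract", "pdfinfo", "pdftoppm")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    # Anything listed here needs to be installed or added to PATH
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler executables were found on PATH.")
```

If `pdfinfo` or `pdftoppm` is reported missing, the Poppler install above is the likely remedy; if `tesseract` is missing, revisit the PATH setting mentioned at the top.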