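Background
A new product feature needs OCR, and after some evaluation we chose Tess4j. Because our server runs an old OS release, I could not find many deployment guides online, so I am recording the process here in the hope that it saves you some trouble.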
CentOS deployment
CentOS: 6.10
Steps (mainly following the referenced installation guide):
# Install the development tools and the Tesseract prerequisites
yum -y groupinstall "development tools"
yum -y install libpng-devel libtiff-devel libjpeg-devel
# Install the CentOS Software Collections yum repository and a newer GCC
yum -y install centos-release-scl
yum -y install devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-binutils
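# scl enable starts a new shell that uses the newer gcc; this is temporary, and exiting that shell or rebooting restores the system version
scl enable devtoolset-7 bash
# Alternatively, source the enable script to switch the current shell; add this line to ~/.bash_profile to keep it enabled across logins
source /opt/rh/devtoolset-7/enable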
# Install autoconf; use as recent a version as possible (the build asks for autoconf v2.64 or later)
cd /usr/src/
wget ftp://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz
tar xvvfz autoconf-2.69.tar.gz
cd autoconf-2.69/
./configure --prefix=/usr
make
make install
# Install autoconf-archive
cd /usr/src/
wget http://ftpmirror.gnu.org/autoconf-archive/autoconf-archive-2019.01.06.tar.xz
tar xvvfJ autoconf-archive-2019.01.06.tar.xz
cd autoconf-archive-2019.01.06/
./configure --prefix=/usr
make
make install
# Install Leptonica; tesseract v4.0.0 requires Leptonica v1.77 or later
cd /usr/src/
wget http://leptonica.org/source/leptonica-1.77.0.tar.gz
tar xvvfz leptonica-1.77.0.tar.gz
cd leptonica-1.77.0/
./configure --prefix=/usr/local/
make
make install
# The dependencies are in place; build and install tesseract (4.1.1, the latest release at the time of writing)
cd /usr/src/
wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1.tar.gz -O tesseract-4.1.1.tar.gz
tar xvvfz tesseract-4.1.1.tar.gz
cd tesseract-4.1.1
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local/ --with-extra-libraries=/usr/local/lib/ --disable-openmp
make install
# Verify the installation
tesseract --version
# tesseract 4.1.1
# leptonica-1.77.0
# libjpeg 6b (libjpeg-turbo 1.2.1) : libpng 1.2.49 : libtiff 3.9.4 : zlib 1.2.3
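Since the build takes a while, it can help to confirm up front that the installed tools meet the minimum versions mentioned in the comments above (autoconf v2.64, Leptonica v1.77). The following is a minimal sketch and not part of the original guide: `version_ge` is a hypothetical helper, and it assumes a `sort` that supports `-V` (GNU coreutils version sort).

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU sort with -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare the autoconf found on PATH with the 2.64 minimum.
if command -v autoconf >/dev/null 2>&1; then
    # The first line of the output ends with the version number.
    autoconf_version=$(autoconf --version | head -n 1 | awk '{print $NF}')
    if version_ge "$autoconf_version" 2.64; then
        echo "autoconf $autoconf_version is new enough"
    else
        echo "autoconf $autoconf_version is older than 2.64"
    fi
fi
```

The same helper could be applied to the Leptonica version reported by `tesseract --version` once the build has finished.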
Common problems:
- Command not found
tesseract: command not found
Fix: add /usr/local/bin to $PATH by editing ~/.bash_profile and adding export PATH="$PATH:/usr/local/bin"
- The native library cannot be loaded
Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract)
Fix: copy the tesseract and leptonica shared libraries (.so files) from /usr/local/lib to /usr/lib
- Missing environment variable
!strcmp(locale, "C"):Error:Assert failed:in file ../../../src/api/baseapi.cpp
Fix: export LC_ALL=C in the environment that launches the application
- Writing JPEG fails: OpenJDK does not have a native JPEG encoder
javax.imageio.IIOException: Invalid argument to native writeImage
Fix: write from an image type without an alpha channel, for example new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR);
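The first three fixes in the list above are environment changes, so they can be scripted. The sketch below is one hedged way to do that. The helper names (`append_path_entry`, `register_lib_dir`, `run_with_c_locale`) are hypothetical and do not come from the original notes. Registering /usr/local/lib with the dynamic linker is shown as an assumed alternative to copying the .so files into /usr/lib, and it has not been verified on the author's setup.

```shell
# Hypothetical helper: print PATH-like string $1 with directory $2 appended,
# unless $2 is already one of its entries.
append_path_entry() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "$1:$2" ;;
    esac
}

# Hypothetical helper: record library directory $1 in linker config file $2
# (for example a file under /etc/ld.so.conf.d/) if it is not listed yet.
# Run ldconfig as root afterwards so the linker cache is rebuilt.
register_lib_dir() {
    if ! grep -qxF "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}

# Hypothetical helper: run a command with LC_ALL=C set for that process only.
run_with_c_locale() {
    env LC_ALL=C "$@"
}

# Example usage (app.jar and the .conf file name are placeholders):
#   PATH=$(append_path_entry "$PATH" /usr/local/bin); export PATH
#   register_lib_dir /usr/local/lib /etc/ld.so.conf.d/usr-local-lib.conf && ldconfig
#   run_with_c_locale java -jar app.jar
```

Setting LC_ALL only for the Java process keeps the rest of the login session on its usual locale, which may matter if other tools on the server depend on it.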
CentOS: 7
Steps:
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract
yum install tesseract-langpack-deu
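After the packages are installed, a quick check that the language pack is visible to tesseract can save a debugging round later. This is a minimal sketch, assuming the installed build supports `tesseract --list-langs` and prints one language code per line. `lang_listed` is a hypothetical helper name.

```shell
# Hypothetical helper: read a language listing on stdin and succeed
# when language code $1 appears on a line of its own.
lang_listed() {
    grep -qxF "$1"
}

# Example: look for the German language data (deu) installed by the last command.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | lang_listed deu; then
        echo "deu language data found"
    else
        echo "deu language data missing"
    fi
fi
```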
References:
- Installing Tesseract OCR 4.0 on CentOS 6
- Common problems when installing tesseract on Linux and deploying a tess4j project
- !strcmp(locale, "C"):Error:Assert failed:in file ../../../src/api/baseapi.cpp, line 191 #105