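1. Common ways to configure a hot-word dictionary
1. IK's built-in dictionary
Pros: easy to deploy; no extra dictionary location has to be configured.
Cons: segmentation is fixed; you cannot add the terms you want recognized.
2. IK external static dictionary
Pros: fairly easy to deploy; you get the terms you want by editing the configured dictionary file.
Cons: requires an external static file; every change means hand-editing the whole dictionary file, placing it in the configured directory, and restarting ES before it takes effect.
3. IK remote dictionary
Pros: the dictionary is served from a static file server that IK is pointed at.
Cons: entries still have to be added by hand-editing the whole dictionary file. The IK source decides whether to reload by checking the Last-Modified and ETag response headers, and the update sometimes does not take effect.
Given these trade-offs, I decided to use MySQL as the external hot-word dictionary and to refresh the hot words and stop words on a schedule.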
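The header check mentioned for the remote dictionary can be pictured roughly as follows. This is a minimal sketch, not IK's actual code: the class name RemoteDictChangeCheck and the method hasChanged are made up for illustration, and the sketch only models comparing the Last-Modified and ETag values of two successive polls. It also suggests one way an update can be missed: if the file server changes neither header when the file content changes, no reload is triggered.

```java
import java.util.Objects;

/** Illustrative only: models a Last-Modified / ETag comparison between two polls. */
public class RemoteDictChangeCheck {
    // Header values seen on the previous poll; null before the first poll.
    private String lastModified;
    private String eTag;

    /** Returns true when either header differs from the previous poll, then remembers the new values. */
    public boolean hasChanged(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        lastModified = newLastModified;
        eTag = newETag;
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictChangeCheck check = new RemoteDictChangeCheck();
        // First poll: nothing has been seen before.
        System.out.println(check.hasChanged("Mon, 04 Mar 2019 08:00:00 GMT", "\"v1\""));
        // Second poll: the server reports the same headers.
        System.out.println(check.hasChanged("Mon, 04 Mar 2019 08:00:00 GMT", "\"v1\""));
        // Third poll: the file was replaced and both headers moved.
        System.out.println(check.hasChanged("Mon, 04 Mar 2019 09:00:00 GMT", "\"v2\""));
    }
}
```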
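2. Preparation
1. Download the IK analyzer release that matches your ElasticSearch version
https://github.com/medcl/elasticsearch-analysis-ik
2. Look at the files in its config folder:

My local ES installation is version 6.6.2, so I downloaded the matching IK 6.6.2 release.
3. Walk through the IKAnalyzer.cfg.xml configuration file: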

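ext_dict: location of the extension hot-word dictionaries; separate multiple files with semicolons
ext_stopwords: location of the extension stop-word dictionaries; separate multiple files with semicolons
remote_ext_dict: location of the remote extension hot-word dictionary, e.g. https://xxx.xxx.xxx.xxx/ext_hot.txt
remote_ext_stopwords: location of the remote extension stop-word dictionary, e.g. https://xxx.xxx.xxx.xxx/ext_stop.txt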
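To make the semicolon-separated entries concrete, here is a small sketch that reads such a file and splits ext_dict into individual paths. It assumes IKAnalyzer.cfg.xml uses the standard Java XML-properties layout; the class IkCfgSketch, the sample file content, and the dictionary paths in it are all made up for illustration and are not taken from the plugin.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class IkCfgSketch {
    // Hypothetical file content: the keys are the ones described above, the paths are invented.
    public static final String SAMPLE_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>IK Analyzer extension settings</comment>\n"
        + "  <entry key=\"ext_dict\">custom/hot_words.dic;custom/brand_names.dic</entry>\n"
        + "  <entry key=\"ext_stopwords\">custom/ext_stop.dic</entry>\n"
        + "</properties>\n";

    /** Parses XML-properties text into a Properties object. */
    public static Properties parse(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits a semicolon-separated entry into trimmed, non-empty paths; a missing key gives an empty list. */
    public static List<String> paths(Properties props, String key) {
        List<String> result = new ArrayList<>();
        String value = props.getProperty(key);
        if (value != null) {
            for (String part : value.split(";")) {
                if (!part.trim().isEmpty()) {
                    result.add(part.trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE_XML);
        System.out.println("ext_dict: " + paths(props, "ext_dict"));
        System.out.println("ext_stopwords: " + paths(props, "ext_stopwords"));
        System.out.println("remote_ext_dict: " + paths(props, "remote_ext_dict"));
    }
}
```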
4. The Dictionary class
The singleton initializer in Dictionary is public static synchronized Dictionary initial(Configuration cfg):
public static synchronized Dictionary initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {
                singleton = new Dictionary(cfg);
                singleton.loadMainDict();
                singleton.loadSurnameDict();
                singleton.loadQuantifierDict();
                singleton.loadSuffixDict();
                singleton.loadPrepDict();
                singleton.loadStopWordDict();

                if (cfg.isEnableRemoteDict()) {
                    // Start the monitor threads
                    for (String location : singleton.getRemoteExtDictionarys()) {
                        // 10 is the initial delay (adjustable), 60 is the interval; both in seconds
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                    for (String location : singleton.getRemoteExtStopWordDictionarys()) {
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                }
                return singleton;
            }
        }
    }
    return singleton;
}
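The load* methods called in initial use the other text files under config to initialize the member fields declared in Dictionary:
_MainDict: the main dictionary object, which is also where hot words are stored
_SurnameDict: surname dictionary
_QuantifierDict: quantifier (measure word) dictionary, e.g. 个 in 1个 or 两 in 两种
_SuffixDict: suffix dictionary
_PrepDict: adverb/preposition dictionary
_StopWords: stop-word dictionary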
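Each of these fields is a DictSegment, which is a character-tree (trie) style structure that words are added to with fillSegment. The sketch below is a deliberately simplified stand-in, not IK's implementation: the class TrieDictSketch and its contains method are hypothetical, and only the idea of filling a word character by character is borrowed from the real class.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a dictionary trie; not IK's DictSegment. */
public class TrieDictSketch {
    private final Map<Character, TrieDictSketch> children = new HashMap<>();
    private boolean wordEnd; // true when the path from the root to this node spells a complete word

    /** Adds one word by walking (and creating) one node per character. */
    public void fillSegment(char[] word) {
        TrieDictSketch node = this;
        for (char c : word) {
            node = node.children.computeIfAbsent(c, k -> new TrieDictSketch());
        }
        node.wordEnd = true;
    }

    /** Returns true only if the exact word was added earlier. */
    public boolean contains(String text) {
        TrieDictSketch node = this;
        for (char c : text.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.wordEnd;
    }

    public static void main(String[] args) {
        TrieDictSketch mainDict = new TrieDictSketch();
        mainDict.fillSegment("王者荣耀".trim().toCharArray());
        System.out.println(mainDict.contains("王者荣耀")); // the word that was added
        System.out.println(mainDict.contains("王者"));     // only a prefix of it
    }
}
```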
3. Modifying the Dictionary source
1. Dictionary class: update the dictionary with this.loadMySQLExtDict()
// Needs java.sql.Connection, DriverManager, ResultSet and Statement imported in Dictionary.java

/**
 * Load the main dictionary and the extension dictionaries.
 */
private void loadMainDict() {
    // Create the main dictionary instance
    _MainDict = new DictSegment((char) 0);

    // Read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
    loadDictFile(_MainDict, file, false, "Main Dict");
    // Load the extension dictionaries
    this.loadExtDict();
    // Load the remote custom dictionaries
    this.loadRemoteExtDict();
    // 2. Load the dictionary from MySQL
    this.loadMySQLExtDict();
}

private static Properties prop = new Properties();

static {
    try {
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

/**
 * Load the hot-reload dictionary from MySQL.
 *
 * @author 石鹏程
 * @created 2019-03-03
 */
private void loadMySQLExtDict() {
    Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
    // try-with-resources closes the stream and the JDBC objects even when loading fails
    try (InputStream in = new FileInputStream(file.toFile())) {
        prop.load(in);

        /*
        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }
        logger.info("[==========]query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");
        */

        try (Connection conn = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"))) {
            int i = 0;
            while (rs.next()) {
                String theWord = rs.getString("word");
                // logger.info("[==========]hot word from mysql: " + theWord);
                _MainDict.fillSegment(theWord.trim().toCharArray());
                i++;
            }
            logger.info("[==========] hot words loaded: " + i);
        }
    } catch (Exception e) {
        logger.error("error", e);
    }
}
2. Dictionary class: update the stop words with this.loadMySQLStopwordDict()
/**
 * Load the user-defined extension stop-word dictionaries.
 */
private void loadStopWordDict() {
    // Create the stop-word dictionary instance
    _StopWords = new DictSegment((char) 0);

    // Read the main stop-word file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);
    loadDictFile(_StopWords, file, false, "Main Stopwords");

    // Load the extension stop-word dictionaries
    List<String> extStopWordDictFiles = getExtStopWordDictionarys();
    if (extStopWordDictFiles != null) {
        for (String extStopWordDictName : extStopWordDictFiles) {
            logger.info("[Dict Loading] " + extStopWordDictName);
            // Read the extension dictionary file
            file = PathUtils.get(extStopWordDictName);
            loadDictFile(_StopWords, file, false, "Extra Stopwords");
        }
    }

    // Load the remote stop-word dictionaries
    List<String> remoteExtStopWordDictFiles = getRemoteExtStopWordDictionarys();
    for (String location : remoteExtStopWordDictFiles) {
        logger.info("[Dict Loading] " + location);
        List<String> lists = getRemoteWords(location);
        // Skip this location if its dictionary cannot be found
        if (lists == null) {
            logger.error("[Dict Loading] " + location + " failed to load");
            continue;
        }
        for (String theWord : lists) {
            if (theWord != null && !"".equals(theWord.trim())) {
                // Load the remote words into memory
                logger.info(theWord);
                _StopWords.fillSegment(theWord.trim().toLowerCase().toCharArray());
            }
        }
    }

    // 3. Load the custom stop words
    this.loadMySQLStopwordDict();
}

/**
 * Load the stop words from MySQL.
 */
private void loadMySQLStopwordDict() {
    Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
    // try-with-resources closes the stream and the JDBC objects even when loading fails
    try (InputStream in = new FileInputStream(file.toFile())) {
        prop.load(in);
        try (Connection conn = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"))) {
            int i = 0;
            while (rs.next()) {
                String theWord = rs.getString("word");
                // logger.info("[==========]hot stopword from mysql: " + theWord);
                _StopWords.fillSegment(theWord.trim().toCharArray());
                i++;
            }
            logger.info("[==========] stop words loaded: " + i);
        }
    } catch (Exception e) {
        logger.error("error", e);
    }
}
3. Method exposed to callers:
public void reLoadMainDict() {
    Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
    try (InputStream in = new FileInputStream(file.toFile())) {
        prop.load(in);
        String enable = prop.getProperty("ik.mysql.enable");
        logger.info("ik.mysql.enable:" + enable);
        // Constant-first equals also handles a missing key (null) without throwing
        if ("yes".equals(enable)) {
            // logger.info("Reloading dictionaries...");
            // Load into a fresh instance to limit the impact on the dictionary currently in use
            Dictionary tmpDict = new Dictionary(configuration);
            tmpDict.configuration = getSingleton().configuration;
            tmpDict.loadMainDict();
            tmpDict.loadStopWordDict();
            _MainDict = tmpDict._MainDict;
            _StopWords = tmpDict._StopWords;
            // logger.info("Dictionaries reloaded...");
            // logger.info("Queued tasks: " + ((ThreadPoolExecutor) pool).getQueue().size());
            // logger.info("Active threads: " + ((ThreadPoolExecutor) pool).getActiveCount());
            // logger.info("Completed tasks: " + ((ThreadPoolExecutor) pool).getCompletedTaskCount());
            // logger.info("Total tasks (queued + active + completed): " + ((ThreadPoolExecutor) pool).getTaskCount());
        } else {
            logger.info("IK MySQL hot reload is disabled......");
        }
    } catch (IOException e) {
        logger.error("IK analyzer: failed to parse jdbc-reload.properties", e);
    }
}
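The methods above read six keys from jdbc-reload.properties, which is looked up under the directory returned by getDictRoot(). The sketch below shows what such a file might contain and how the keys are read back. Only the key names come from the code above; the class JdbcReloadPropsSketch, the database name, the account, and the table and column names are placeholders. Note that both queries have to return a column named word, because the loaders call rs.getString("word").

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class JdbcReloadPropsSketch {
    // Hypothetical file content; every value here is a placeholder.
    public static final String SAMPLE =
          "jdbc.url=jdbc:mysql://127.0.0.1:3306/ik_dict\n"
        + "jdbc.user=ik_reader\n"
        + "jdbc.password=change-me\n"
        + "jdbc.reload.sql=select word from hot_words\n"
        + "jdbc.reload.stopword.sql=select stopword as word from hot_stopwords\n"
        + "ik.mysql.enable=yes\n";

    /** Parses properties text the same way Properties.load reads the file. */
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Mirrors the switch checked in reLoadMainDict: only the literal value "yes" enables reloading. */
    public static boolean isEnabled(Properties props) {
        return "yes".equals(props.getProperty("ik.mysql.enable"));
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println("url: " + props.getProperty("jdbc.url"));
        System.out.println("hot-word query: " + props.getProperty("jdbc.reload.sql"));
        System.out.println("stop-word query: " + props.getProperty("jdbc.reload.stopword.sql"));
        System.out.println("enabled: " + isEnabled(props));
    }
}
```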
4. HotDictReloadThread, a Runnable implementation that calls reLoadMainDict to load the hot words

Finally, the code that triggers it on a schedule:
Some of the details are left out here.
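Since the original listings for this step are not reproduced here, the following is one possible shape for them, offered as a sketch rather than the author's code. Only the name HotDictReloadThread comes from the text; the wrapper class HotDictReloadSketch, the reloadAction parameter, and the timing values are assumptions. In the plugin the action would presumably be a call such as Dictionary.getSingleton().reLoadMainDict(); here a counter stands in for it so the example runs on its own.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class HotDictReloadSketch {

    /** Runs one reload per invocation and never lets a runtime exception escape. */
    public static class HotDictReloadThread implements Runnable {
        private final Runnable reloadAction; // hypothetical hook standing in for reLoadMainDict()

        public HotDictReloadThread(Runnable reloadAction) {
            this.reloadAction = reloadAction;
        }

        @Override
        public void run() {
            try {
                reloadAction.run();
            } catch (RuntimeException e) {
                // scheduleAtFixedRate stops scheduling a task once a run throws, so report and continue
                System.err.println("hot dictionary reload failed: " + e);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger reloads = new AtomicInteger();
        Runnable reloadAction = reloads::incrementAndGet;

        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        // Short interval for the demo only; a real deployment would use seconds or minutes
        pool.scheduleAtFixedRate(new HotDictReloadThread(reloadAction), 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(350);
        pool.shutdownNow();
        System.out.println("reloads so far: " + reloads.get());
    }
}
```

Catching the exception inside run matters because a single failed reload (for example, MySQL being briefly unreachable) would otherwise silently end all later reloads.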
4. Packaging
Because we connect to a MySQL database, the Maven project has to include the MySQL driver:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>6.0.6</version>
</dependency>
Everything is ready; build the package: mvn clean package
Once the build finishes, upload the package, restart ES, and try it out.
5. Results
Insert a record into the database:

GET http://127.0.0.1:9200/g_index/_analyze?text=王者荣耀&analyzer=ik_max_word
{
  "tokens": [
    {"token": "王者荣耀", "start_offset": 0, "end_offset": 5, "type": "CN_WORD", "position": 0},
    {"token": "王者", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1},
    {"token": "王", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 2},
    {"token": "者", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 3},
    {"token": "荣耀", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 4},
    {"token": "荣", "start_offset": 3, "end_offset": 4, "type": "CN_WORD", "position": 5},
    {"token": "耀", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 6}
  ]
}
