Neo4j - neo4j图形数据库入门 - 《自然语言处理》

neo4j图形数据库入门
自然语言处理模块结合
总结
- 报错

neo4j图形数据库入门

在对文本的数据进行充分的理解的基础上，需要对处理结果进行保存，由于主要工作是建立知识图谱，可以考虑使用图形数据库进行存储，使用neo数据库
基本的应用语法

from py2neo import Graph, Node, Relationship
##连接neo4j数据库，输入地址、用户名、密码
graph = Graph('http://localhost:7474/browser/', username='neo4j', password='neo4j')
tx = graph.begin()
a = Node("人物", name=x.word)
tx.create(a)
tx.commit()

上述的代码可以实现创立节点并提交到数据库

自然语言处理模块结合

随后开始考虑怎么和自然语言处理模块进行结合
最后实现了下述代码

import jieba
import jieba.analyse
import jieba.posseg
from py2neo import Graph, Node, Relationship
novel = open('D:\\New_desktop\\1.txt','r',encoding='UTF-8')
content=novel.read()
novel_segmented = open('D:\\New_desktop\\2.txt','w')
cutword = jieba.cut(content,cut_all=False)
seg = ' '.join(cutword).replace(',','').replace('。','').replace('“','').replace('”','').replace('：','').replace('…','')\
            .replace('！','').replace('？','').replace('~','').replace('（','').replace('）','').replace('、','').replace('；','')
try:
    with open('D:\\New_desktop\\2.txt', 'w+', encoding='utf-8') as f:
        f.write(seg)
except FileNotFoundError:
    print('无法打开指定的文件!')
except LookupError:
    print('指定了未知的编码!')
except UnicodeDecodeError:
    print('读取文件时解码错误!')
novel.close()
novel_segmented.close()
# 训练word2vec模型，生成词向量
from gensim.models import word2vec
# 训练word2vec模型，生成词向量
s = word2vec.LineSentence('D:\\New_desktop\\2.txt')
# 这里如果报编码错误，notepad++打开，然后转码为对应编码即可。
model = word2vec.Word2Vec(s,size=200,window=5,min_count=0,workers=4)
model.save('D:\\New_desktop\\3.txt')
print("model caculate over")
##连接neo4j数据库，输入地址、用户名、密码
graph = Graph('http://localhost:7474/browser/', username='neo4j', password='20010413ssjssj*')
tx = graph.begin()
print("graph establish over")
filename = "D:\\New_desktop\\1.txt"
fa = open(filename,"r",encoding='utf-8')
result = list()
for line in fa.readlines():
    line = line.strip()
    if not len(line):
        continue
    result.append(line)
fa.close
content=""
for sentence in result:
    sentence.encode('utf-8')
    data=sentence.strip()
    if len(data)!=0:
        content+=data
cutword = jieba.cut(content,cut_all=False)
seg = ''.join(cutword).replace(',','').replace('。','').replace('“','').replace('”','').replace('：','').replace('…','')\
            .replace('！','').replace('？','').replace('~','').replace('（','').replace('）','').replace('、','').replace('；','').replace('，','')
sentence_seged = jieba.posseg.cut(seg.strip())
print("divide word over")
outstr = ''
for x in sentence_seged:
    #outstr += "{}/{},".format(x.word, x.flag)
    # 上面的for循环可以用python递推式构造生成器完成
    # outstr = ",".join([("%s/%s" %(x.word,x.flag)) for x in sentence_seged])
    if x.flag == 'n':
        try:
            similar=model.wv.similarity("叶修",x.word)
            if similar>0.48:
                a = Node("人物", name=x.word)
                print(x.word)
                tx.create(a)
        except KeyError:
            print("not in vocabulary")
            print(x.word)
tx.commit()

这个代码是对文本进行分词解析，解析出词性，然后利用词向量进行分析。
作为小说的一名读者，我有“叶修是人”的先验知识，结合上一篇笔记对词向量的解析使用中，和叶修相似度高的都是人，所以考虑使用“叶修”作为分类数据，对所有的名词进行计算，对相似度高于一定阈值的认为是人，然后保存进数据库，进而实现出抽取出文本中人物实体的目的

显然在算法和代码上出了问题，只提取到了陈果和大家两个实体，而且没有加入查重的功能，还有待改进

总结

报错

报错有关于词向量的，词向量报错词不存在：解决，这个就是理解为jieba分词的局限性，在第一次分词的结果和词向量解析结果不同，如jieba分出了“时叶秋”这个词，并不在词向量的词库里，就出现了报错，最后使用代码里的try catch的方式来捕获异常。更好的一个解决方案就是改进分词算法，使得分词更加的精确
报错里有数据库提交不合法的，经过排查，初步认为应该是重复提交的问题，我似乎一下子提交了两个一样的变量就引发了异常（但是明明相同的节点是可以存在的）
在查重和分词计算上还可以大幅提高