如何用neo4j对50万掘金用户进行分析
数据获取
利用 python的requests 写一个简单的爬虫,以站长 阴明 为起点,利用对掘金用户的follower和 followee 关系进行深度和广度遍历抓取,共获取到50万用户(目前掘友总数据估计在200万以上),711万的follow关系。对于没有follow关系的用户,暂不做分析。
follow数据样例:
{
"objectId": "56887f661482e8a3f10d5061",
"follower": {
"objectId": "56887e3e60b2a099cdd3d8df",
"username": "逗论",
...
},
"followeeId": "557f9654e4b0d02dc2d7125d",
"createdAtString": "2016-01-03T01:54:46.919Z",
"updatedAtString": "2016-01-03T01:54:46.947Z"
}
user数据样例:
{
"objectId": "5946111561ff4b006ced8427",
"followersCount": 0,
"followeesCount": 1,
"username": "abu1497764117405",
...
}
关于 neo4j
neo4j是当前知识图谱领域,一款非常优秀的图数据存储和分析挖掘的主流(数据库)工具,当前已经更新至4.0版本。通过neo4j能够实现上亿级别的实体和关系挖掘。本文以掘友用户User和关注follow关系网分析为主,构建的(实体A,关系,实体B)三元关系组相对简单,主要为(User,follow,User)。如下图
数据导入
neo4j 导入数据方法有很多,如LOAD CSV能够读取CSV文件、neo4j-import能够对上亿级别数据快速初始化导入、apoc-jdbc能够直连常见关系型数据库mysql\oarcle等获取数据。
为简便使用,本次采集的数据以json格式存放在文件中。导入方式选择读取文件,加载json string并解析。
启动neo4j shell:
$ ./bin/cypher-shell
给User添加索引,提高后续查询效率
CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;
Added 1 constraints
0 rows available after 205 ms, consumed after another 1 ms
导入用户User Node信息:
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/basicinfo.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap( row[0]) AS info WHERE LENGTH(info["d"][info["id"]]) = 1
MERGE (u:User{id:info["id"]})
ON CREATE SET u += apoc.map.clean(info["d"][info["id"]], ['community','roles','cancelBakup'], [])
;
0 rows available after 99179 ms, consumed after another 0 ms
Added 500255 nodes, Set 18492458 properties, Added 500255 labels
共导入 500255 个用户信息,共耗时 99 秒,速度还不错,简单核查下结果。
neo4j> MATCH (n:User) RETURN COUNT(n);
+----------+
| COUNT(n) |
+----------+
| 500255 |
+----------+
1 row available after 10 ms, consumed after another 1 ms
导入用户follow relation信息:
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/data.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap(apoc.text.join(row,",")) AS info
UNWIND info["d"] AS elem
WITH
info["type"] AS type,
CASE WHEN info["type"] = "Followee" THEN elem["followerId"] WHEN info["type"] = "Follower" THEN elem["follower"]["objectId"] END AS startId,
CASE WHEN info["type"] = "Followee" THEN elem["followee"]["objectId"] WHEN info["type"] = "Follower" THEN elem["followeeId"] END AS endId ,
elem["createdAtString"] AS createdAtString,
elem["updatedAtString"] AS updatedAtString
MATCH (startNode:User{id:startId}),(endNode:User{id:endId})
MERGE(startNode)-[r:follow]->(endNode)
ON CREATE SET r.createdAtString = createdAtString,
r.updatedAtString = updatedAtString
;
0 rows available after 746892 ms, consumed after another 0 ms
Created 7115107 relationships, Set 14230214 properties
共导入 7115107 条用户的关注follow关系,共耗时 746 秒,核查下结果。
neo4j> MATCH p=()-[r:follow]->() RETURN COUNT(r);
+----------+
| COUNT(r) |
+----------+
| 7115107 |
+----------+
1 row available after 7 ms, consumed after another 0 ms
数据初探
言归正传,在数据备好后,对掘友信息选取一些重要维度进行分析。
掘友用户增量情况
// 每月新增用户和关系情况
MATCH (u) WHERE EXISTS(u.createdAt)
WITH apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM') AS createdAt, COUNT(1) AS num
WITH apoc.map.fromLists(COLLECT(createdAt), COLLECT(num)) AS userMap
MATCH ()<-[r]-()
WITH userMap,r.createdAtString AS createdAtString
WITH userMap,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM') AS createdAtString, COUNT(1) AS num
WITH createdAtString, num AS followNum ,userMap[createdAtString] AS userNum
ORDER BY createdAtString
WITH COLLECT({createdAtString:createdAtString, userNum:userNum ,followNum:followNum} ) AS infoList
UNWIND RANGE(0,LENGTH(infoList)-1) AS index
WITH infoList[index]["createdAtString"] AS createdAtString,
infoList[index]["userNum"] AS userNum,
infoList[index]["followNum"] AS followNum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["userNum"]]) AS userNumSum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["followNum"]]) AS followNumSum
RETURN createdAtString,userNum,followNum,userNumSum,followNumSum,followNumSum/userNumSum AS followPerUser
ORDER BY createdAtString DESC
LIMIT 10
;
+----------------------------------------------------------------------------------------+
| createdAtString | userNum | followNum | userNumSum | followNumSum | followPerUser |
+----------------------------------------------------------------------------------------+
| "2019-12" | 583 | 3436 | 496639.0 | 7115107.0 | 14.326516846240429 |
| "2019-11" | 10098 | 172500 | 496056.0 | 7111671.0 | 14.336427742029127 |
| "2019-10" | 11957 | 193671 | 485958.0 | 6939171.0 | 14.279363648710383 |
| "2019-09" | 11995 | 202601 | 474001.0 | 6745500.0 | 14.230982635057732 |
| "2019-08" | 12585 | 210558 | 462006.0 | 6542899.0 | 14.16193512638364 |
| "2019-07" | 14042 | 228781 | 449421.0 | 6332341.0 | 14.089998019674203 |
| "2019-06" | 11522 | 196869 | 435379.0 | 6103560.0 | 14.018958195043858 |
| "2019-05" | 12924 | 223838 | 423857.0 | 5906691.0 | 13.935574969860118 |
| "2019-04" | 13592 | 233482 | 410933.0 | 5682853.0 | 13.829147330586737 |
| "2019-03" | 15747 | 283769 | 397341.0 | 5449371.0 | 13.714595272071092 |
+----------------------------------------------------------------------------------------+
10 rows available after 55237 ms, consumed after another 0 ms
从分析结果来看,掘友用户增一直保持着很稳定的态势。同时,随着用户的增多,掘友之间的关系也相对增多,也体现了用户之间的交互更加密切,是用户用户活跃度的简介体现。
元老掘友分析
来看下掘友圈,那些跟随者掘金一起成长的最早的一批元老们。
// 粉丝数、最早注册等统计, 每日新增粉丝
MATCH (u:User)<-[r]-() WHERE EXISTS(u.createdAt) AND EXISTS(r.createdAtString)
WITH u,r.createdAtString AS createdAtString
WITH u,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM-dd') AS createdAtString, COUNT(1) AS num
WITH u, MAX(num ) AS maxNum
RETURN u.username,
apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM-dd') AS createdAt,
apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS days,
u.followersCount AS followersCount,
u.followersCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS aveNum,
maxNum
ORDER BY createdAt
// ORDER BY aveNum DESC
// ORDER BY maxNum DESC
LIMIT 5
;
- 最早注册掘金的元老们: ```bash +———————————————————————————————————-+ | u.username | createdAt | days | followersCount | aveNum | maxNum | +———————————————————————————————————-+ | “Ming_Zhe” | “2015-03-24” | 1719 | 20 | 0 | 3 | | “代码家” | “2015-03-24” | 1719 | 140 | 0 | 6 | | “稀土君” | “2015-04-02” | 1709 | 52904 | 30 | 308 | | “江昪” | “2015-04-02” | 1710 | 9110 | 5 | 68 | | “阴明” | “2015-04-02” | 1709 | 74262 | 43 | 286 | | “fskslafhd” | “2015-04-15” | 1697 | 8 | 0 | 2 | | “荆全齐” | “2015-04-16” | 1696 | 96 | 0 | 4 | | “晓风well” | “2015-04-16” | 1696 | 192 | 0 | 6 | | “开源小组” | “2015-04-16” | 1696 | 43 | 0 | 3 | | “Millie_Lin” | “2015-04-16” | 1696 | 7744 | 4 | 67 | +———————————————————————————————————-+
- 最吸粉的大佬们(平均每日新增粉丝),平均一天吸粉近60。
```bash
+--------------------------------------------------------------------------+
| u.username | createdAt | days | followersCount | aveNum | maxNum |
+--------------------------------------------------------------------------+
| "石杉的架构笔记" | "2018-11-05" | 396 | 22877 | 57 | 189 |
| "刘小夕" | "2019-02-12" | 298 | 17086 | 57 | 326 |
| "闲鱼技术" | "2018-04-03" | 613 | 27634 | 45 | 157 |
| "ConardLi" | "2018-11-13" | 389 | 17373 | 44 | 285 |
| "美团技术团队" | "2018-03-30" | 617 | 27576 | 44 | 283 |
| "阴明" | "2015-04-02" | 1709 | 74262 | 43 | 286 |
| "Java3y" | "2018-01-30" | 676 | 29522 | 43 | 144 |
| "腾讯云加社区" | "2017-02-24" | 1016 | 44415 | 43 | 297 |
| "小姐姐味道" | "2018-10-30" | 403 | 17025 | 42 | 157 |
| "ikoala" | "2019-06-01" | 188 | 7550 | 40 | 134 |
+--------------------------------------------------------------------------+
10 rows available after 55589 ms, consumed after another 0 ms
掘友粉丝分布
// 粉丝分布
MATCH (u:User)
WHERE EXISTS(u.followersCount)
RETURN COUNT(u.followersCount) AS count,
AVG(u.followersCount) AS ave,
percentileDisc(u.followersCount, 0.5) AS `50%`,
percentileDisc(u.followersCount, 0.75) AS `75%`,
percentileDisc(u.followersCount, 0.90) AS `90%`,
percentileDisc(u.followersCount, 0.95) AS `95%`,
percentileDisc(u.followersCount, 0.99) AS `99%`,
percentileDisc(u.followersCount, 0.999) AS `99.9%`,
percentileDisc(u.followersCount, 0.9999) AS `99.99%`,
percentileDisc(u.followersCount, 0.99999) AS `99.999%`,
percentileDisc(u.followersCount, 1) AS `100%`
;
+---------------------------------------------------------------------------------------------+
| count | ave | 50% | 75% | 90% | 95% | 99% | 99.9% | 99.99% | 99.999% | 100% |
+---------------------------------------------------------------------------------------------+
| 499934 | 15.86762652670079 | 0 | 0 | 1 | 2 | 58 | 3721 | 17025 | 37657 | 74262 |
+---------------------------------------------------------------------------------------------+
1 row available after 4029 ms, consumed after another 0 ms
可以看到超过75%的掘友还没有粉丝额。要想排在掘友 top 1 内,也至少有58个粉丝。再看下 top1内的大佬吸粉情况:
// 粉丝占比
MATCH (u:User)
WHERE EXISTS(u.followersCount)
WITH
SUM(CASE WHEN u.followersCount <=60 THEN u.followersCount ELSE 0 END) AS low,
SUM(CASE WHEN u.followersCount >60 THEN u.followersCount ELSE 0 END) AS high,
SUM(u.followersCount) AS followersCountSum
RETURN low,high,followersCountSum,
low*1.0/followersCountSum AS lowPercent,
high*1.0/followersCountSum AS highPercent
;
+---------------------------------------------------------------------------------+
| low | high | followersCountSum | lowPercent | highPercent |
+---------------------------------------------------------------------------------+
| 210930 | 7721836 | 7932766 | 0.02658971662595367 | 0.9734102833740463 |
+---------------------------------------------------------------------------------+
1 row available after 1399 ms, consumed after another 0 ms
果然,“97% 的粉丝都掌握在 1% 的 大佬手中”,掘友的粉丝贫富差距肉眼可见。
那些高产的大佬们
// 发布文章最多的,发布沸点最多的
MATCH (u:User) WHERE
// EXISTS(u.pinCount)
EXISTS(u.postedPostsCount)
RETURN u.id AS id,
u.username AS username,
u.pinCount AS pinCount,
u.pinCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS pinCountAve,
u.postedPostsCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS postedPostsCountAve,
u.totalCollectionsCount AS totalCollectionsCount,
u.postedPostsCount AS postedPostsCount,
u.totalViewsCount AS totalViewsCount
ORDER BY postedPostsCount DESC
LIMIT 10
;
- 发布文章最多的大佬们: ```bash
+——————————————————————————————————————+ | username | postedPostsCountAve | postedPostsCount | totalViewsCount | +——————————————————————————————————————+ | “乌云知识库” | 1 | 1224 | 18030 | | “刘旷” | 1 | 1066 | 8336 | | “千锋JAVA开发” | 1 | 996 | 23818 | | “阿里云云栖社区” | 0 | 932 | 152109 | | “芊宝宝” | 1 | 922 | 19598 | | “LLrZqSg3” | 2 | 913 | 506 | | “go4it” | 0 | 780 | 153028 | | “JerryWang_sap” | 1 | 771 | 7178 | | “腾讯云加社区” | 0 | 759 | 685628 | | “云计算百科” | 1 | 754 | 6543 | +——————————————————————————————————————+
10 rows available after 1745 ms, consumed after another 0 ms
- 发布沸点最多的大佬们
```bash
+--------------------------------------------------------------+
| username | pinCount | pinCountAve | totalViewsCount |
+--------------------------------------------------------------+
| "网路冷眼" | 19397 | 21 | NULL |
| "湾区日报" | 6836 | 10 | 0 |
| "神奇的命令行" | 1989 | 1 | 108035 |
| "程序员趣事" | 1583 | 2 | 0 |
| "树洞robot" | 1530 | 4 | 0 |
| "HackerNews什么值 | 1510 | 4 | 0 |
| "爱可可-爱生活" | 1484 | 4 | 58 |
| "科技新闻搬运工" | 1476 | 2 | 0 |
| "娱乐小编" | 1283 | 2 | 2761 |
| "王兴的饭否" | 1264 | 1 | 0 |
+--------------------------------------------------------------+
10 rows available after 726 ms, consumed after another 1 ms
影响力分析挖掘
前面对掘友圈进行了粗略分析,粉丝数、文章数等都是单维度的,那么最有影响力掘友该如何分析,采用pageRank进行建模,如下
// pageRank 影响力挖掘分析
MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH u0,algo.getNodeById(nodeId) AS u, score
WITH u,score, CASE WHEN EXISTS((u0)-[:follow]-(u)) THEN 1 ELSE 0 END AS hasRelation
RETURN
u.username AS username,
score,hasRelation
ORDER BY score DESC
LIMIT 20
;
+---------------------------------------------------+
| username | score | hasRelation |
+---------------------------------------------------+
| "阴明" | 5492.921815348415 | 0 |
| "HollisChuang" | 3675.980443021097 | 1 |
| "漫话编程" | 3543.827678073035 | 1 |
| "超人汪小建" | 3378.3619996590537 | 1 |
| "稀土君" | 2878.5440236374734 | 1 |
| "stormzhangV" | 2855.733890181873 | 0 |
| "前端外刊评论" | 2841.045798705519 | 1 |
| "liutao" | 2481.8655670162293 | 1 |
| "膜法小编" | 2047.3083251186647 | 1 |
| "LucasHC" | 2030.568352384912 | 0 |
| "NeXT" | 1922.9859172880645 | 1 |
| "李CHENGXI" | 1752.8077864374964 | 1 |
| "水墨寒" | 1528.3845290698114 | 1 |
| "清蒸不是水煮" | 1471.3738727076911 | 1 |
| "蚂蚁金服数据.." | 1287.2166685772129 | 1 |
| "丁一" | 1252.5015809553672 | 1 |
| "美团技术团队" | 1250.5822049211245 | 1 |
| "江昪" | 1234.908881070744 | 1 |
| "腾讯云加社区" | 1169.060374834016 | 1 |
| "薄荷前端" | 1093.3464047224304 | 0 |
+---------------------------------------------------+
20 rows available after 5093 ms, consumed after another 0 ms
// pageRank 影响力前10 的用户之间的关系
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS u, score
WITH u,score ORDER BY score DESC LIMIT 10
WITH COLLECT(u) AS Users
UNWIND Users AS u1
UNWIND Users AS u2
MATCH p=(u1)-[r:follow]-(u2)
RETURN p;
影响力前10名的的大佬之间,错综复杂,剪不断理还乱的关系,果然,大佬们都是在玩圈子啊。
互粉情况分析
话说,互粉是中华民族的传统美德,来看下,哪些大拿和粉丝互粉互动较多呢?
// 互粉的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
RETURN COUNT(DISTINCT u1.id+u2.id) AS num
;
+-------+
| num |
+-------+
| 22732 |
+-------+
1 row available after 32644 ms, consumed after another 0 ms
可以看出,掘友圈友超过22万互粉用户。
// 自我关注的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)=id(u2)
RETURN COUNT(DISTINCT u1.id) AS num
;
+-----+
| num |
+-----+
| 147 |
+-----+
1 row available after 0 ms, consumed after another 1351 ms
当然,也不乏自我互粉的掘友,你是这147位掘友中的一员吗。
// 互粉最多的用户
MATCH (u1:User)<-[:follow]-(u2:User)// WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
WITH u1,u2
WITH u1,COUNT(u2) AS num
RETURN u1.username,num
ORDER BY num DESC
LIMIT 20
;
+--------------------------+
| u1.username | num |
+--------------------------+
| "liutao" | 641 |
| "膜法小编" | 449 |
| "阴明" | 367 |
| "黑马UI" | 322 |
| "pilishen" | 315 |
| "拥抱心中的梦想" | 296 |
| "powerzhuye" | 260 |
| "zhennann" | 237 |
| "Mockplus" | 230 |
| "MarvinZhang" | 184 |
| "清蒸不是水煮" | 176 |
| "一颗香菜" | 154 |
| "断天涯大虾" | 152 |
| "爱原型爱设计" | 151 |
| "闻人的技术博客" | 141 |
| "bytedance" | 139 |
| "Bug开发者" | 136 |
| "江昪" | 123 |
| "NervosNetwork" | 118 |
+--------------------------+
20 rows available after 23 ms, consumed after another 35454 ms
最短路径分析
neo4j 提供了方便快捷的查询给定2个节点之间最短路径函数shortestPath,通过制定
- 寻找 saiwaiyanyu 和 阴明 在9度关系内的最短路径。 ```bash // 最短路径 MATCH path = shortestPath((u1:User{id:”5a8d0c5df265da4e9d223ba5”})-[:follow*1..9]-(u2:User{id:”551d677ee4b0cd5b623f49cb”})) RETURN path,u1.username,u2.username LIMIT 1;
![image.png](https://cdn.nlark.com/yuque/0/2021/png/265643/1625537061945-0568b22d-52a8-44e9-a316-6106a13d9dd8.png#clientId=u83db8637-720f-4&from=paste&id=ua7a2234b&margin=%5Bobject%20Object%5D&name=image.png&originHeight=61&originWidth=226&originalType=url&ratio=1&size=10733&status=done&style=none&taskId=u5ce368b3-4ed6-4529-a8a9-9803e0bec9c)
- 寻找 saiwaiyanyu 和 头条搜索前端团队 的字段路径
```bash
// 最短路径
MATCH path = shortestPath((u1:User{id:"5a8d0c5df265da4e9d223ba5"})-[:follow*1..9]-(u2:User{id:"5de0f0a0f265da06113c700f"}))
RETURN path,u1.username,u2.username
LIMIT 1
最长径分析
通过shortestPath能够方便查询给定点之间的最短关系,如果需要查询比较长的关系路径,该如何实现呢。
- 查询存在6度关系的节点: ```bash
// 最长路径存在6度关系 MATCH path = (u1:User)-[:follow*6..6]-(u2:User) WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2) RETURN path LIMIT 1;
![image.png](https://cdn.nlark.com/yuque/0/2021/png/265643/1625537061972-c9810e48-0b73-48be-8619-c5a51f813779.png#clientId=u83db8637-720f-4&from=paste&id=uf7c89765&margin=%5Bobject%20Object%5D&name=image.png&originHeight=185&originWidth=538&originalType=url&ratio=1&size=39559&status=done&style=none&taskId=u155e4675-50a5-4d49-8577-8a952da817e)<br />在茫茫人海中,任何2个用户总能找到他们的关联,无论是隔壁邻居的八大姨的三外甥的同学。。。
- 查询存在6度关系的节点,且指定follow方向:
```bash
// 最长路径存在6度关系
MATCH path = (u1:User)<-[:follow*6..6]-(u2:User)
WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
RETURN path
LIMIT 1
;
这一条粉丝关系路径算是比较长了。
- 粉丝关系形成闭环的情形
// 最长路径MATCH (u:User) WHERE NOT EXISTS( (u)<-[:follow*1..6]-(u) )
MATCH path = (u)<-[:follow*7..7]-(u)
WHERE ALL(u IN [u0 IN FILTER(n IN NODES(path) WHERE ID(n)<>ID(u)) | size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
RETURN u.username,path, LENGTH(path)
LIMIT 1
;
关于用户推荐
推荐算法比较经典的有基于用户和基于物品,以及今年应用越来越广的基于深度学习模型的推荐。
因采集到的用户数据维度较少,用户的文章、沸点、行为数据等缺失,因此计算用户相似度等存在一些问题。尝试用用户关注关系实现一个简单的推荐。
- 关注他的用户也关注了
给定用户,通过查询该用户粉丝关注的其他用户,查看粉丝群的共同关注。
MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
MATCH (u0)<-[:follow]-(u1:User)-[:follow]->(g:User)
WHERE NOT ID(u0)=ID(g)
WITH u0, g, COUNT(DISTINCT u1) AS sameFollowNum
RETURN u0.username,u0.id, g.username,g.id, sameFollowNum
ORDER BY sameFollowNum DESC
LIMIT 20
;
u0.username | g.username | sameFollowNum |
---|---|---|
阴明 | 前端外刊评论 | 21304 |
阴明 | 李CHENGXI | 21286 |
阴明 | 水墨寒 | 20161 |
阴明 | 稀土君 | 15042 |
阴明 | liutao | 10898 |
阴明 | 刘欣 | 9434 |
阴明 | NeXT | 8770 |
阴明 | 丁一 | 8741 |
阴明 | 雨神姥爷 | 8454 |
阴明 | taotao.li | 8392 |
阴明 | 闲鱼技术 | 8290 |
阴明 | HollisChuang | 8050 |
阴明 | 吆喝科技_Zoran | 7810 |
阴明 | 腾讯云加社区 | 7554 |
阴明 | 杨昕霖 | 7391 |
阴明 | 超人汪小建 | 7172 |
阴明 | 可乐橙 | 7137 |
阴明 | 饿了么UED | 7039 |
阴明 | JJ-Ying | 7039 |
阴明 | stormzhangV | 6910 |
从结果看,21304 用户既关注了阴明,同时也关注了前端外刊评论。
- 和我一起关注的用户也关注了他 ```bash // 和类似的用户也关注了 MATCH(u0:User{id:’5a8d0c5df265da4e9d223ba5’}) MATCH (u0)-[:follow]->(:User)<-[:follow]-(u1)-[:follow]->(g:User) WHERE NOT (u0)-[:follow]->(g) WITH u0, g, COUNT(DISTINCT u1) AS sameFollowNum RETURN u0.username, g.username, sameFollowNum ORDER BY sameFollowNum DESC LIMIT 10 ;
| u0.username | g.username | sameFolloweeNum | sameFollowNum |
| --- | --- | --- | --- |
| saiwaiyanyu | 稀土君 | 24 | 23476 |
| saiwaiyanyu | 前端外刊评论 | 23 | 21760 |
| saiwaiyanyu | 李CHENGXI | 23 | 21662 |
| saiwaiyanyu | 水墨寒 | 23 | 20683 |
| saiwaiyanyu | 腾讯云加社区 | 23 | 18890 |
| saiwaiyanyu | liutao | 24 | 16694 |
| saiwaiyanyu | 超人汪小建 | 24 | 13954 |
| saiwaiyanyu | 闲鱼技术 | 23 | 12828 |
| saiwaiyanyu | Java3y | 24 | 12746 |
| saiwaiyanyu | 美团技术团队 | 24 | 12667 |
从结果看出有23476 用户,关注了 saiwaiyanyu关注的用户,且这23476也关注了 稀土君,可以尝试将稀土君推荐给saiwaiyanyu。
<a name="KoZbK"></a>
## 代码
```bash
// 创建主键
CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;
// 导入节点
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/basicinfo.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap( row[0]) AS info WHERE LENGTH(info["d"][info["id"]]) = 1
MERGE (u:User{id:info["id"]})
ON CREATE SET u += apoc.map.clean(info["d"][info["id"]], ['community','roles','cancelBakup'], [])
;
// 导入关系
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/data.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap(apoc.text.join(row,",")) AS info
UNWIND info["d"] AS elem
WITH
info["type"] AS type,
CASE WHEN info["type"] = "Followee" THEN elem["followerId"] WHEN info["type"] = "Follower" THEN elem["follower"]["objectId"] END AS startId,
CASE WHEN info["type"] = "Followee" THEN elem["followee"]["objectId"] WHEN info["type"] = "Follower" THEN elem["followeeId"] END AS endId ,
elem["createdAtString"] AS createdAtString,
elem["updatedAtString"] AS updatedAtString
MATCH (startNode:User{id:startId}),(endNode:User{id:endId})
MERGE(startNode)-[r:follow]->(endNode)
ON CREATE SET r.createdAtString = createdAtString,
r.updatedAtString = updatedAtString
;
// 每月新增用户和关系情况
MATCH (u) WHERE EXISTS(u.createdAt)
WITH apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM') AS createdAt, COUNT(1) AS num
WITH apoc.map.fromLists(COLLECT(createdAt), COLLECT(num)) AS userMap
MATCH ()<-[r]-()
WITH userMap,r.createdAtString AS createdAtString
WITH userMap,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM') AS createdAtString, COUNT(1) AS num
WITH createdAtString, num AS followNum ,userMap[createdAtString] AS userNum
ORDER BY createdAtString
WITH COLLECT({createdAtString:createdAtString, userNum:userNum ,followNum:followNum} ) AS infoList
UNWIND RANGE(0,LENGTH(infoList)-1) AS index
WITH infoList[index]["createdAtString"] AS createdAtString,
infoList[index]["userNum"] AS userNum,
infoList[index]["followNum"] AS followNum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["userNum"]]) AS userNumSum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["followNum"]]) AS followNumSum
RETURN createdAtString,userNum,followNum,userNumSum,followNumSum,followNumSum/userNumSum AS followPerUser
ORDER BY createdAtString DESC
LIMIT 10
;
// 粉丝数、最早注册等统计, 每日新增粉丝
MATCH (u:User)<-[r]-() WHERE EXISTS(u.createdAt) AND EXISTS(r.createdAtString)
WITH u,r.createdAtString AS createdAtString
WITH u,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM-dd') AS createdAtString, COUNT(1) AS num
WITH u, MAX(num ) AS maxNum
RETURN u.username,
apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM-dd') AS createdAt,
apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS days,
u.followersCount AS followersCount,
u.followersCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS aveNum,
maxNum
// ORDER BY createdAt
ORDER BY aveNum DESC
// ORDER BY maxNum DESC
LIMIT 10
;
// 粉丝分布
MATCH (u:User)
WHERE EXISTS(u.followersCount)
RETURN COUNT(u.followersCount) AS count,
AVG(u.followersCount) AS ave,
percentileDisc(u.followersCount, 0.5) AS `50%`,
percentileDisc(u.followersCount, 0.75) AS `75%`,
percentileDisc(u.followersCount, 0.90) AS `90%`,
percentileDisc(u.followersCount, 0.95) AS `95%`,
percentileDisc(u.followersCount, 0.99) AS `99%`,
percentileDisc(u.followersCount, 0.999) AS `99.9%`,
percentileDisc(u.followersCount, 0.9999) AS `99.99%`,
percentileDisc(u.followersCount, 0.99999) AS `99.999%`,
percentileDisc(u.followersCount, 1) AS `100%`
;
// 粉丝占比
MATCH (u:User)
WHERE EXISTS(u.followersCount)
WITH
SUM(CASE WHEN u.followersCount <=60 THEN u.followersCount ELSE 0 END) AS low,
SUM(CASE WHEN u.followersCount >60 THEN u.followersCount ELSE 0 END) AS high,
SUM(u.followersCount) AS followersCountSum
RETURN low,high,followersCountSum,
low*1.0/followersCountSum AS lowPercent,
high*1.0/followersCountSum AS highPercent
;
// pageRank 影响力挖掘分析
MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH u0,algo.getNodeById(nodeId) AS u, score
WITH u,score, CASE WHEN EXISTS((u0)-[:follow]-(u)) THEN 1 ELSE 0 END AS hasRelation
RETURN
u.username AS username,
score,hasRelation
ORDER BY score DESC
LIMIT 20
;
// pageRank 影响力前10 的用户之间的关系
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS u, score
WITH u,score ORDER BY score DESC LIMIT 10
WITH COLLECT(u) AS Users
UNWIND Users AS u1
UNWIND Users AS u2
MATCH p=(u1)-[r:follow]-(u2)
RETURN p;
// 互粉的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
RETURN COUNT(DISTINCT u1.id+u2.id) AS num
;
// 自我互粉的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)=id(u2)
RETURN COUNT(DISTINCT u1.id) AS num
;
// 互粉最多的用户
MATCH (u1:User)<-[:follow]-(u2:User)// WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
WITH u1,u2
WITH u1,COUNT(u2) AS num
RETURN u1.username,num
ORDER BY num DESC
LIMIT 20
;
// 最长路径
MATCH path = (u1:User)<-[:follow*9..9]-(u2:User)
WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
RETURN path, LENGTH(path)
// ORDER BY LENGTH(path) DESC
LIMIT 1
;
// 发布文章最多的,发布沸点最多的
MATCH (u:User) WHERE
// EXISTS(u.pinCount)
EXISTS(u.postedPostsCount)
RETURN u.id AS id,
u.username AS username,
u.pinCount AS pinCount,
u.pinCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS pinCountAve,
u.postedPostsCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS postedPostsCountAve,
u.totalCollectionsCount AS totalCollectionsCount,
u.postedPostsCount AS postedPostsCount,
u.totalViewsCount AS totalViewsCount
ORDER BY postedPostsCount DESC
LIMIT 10
;
// 最短路径
MATCH path = shortestPath((u1:User{id:"5a8d0c5df265da4e9d223ba5"})-[:follow*1..9]-(u2:User{id:"551d677ee4b0cd5b623f49cb"}))
RETURN path,u1.username,u2.username
LIMIT 1;
// 最短路径
MATCH path = shortestPath((u1:User{id:"5a8d0c5df265da4e9d223ba5"})-[:follow*1..9]-(u2:User{id:"5de0f0a0f265da06113c700f"}))
RETURN path,u1.username,u2.username
LIMIT 1
;
// 最长路径
MATCH path = (u1:User)-[:follow*6..6]-(u2:User)
WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
RETURN path
LIMIT 1;
// 最长路径
MATCH (u:User) WHERE NOT EXISTS( (u)<-[:follow*1..6]-(u) )
MATCH path = (u)<-[:follow*7..7]-(u)
WHERE ALL(u IN [u0 IN FILTER(n IN NODES(path) WHERE ID(n)<>ID(u)) | size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
RETURN u.username,path, LENGTH(path)
// ORDER BY LENGTH(path) DESC
LIMIT 1
;
// 关注他的用户也关注了
MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
MATCH (u0)<-[:follow]-(u1:User)-[:follow]->(g:User)
WHERE NOT ID(u0)=ID(g)
WITH u0, g, COUNT(DISTINCT u1) AS sameFollowNum
RETURN u0.username,u0.id, g.username,g.id, sameFollowNum
ORDER BY sameFollowNum DESC
LIMIT 20
;
// 和类似的用户也关注了
MATCH(u0:User{id:'5a8d0c5df265da4e9d223ba5'})
MATCH (u0)-[:follow]->(s:User)<-[:follow]-(u1)-[:follow]->(g:User)
WHERE NOT (u0)-[:follow]->(g)
WITH u0, g,COUNT(DISTINCT s) AS sameFolloweeNum, COUNT(DISTINCT u1) AS sameFollowNum
RETURN u0.username, g.username, sameFolloweeNum,sameFollowNum
ORDER BY sameFollowNum DESC
LIMIT 10
;
作者:saiwaiyanyu
链接:https://juejin.cn/post/6844904020440514574
来源:掘金
分析代码:github.com/saiwaiyanyu…