如何用neo4j对50万掘金用户进行分析

数据获取

利用 python的requests 写一个简单的爬虫,以站长 阴明 为起点,利用对掘金用户的follower和 followee 关系进行深度和广度遍历抓取,共获取到50万用户(目前掘友总数据估计在200万以上),711万的follow关系。对于没有follow关系的用户,暂不做分析。
follow数据样例:

  1. {
  2. "objectId": "56887f661482e8a3f10d5061",
  3. "follower": {
  4. "objectId": "56887e3e60b2a099cdd3d8df",
  5. "username": "逗论",
  6. ...
  7. },
  8. "followeeId": "557f9654e4b0d02dc2d7125d",
  9. "createdAtString": "2016-01-03T01:54:46.919Z",
  10. "updatedAtString": "2016-01-03T01:54:46.947Z"
  11. }

user数据样例:

  1. {
  2. "objectId": "5946111561ff4b006ced8427",
  3. "followersCount": 0,
  4. "followeesCount": 1,
  5. "username": "abu1497764117405",
  6. ...
  7. }

关于 neo4j

neo4j是当前知识图谱领域,一款非常优秀的图数据存储和分析挖掘的主流(数据库)工具,当前已经更新至4.0版本。通过neo4j能够实现上亿级别的实体和关系挖掘。本文以掘友用户User和关注follow关系网分析为主,构建的(实体A,关系,实体B)三元关系组相对简单,主要为(User,follow,User)。如下图
image.png

数据导入

neo4j 导入数据方法有很多,如LOAD CSV能够读取CSV文件、neo4j-import能够对上亿级别数据快速初始化导入、apoc-jdbc能够直连常见关系型数据库mysql\oarcle等获取数据。
为简便使用,本次采集的数据以json格式存放在文件中。导入方式选择读取文件,加载json string并解析。
启动neo4j shell:

  1. $ ./bin/cypher-shell

给User添加索引,提高后续查询效率

  1. CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;
  2. Added 1 constraints
  3. 0 rows available after 205 ms, consumed after another 1 ms

导入用户User Node信息:

  1. USING PERIODIC COMMIT 5000
  2. LOAD CSV FROM 'file:///data/basicinfo.csv' AS row FIELDTERMINATOR '\t'
  3. WITH row
  4. WITH apoc.convert.fromJsonMap( row[0]) AS info WHERE LENGTH(info["d"][info["id"]]) = 1
  5. MERGE (u:User{id:info["id"]})
  6. ON CREATE SET u += apoc.map.clean(info["d"][info["id"]], ['community','roles','cancelBakup'], [])
  7. ;
  8. 0 rows available after 99179 ms, consumed after another 0 ms
  9. Added 500255 nodes, Set 18492458 properties, Added 500255 labels

共导入 500255 个用户信息,共耗时 99 秒,速度还不错,简单核查下结果。

  1. neo4j> MATCH (n:User) RETURN COUNT(n);
  2. +----------+
  3. | COUNT(n) |
  4. +----------+
  5. | 500255 |
  6. +----------+
  7. 1 row available after 10 ms, consumed after another 1 ms

导入用户follow relation信息:

  1. USING PERIODIC COMMIT 5000
  2. LOAD CSV FROM 'file:///data/data.csv' AS row FIELDTERMINATOR '\t'
  3. WITH row
  4. WITH apoc.convert.fromJsonMap(apoc.text.join(row,",")) AS info
  5. UNWIND info["d"] AS elem
  6. WITH
  7. info["type"] AS type,
  8. CASE WHEN info["type"] = "Followee" THEN elem["followerId"] WHEN info["type"] = "Follower" THEN elem["follower"]["objectId"] END AS startId,
  9. CASE WHEN info["type"] = "Followee" THEN elem["followee"]["objectId"] WHEN info["type"] = "Follower" THEN elem["followeeId"] END AS endId ,
  10. elem["createdAtString"] AS createdAtString,
  11. elem["updatedAtString"] AS updatedAtString
  12. MATCH (startNode:User{id:startId}),(endNode:User{id:endId})
  13. MERGE(startNode)-[r:follow]->(endNode)
  14. ON CREATE SET r.createdAtString = createdAtString,
  15. r.updatedAtString = updatedAtString
  16. ;
  17. 0 rows available after 746892 ms, consumed after another 0 ms
  18. Created 7115107 relationships, Set 14230214 properties

共导入 7115107 条用户的关注follow关系,共耗时 746 秒,核查下结果。

  1. neo4j> MATCH p=()-[r:follow]->() RETURN COUNT(r);
  2. +----------+
  3. | COUNT(r) |
  4. +----------+
  5. | 7115107 |
  6. +----------+
  7. 1 row available after 7 ms, consumed after another 0 ms

数据初探

言归正传,在数据备好后,对掘友信息选取一些重要维度进行分析。

掘友用户增量情况

  1. // 每月新增用户和关系情况
  2. MATCH (u) WHERE EXISTS(u.createdAt)
  3. WITH apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM') AS createdAt, COUNT(1) AS num
  4. WITH apoc.map.fromLists(COLLECT(createdAt), COLLECT(num)) AS userMap
  5. MATCH ()<-[r]-()
  6. WITH userMap,r.createdAtString AS createdAtString
  7. WITH userMap,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM') AS createdAtString, COUNT(1) AS num
  8. WITH createdAtString, num AS followNum ,userMap[createdAtString] AS userNum
  9. ORDER BY createdAtString
  10. WITH COLLECT({createdAtString:createdAtString, userNum:userNum ,followNum:followNum} ) AS infoList
  11. UNWIND RANGE(0,LENGTH(infoList)-1) AS index
  12. WITH infoList[index]["createdAtString"] AS createdAtString,
  13. infoList[index]["userNum"] AS userNum,
  14. infoList[index]["followNum"] AS followNum,
  15. apoc.coll.sum([elem IN infoList[..index+1] | elem["userNum"]]) AS userNumSum,
  16. apoc.coll.sum([elem IN infoList[..index+1] | elem["followNum"]]) AS followNumSum
  17. RETURN createdAtString,userNum,followNum,userNumSum,followNumSum,followNumSum/userNumSum AS followPerUser
  18. ORDER BY createdAtString DESC
  19. LIMIT 10
  20. ;
  1. +----------------------------------------------------------------------------------------+
  2. | createdAtString | userNum | followNum | userNumSum | followNumSum | followPerUser |
  3. +----------------------------------------------------------------------------------------+
  4. | "2019-12" | 583 | 3436 | 496639.0 | 7115107.0 | 14.326516846240429 |
  5. | "2019-11" | 10098 | 172500 | 496056.0 | 7111671.0 | 14.336427742029127 |
  6. | "2019-10" | 11957 | 193671 | 485958.0 | 6939171.0 | 14.279363648710383 |
  7. | "2019-09" | 11995 | 202601 | 474001.0 | 6745500.0 | 14.230982635057732 |
  8. | "2019-08" | 12585 | 210558 | 462006.0 | 6542899.0 | 14.16193512638364 |
  9. | "2019-07" | 14042 | 228781 | 449421.0 | 6332341.0 | 14.089998019674203 |
  10. | "2019-06" | 11522 | 196869 | 435379.0 | 6103560.0 | 14.018958195043858 |
  11. | "2019-05" | 12924 | 223838 | 423857.0 | 5906691.0 | 13.935574969860118 |
  12. | "2019-04" | 13592 | 233482 | 410933.0 | 5682853.0 | 13.829147330586737 |
  13. | "2019-03" | 15747 | 283769 | 397341.0 | 5449371.0 | 13.714595272071092 |
  14. +----------------------------------------------------------------------------------------+
  15. 10 rows available after 55237 ms, consumed after another 0 ms

image.png
image.png
从分析结果来看,掘友用户增一直保持着很稳定的态势。同时,随着用户的增多,掘友之间的关系也相对增多,也体现了用户之间的交互更加密切,是用户用户活跃度的简介体现。

元老掘友分析

来看下掘友圈,那些跟随者掘金一起成长的最早的一批元老们。

  1. // 粉丝数、最早注册等统计, 每日新增粉丝
  2. MATCH (u:User)<-[r]-() WHERE EXISTS(u.createdAt) AND EXISTS(r.createdAtString)
  3. WITH u,r.createdAtString AS createdAtString
  4. WITH u,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM-dd') AS createdAtString, COUNT(1) AS num
  5. WITH u, MAX(num ) AS maxNum
  6. RETURN u.username,
  7. apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM-dd') AS createdAt,
  8. apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS days,
  9. u.followersCount AS followersCount,
  10. u.followersCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS aveNum,
  11. maxNum
  12. ORDER BY createdAt
  13. // ORDER BY aveNum DESC
  14. // ORDER BY maxNum DESC
  15. LIMIT 5
  16. ;
  • 最早注册掘金的元老们: ```bash +———————————————————————————————————-+ | u.username | createdAt | days | followersCount | aveNum | maxNum | +———————————————————————————————————-+ | “Ming_Zhe” | “2015-03-24” | 1719 | 20 | 0 | 3 | | “代码家” | “2015-03-24” | 1719 | 140 | 0 | 6 | | “稀土君” | “2015-04-02” | 1709 | 52904 | 30 | 308 | | “江昪” | “2015-04-02” | 1710 | 9110 | 5 | 68 | | “阴明” | “2015-04-02” | 1709 | 74262 | 43 | 286 | | “fskslafhd” | “2015-04-15” | 1697 | 8 | 0 | 2 | | “荆全齐” | “2015-04-16” | 1696 | 96 | 0 | 4 | | “晓风well” | “2015-04-16” | 1696 | 192 | 0 | 6 | | “开源小组” | “2015-04-16” | 1696 | 43 | 0 | 3 | | “Millie_Lin” | “2015-04-16” | 1696 | 7744 | 4 | 67 | +———————————————————————————————————-+
  1. - 最吸粉的大佬们(平均每日新增粉丝),平均一天吸粉近60
  2. ```bash
  3. +--------------------------------------------------------------------------+
  4. | u.username | createdAt | days | followersCount | aveNum | maxNum |
  5. +--------------------------------------------------------------------------+
  6. | "石杉的架构笔记" | "2018-11-05" | 396 | 22877 | 57 | 189 |
  7. | "刘小夕" | "2019-02-12" | 298 | 17086 | 57 | 326 |
  8. | "闲鱼技术" | "2018-04-03" | 613 | 27634 | 45 | 157 |
  9. | "ConardLi" | "2018-11-13" | 389 | 17373 | 44 | 285 |
  10. | "美团技术团队" | "2018-03-30" | 617 | 27576 | 44 | 283 |
  11. | "阴明" | "2015-04-02" | 1709 | 74262 | 43 | 286 |
  12. | "Java3y" | "2018-01-30" | 676 | 29522 | 43 | 144 |
  13. | "腾讯云加社区" | "2017-02-24" | 1016 | 44415 | 43 | 297 |
  14. | "小姐姐味道" | "2018-10-30" | 403 | 17025 | 42 | 157 |
  15. | "ikoala" | "2019-06-01" | 188 | 7550 | 40 | 134 |
  16. +--------------------------------------------------------------------------+
  17. 10 rows available after 55589 ms, consumed after another 0 ms

掘友粉丝分布

  1. // 粉丝分布
  2. MATCH (u:User)
  3. WHERE EXISTS(u.followersCount)
  4. RETURN COUNT(u.followersCount) AS count,
  5. AVG(u.followersCount) AS ave,
  6. percentileDisc(u.followersCount, 0.5) AS `50%`,
  7. percentileDisc(u.followersCount, 0.75) AS `75%`,
  8. percentileDisc(u.followersCount, 0.90) AS `90%`,
  9. percentileDisc(u.followersCount, 0.95) AS `95%`,
  10. percentileDisc(u.followersCount, 0.99) AS `99%`,
  11. percentileDisc(u.followersCount, 0.999) AS `99.9%`,
  12. percentileDisc(u.followersCount, 0.9999) AS `99.99%`,
  13. percentileDisc(u.followersCount, 0.99999) AS `99.999%`,
  14. percentileDisc(u.followersCount, 1) AS `100%`
  15. ;
  1. +---------------------------------------------------------------------------------------------+
  2. | count | ave | 50% | 75% | 90% | 95% | 99% | 99.9% | 99.99% | 99.999% | 100% |
  3. +---------------------------------------------------------------------------------------------+
  4. | 499934 | 15.86762652670079 | 0 | 0 | 1 | 2 | 58 | 3721 | 17025 | 37657 | 74262 |
  5. +---------------------------------------------------------------------------------------------+
  6. 1 row available after 4029 ms, consumed after another 0 ms

可以看到超过75%的掘友还没有粉丝额。要想排在掘友 top 1 内,也至少有58个粉丝。再看下 top1内的大佬吸粉情况:

  1. // 粉丝占比
  2. MATCH (u:User)
  3. WHERE EXISTS(u.followersCount)
  4. WITH
  5. SUM(CASE WHEN u.followersCount <=60 THEN u.followersCount ELSE 0 END) AS low,
  6. SUM(CASE WHEN u.followersCount >60 THEN u.followersCount ELSE 0 END) AS high,
  7. SUM(u.followersCount) AS followersCountSum
  8. RETURN low,high,followersCountSum,
  9. low*1.0/followersCountSum AS lowPercent,
  10. high*1.0/followersCountSum AS highPercent
  11. ;
  1. +---------------------------------------------------------------------------------+
  2. | low | high | followersCountSum | lowPercent | highPercent |
  3. +---------------------------------------------------------------------------------+
  4. | 210930 | 7721836 | 7932766 | 0.02658971662595367 | 0.9734102833740463 |
  5. +---------------------------------------------------------------------------------+
  6. 1 row available after 1399 ms, consumed after another 0 ms

果然,“97% 的粉丝都掌握在 1% 的 大佬手中”,掘友的粉丝贫富差距肉眼可见。

那些高产的大佬们

  1. // 发布文章最多的,发布沸点最多的
  2. MATCH (u:User) WHERE
  3. // EXISTS(u.pinCount)
  4. EXISTS(u.postedPostsCount)
  5. RETURN u.id AS id,
  6. u.username AS username,
  7. u.pinCount AS pinCount,
  8. u.pinCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS pinCountAve,
  9. u.postedPostsCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS postedPostsCountAve,
  10. u.totalCollectionsCount AS totalCollectionsCount,
  11. u.postedPostsCount AS postedPostsCount,
  12. u.totalViewsCount AS totalViewsCount
  13. ORDER BY postedPostsCount DESC
  14. LIMIT 10
  15. ;
  • 发布文章最多的大佬们: ```bash

+——————————————————————————————————————+ | username | postedPostsCountAve | postedPostsCount | totalViewsCount | +——————————————————————————————————————+ | “乌云知识库” | 1 | 1224 | 18030 | | “刘旷” | 1 | 1066 | 8336 | | “千锋JAVA开发” | 1 | 996 | 23818 | | “阿里云云栖社区” | 0 | 932 | 152109 | | “芊宝宝” | 1 | 922 | 19598 | | “LLrZqSg3” | 2 | 913 | 506 | | “go4it” | 0 | 780 | 153028 | | “JerryWang_sap” | 1 | 771 | 7178 | | “腾讯云加社区” | 0 | 759 | 685628 | | “云计算百科” | 1 | 754 | 6543 | +——————————————————————————————————————+

10 rows available after 1745 ms, consumed after another 0 ms

  1. - 发布沸点最多的大佬们
  2. ```bash
  3. +--------------------------------------------------------------+
  4. | username | pinCount | pinCountAve | totalViewsCount |
  5. +--------------------------------------------------------------+
  6. | "网路冷眼" | 19397 | 21 | NULL |
  7. | "湾区日报" | 6836 | 10 | 0 |
  8. | "神奇的命令行" | 1989 | 1 | 108035 |
  9. | "程序员趣事" | 1583 | 2 | 0 |
  10. | "树洞robot" | 1530 | 4 | 0 |
  11. | "HackerNews什么值 | 1510 | 4 | 0 |
  12. | "爱可可-爱生活" | 1484 | 4 | 58 |
  13. | "科技新闻搬运工" | 1476 | 2 | 0 |
  14. | "娱乐小编" | 1283 | 2 | 2761 |
  15. | "王兴的饭否" | 1264 | 1 | 0 |
  16. +--------------------------------------------------------------+
  17. 10 rows available after 726 ms, consumed after another 1 ms

影响力分析挖掘

前面对掘友圈进行了粗略分析,粉丝数、文章数等都是单维度的,那么最有影响力掘友该如何分析,采用pageRank进行建模,如下

  1. // pageRank 影响力挖掘分析
  2. MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
  3. CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
  4. WITH u0,algo.getNodeById(nodeId) AS u, score
  5. WITH u,score, CASE WHEN EXISTS((u0)-[:follow]-(u)) THEN 1 ELSE 0 END AS hasRelation
  6. RETURN
  7. u.username AS username,
  8. score,hasRelation
  9. ORDER BY score DESC
  10. LIMIT 20
  11. ;
  1. +---------------------------------------------------+
  2. | username | score | hasRelation |
  3. +---------------------------------------------------+
  4. | "阴明" | 5492.921815348415 | 0 |
  5. | "HollisChuang" | 3675.980443021097 | 1 |
  6. | "漫话编程" | 3543.827678073035 | 1 |
  7. | "超人汪小建" | 3378.3619996590537 | 1 |
  8. | "稀土君" | 2878.5440236374734 | 1 |
  9. | "stormzhangV" | 2855.733890181873 | 0 |
  10. | "前端外刊评论" | 2841.045798705519 | 1 |
  11. | "liutao" | 2481.8655670162293 | 1 |
  12. | "膜法小编" | 2047.3083251186647 | 1 |
  13. | "LucasHC" | 2030.568352384912 | 0 |
  14. | "NeXT" | 1922.9859172880645 | 1 |
  15. | "李CHENGXI" | 1752.8077864374964 | 1 |
  16. | "水墨寒" | 1528.3845290698114 | 1 |
  17. | "清蒸不是水煮" | 1471.3738727076911 | 1 |
  18. | "蚂蚁金服数据.." | 1287.2166685772129 | 1 |
  19. | "丁一" | 1252.5015809553672 | 1 |
  20. | "美团技术团队" | 1250.5822049211245 | 1 |
  21. | "江昪" | 1234.908881070744 | 1 |
  22. | "腾讯云加社区" | 1169.060374834016 | 1 |
  23. | "薄荷前端" | 1093.3464047224304 | 0 |
  24. +---------------------------------------------------+
  25. 20 rows available after 5093 ms, consumed after another 0 ms
  1. // pageRank 影响力前10 的用户之间的关系
  2. CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
  3. WITH algo.getNodeById(nodeId) AS u, score
  4. WITH u,score ORDER BY score DESC LIMIT 10
  5. WITH COLLECT(u) AS Users
  6. UNWIND Users AS u1
  7. UNWIND Users AS u2
  8. MATCH p=(u1)-[r:follow]-(u2)
  9. RETURN p;

image.png
影响力前10名的的大佬之间,错综复杂,剪不断理还乱的关系,果然,大佬们都是在玩圈子啊。

互粉情况分析

话说,互粉是中华民族的传统美德,来看下,哪些大拿和粉丝互粉互动较多呢?

  1. // 互粉的情况
  2. MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)>id(u2)
  3. MATCH (u1)-[:follow]->(u2)
  4. RETURN COUNT(DISTINCT u1.id+u2.id) AS num
  5. ;
  1. +-------+
  2. | num |
  3. +-------+
  4. | 22732 |
  5. +-------+
  6. 1 row available after 32644 ms, consumed after another 0 ms

可以看出,掘友圈友超过22万互粉用户。

  1. // 自我关注的情况
  2. MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)=id(u2)
  3. RETURN COUNT(DISTINCT u1.id) AS num
  4. ;
  1. +-----+
  2. | num |
  3. +-----+
  4. | 147 |
  5. +-----+
  6. 1 row available after 0 ms, consumed after another 1351 ms

当然,也不乏自我互粉的掘友,你是这147位掘友中的一员吗。

  1. // 互粉最多的用户
  2. MATCH (u1:User)<-[:follow]-(u2:User)// WHERE id(u1)>id(u2)
  3. MATCH (u1)-[:follow]->(u2)
  4. WITH u1,u2
  5. WITH u1,COUNT(u2) AS num
  6. RETURN u1.username,num
  7. ORDER BY num DESC
  8. LIMIT 20
  9. ;
  1. +--------------------------+
  2. | u1.username | num |
  3. +--------------------------+
  4. | "liutao" | 641 |
  5. | "膜法小编" | 449 |
  6. | "阴明" | 367 |
  7. | "黑马UI" | 322 |
  8. | "pilishen" | 315 |
  9. | "拥抱心中的梦想" | 296 |
  10. | "powerzhuye" | 260 |
  11. | "zhennann" | 237 |
  12. | "Mockplus" | 230 |
  13. | "MarvinZhang" | 184 |
  14. | "清蒸不是水煮" | 176 |
  15. | "一颗香菜" | 154 |
  16. | "断天涯大虾" | 152 |
  17. | "爱原型爱设计" | 151 |
  18. | "闻人的技术博客" | 141 |
  19. | "bytedance" | 139 |
  20. | "Bug开发者" | 136 |
  21. | "江昪" | 123 |
  22. | "NervosNetwork" | 118 |
  23. +--------------------------+
  24. 20 rows available after 23 ms, consumed after another 35454 ms

最短路径分析

neo4j 提供了方便快捷的查询给定2个节点之间最短路径函数shortestPath,通过制定

  • 寻找 saiwaiyanyu 和 阴明 在9度关系内的最短路径。 ```bash // 最短路径 MATCH path = shortestPath((u1:User{id:”5a8d0c5df265da4e9d223ba5”})-[:follow*1..9]-(u2:User{id:”551d677ee4b0cd5b623f49cb”})) RETURN path,u1.username,u2.username LIMIT 1;
  1. ![image.png](https://cdn.nlark.com/yuque/0/2021/png/265643/1625537061945-0568b22d-52a8-44e9-a316-6106a13d9dd8.png#clientId=u83db8637-720f-4&from=paste&id=ua7a2234b&margin=%5Bobject%20Object%5D&name=image.png&originHeight=61&originWidth=226&originalType=url&ratio=1&size=10733&status=done&style=none&taskId=u5ce368b3-4ed6-4529-a8a9-9803e0bec9c)
  2. - 寻找 saiwaiyanyu 头条搜索前端团队 的字段路径
  3. ```bash
  4. // 最短路径
  5. MATCH path = shortestPath((u1:User{id:"5a8d0c5df265da4e9d223ba5"})-[:follow*1..9]-(u2:User{id:"5de0f0a0f265da06113c700f"}))
  6. RETURN path,u1.username,u2.username
  7. LIMIT 1

image.png

最长径分析

通过shortestPath能够方便查询给定点之间的最短关系,如果需要查询比较长的关系路径,该如何实现呢。

  • 查询存在6度关系的节点: ```bash

// 最长路径存在6度关系 MATCH path = (u1:User)-[:follow*6..6]-(u2:User) WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2) RETURN path LIMIT 1;

  1. ![image.png](https://cdn.nlark.com/yuque/0/2021/png/265643/1625537061972-c9810e48-0b73-48be-8619-c5a51f813779.png#clientId=u83db8637-720f-4&from=paste&id=uf7c89765&margin=%5Bobject%20Object%5D&name=image.png&originHeight=185&originWidth=538&originalType=url&ratio=1&size=39559&status=done&style=none&taskId=u155e4675-50a5-4d49-8577-8a952da817e)<br />在茫茫人海中,任何2个用户总能找到他们的关联,无论是隔壁邻居的八大姨的三外甥的同学。。。
  2. - 查询存在6度关系的节点,且指定follow方向:
  3. ```bash
  4. // 最长路径存在6度关系
  5. MATCH path = (u1:User)<-[:follow*6..6]-(u2:User)
  6. WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
  7. RETURN path
  8. LIMIT 1
  9. ;

image.png
这一条粉丝关系路径算是比较长了。

  • 粉丝关系形成闭环的情形
    1. // 最长路径MATCH (u:User) WHERE NOT EXISTS( (u)<-[:follow*1..6]-(u) )
    2. MATCH path = (u)<-[:follow*7..7]-(u)
    3. WHERE ALL(u IN [u0 IN FILTER(n IN NODES(path) WHERE ID(n)<>ID(u)) | size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
    4. RETURN u.username,path, LENGTH(path)
    5. LIMIT 1
    6. ;

image.png
follow关系层层传递,发现回到原点 ^_^ .

关于用户推荐

推荐算法比较经典的有基于用户和基于物品,以及今年应用越来越广的基于深度学习模型的推荐。
因采集到的用户数据维度较少,用户的文章、沸点、行为数据等缺失,因此计算用户相似度等存在一些问题。尝试用用户关注关系实现一个简单的推荐。

  • 关注他的用户也关注了

给定用户,通过查询该用户粉丝关注的其他用户,查看粉丝群的共同关注。

  1. MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
  2. MATCH (u0)<-[:follow]-(u1:User)-[:follow]->(g:User)
  3. WHERE NOT ID(u0)=ID(g)
  4. WITH u0, g, COUNT(DISTINCT u1) AS sameFollowNum
  5. RETURN u0.username,u0.id, g.username,g.id, sameFollowNum
  6. ORDER BY sameFollowNum DESC
  7. LIMIT 20
  8. ;
u0.username g.username sameFollowNum
阴明 前端外刊评论 21304
阴明 李CHENGXI 21286
阴明 水墨寒 20161
阴明 稀土君 15042
阴明 liutao 10898
阴明 刘欣 9434
阴明 NeXT 8770
阴明 丁一 8741
阴明 雨神姥爷 8454
阴明 taotao.li 8392
阴明 闲鱼技术 8290
阴明 HollisChuang 8050
阴明 吆喝科技_Zoran 7810
阴明 腾讯云加社区 7554
阴明 杨昕霖 7391
阴明 超人汪小建 7172
阴明 可乐橙 7137
阴明 饿了么UED 7039
阴明 JJ-Ying 7039
阴明 stormzhangV 6910

从结果看,21304 用户既关注了阴明,同时也关注了前端外刊评论。

  • 和我一起关注的用户也关注了他 ```bash // 和类似的用户也关注了 MATCH(u0:User{id:’5a8d0c5df265da4e9d223ba5’}) MATCH (u0)-[:follow]->(:User)<-[:follow]-(u1)-[:follow]->(g:User) WHERE NOT (u0)-[:follow]->(g) WITH u0, g, COUNT(DISTINCT u1) AS sameFollowNum RETURN u0.username, g.username, sameFollowNum ORDER BY sameFollowNum DESC LIMIT 10 ;
  1. | u0.username | g.username | sameFolloweeNum | sameFollowNum |
  2. | --- | --- | --- | --- |
  3. | saiwaiyanyu | 稀土君 | 24 | 23476 |
  4. | saiwaiyanyu | 前端外刊评论 | 23 | 21760 |
  5. | saiwaiyanyu | CHENGXI | 23 | 21662 |
  6. | saiwaiyanyu | 水墨寒 | 23 | 20683 |
  7. | saiwaiyanyu | 腾讯云加社区 | 23 | 18890 |
  8. | saiwaiyanyu | liutao | 24 | 16694 |
  9. | saiwaiyanyu | 超人汪小建 | 24 | 13954 |
  10. | saiwaiyanyu | 闲鱼技术 | 23 | 12828 |
  11. | saiwaiyanyu | Java3y | 24 | 12746 |
  12. | saiwaiyanyu | 美团技术团队 | 24 | 12667 |
  13. 从结果看出有23476 用户,关注了 saiwaiyanyu关注的用户,且这23476也关注了 稀土君,可以尝试将稀土君推荐给saiwaiyanyu
  14. <a name="KoZbK"></a>
  15. ## 代码
  16. ```bash
  17. // 创建主键
  18. CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;
  19. // 导入节点
  20. USING PERIODIC COMMIT 5000
  21. LOAD CSV FROM 'file:///data/basicinfo.csv' AS row FIELDTERMINATOR '\t'
  22. WITH row
  23. WITH apoc.convert.fromJsonMap( row[0]) AS info WHERE LENGTH(info["d"][info["id"]]) = 1
  24. MERGE (u:User{id:info["id"]})
  25. ON CREATE SET u += apoc.map.clean(info["d"][info["id"]], ['community','roles','cancelBakup'], [])
  26. ;
  27. // 导入关系
  28. USING PERIODIC COMMIT 5000
  29. LOAD CSV FROM 'file:///data/data.csv' AS row FIELDTERMINATOR '\t'
  30. WITH row
  31. WITH apoc.convert.fromJsonMap(apoc.text.join(row,",")) AS info
  32. UNWIND info["d"] AS elem
  33. WITH
  34. info["type"] AS type,
  35. CASE WHEN info["type"] = "Followee" THEN elem["followerId"] WHEN info["type"] = "Follower" THEN elem["follower"]["objectId"] END AS startId,
  36. CASE WHEN info["type"] = "Followee" THEN elem["followee"]["objectId"] WHEN info["type"] = "Follower" THEN elem["followeeId"] END AS endId ,
  37. elem["createdAtString"] AS createdAtString,
  38. elem["updatedAtString"] AS updatedAtString
  39. MATCH (startNode:User{id:startId}),(endNode:User{id:endId})
  40. MERGE(startNode)-[r:follow]->(endNode)
  41. ON CREATE SET r.createdAtString = createdAtString,
  42. r.updatedAtString = updatedAtString
  43. ;
  44. // 每月新增用户和关系情况
  45. MATCH (u) WHERE EXISTS(u.createdAt)
  46. WITH apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM') AS createdAt, COUNT(1) AS num
  47. WITH apoc.map.fromLists(COLLECT(createdAt), COLLECT(num)) AS userMap
  48. MATCH ()<-[r]-()
  49. WITH userMap,r.createdAtString AS createdAtString
  50. WITH userMap,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM') AS createdAtString, COUNT(1) AS num
  51. WITH createdAtString, num AS followNum ,userMap[createdAtString] AS userNum
  52. ORDER BY createdAtString
  53. WITH COLLECT({createdAtString:createdAtString, userNum:userNum ,followNum:followNum} ) AS infoList
  54. UNWIND RANGE(0,LENGTH(infoList)-1) AS index
  55. WITH infoList[index]["createdAtString"] AS createdAtString,
  56. infoList[index]["userNum"] AS userNum,
  57. infoList[index]["followNum"] AS followNum,
  58. apoc.coll.sum([elem IN infoList[..index+1] | elem["userNum"]]) AS userNumSum,
  59. apoc.coll.sum([elem IN infoList[..index+1] | elem["followNum"]]) AS followNumSum
  60. RETURN createdAtString,userNum,followNum,userNumSum,followNumSum,followNumSum/userNumSum AS followPerUser
  61. ORDER BY createdAtString DESC
  62. LIMIT 10
  63. ;
  64. // 粉丝数、最早注册等统计, 每日新增粉丝
  65. MATCH (u:User)<-[r]-() WHERE EXISTS(u.createdAt) AND EXISTS(r.createdAtString)
  66. WITH u,r.createdAtString AS createdAtString
  67. WITH u,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM-dd') AS createdAtString, COUNT(1) AS num
  68. WITH u, MAX(num ) AS maxNum
  69. RETURN u.username,
  70. apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM-dd') AS createdAt,
  71. apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS days,
  72. u.followersCount AS followersCount,
  73. u.followersCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS aveNum,
  74. maxNum
  75. // ORDER BY createdAt
  76. ORDER BY aveNum DESC
  77. // ORDER BY maxNum DESC
  78. LIMIT 10
  79. ;
  80. // 粉丝分布
  81. MATCH (u:User)
  82. WHERE EXISTS(u.followersCount)
  83. RETURN COUNT(u.followersCount) AS count,
  84. AVG(u.followersCount) AS ave,
  85. percentileDisc(u.followersCount, 0.5) AS `50%`,
  86. percentileDisc(u.followersCount, 0.75) AS `75%`,
  87. percentileDisc(u.followersCount, 0.90) AS `90%`,
  88. percentileDisc(u.followersCount, 0.95) AS `95%`,
  89. percentileDisc(u.followersCount, 0.99) AS `99%`,
  90. percentileDisc(u.followersCount, 0.999) AS `99.9%`,
  91. percentileDisc(u.followersCount, 0.9999) AS `99.99%`,
  92. percentileDisc(u.followersCount, 0.99999) AS `99.999%`,
  93. percentileDisc(u.followersCount, 1) AS `100%`
  94. ;
  95. // 粉丝占比
  96. MATCH (u:User)
  97. WHERE EXISTS(u.followersCount)
  98. WITH
  99. SUM(CASE WHEN u.followersCount <=60 THEN u.followersCount ELSE 0 END) AS low,
  100. SUM(CASE WHEN u.followersCount >60 THEN u.followersCount ELSE 0 END) AS high,
  101. SUM(u.followersCount) AS followersCountSum
  102. RETURN low,high,followersCountSum,
  103. low*1.0/followersCountSum AS lowPercent,
  104. high*1.0/followersCountSum AS highPercent
  105. ;
  106. // pageRank 影响力挖掘分析
  107. MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
  108. CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
  109. WITH u0,algo.getNodeById(nodeId) AS u, score
  110. WITH u,score, CASE WHEN EXISTS((u0)-[:follow]-(u)) THEN 1 ELSE 0 END AS hasRelation
  111. RETURN
  112. u.username AS username,
  113. score,hasRelation
  114. ORDER BY score DESC
  115. LIMIT 20
  116. ;
  117. // pageRank 影响力前10 的用户之间的关系
  118. CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
  119. WITH algo.getNodeById(nodeId) AS u, score
  120. WITH u,score ORDER BY score DESC LIMIT 10
  121. WITH COLLECT(u) AS Users
  122. UNWIND Users AS u1
  123. UNWIND Users AS u2
  124. MATCH p=(u1)-[r:follow]-(u2)
  125. RETURN p;
  126. // 互粉的情况
  127. MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)>id(u2)
  128. MATCH (u1)-[:follow]->(u2)
  129. RETURN COUNT(DISTINCT u1.id+u2.id) AS num
  130. ;
  131. // 自我互粉的情况
  132. MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)=id(u2)
  133. RETURN COUNT(DISTINCT u1.id) AS num
  134. ;
  135. // 互粉最多的用户
  136. MATCH (u1:User)<-[:follow]-(u2:User)// WHERE id(u1)>id(u2)
  137. MATCH (u1)-[:follow]->(u2)
  138. WITH u1,u2
  139. WITH u1,COUNT(u2) AS num
  140. RETURN u1.username,num
  141. ORDER BY num DESC
  142. LIMIT 20
  143. ;
  144. // 最长路径
  145. MATCH path = (u1:User)<-[:follow*9..9]-(u2:User)
  146. WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
  147. RETURN path, LENGTH(path)
  148. // ORDER BY LENGTH(path) DESC
  149. LIMIT 1
  150. ;
  151. // 发布文章最多的,发布沸点最多的
  152. MATCH (u:User) WHERE
  153. // EXISTS(u.pinCount)
  154. EXISTS(u.postedPostsCount)
  155. RETURN u.id AS id,
  156. u.username AS username,
  157. u.pinCount AS pinCount,
  158. u.pinCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS pinCountAve,
  159. u.postedPostsCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS postedPostsCountAve,
  160. u.totalCollectionsCount AS totalCollectionsCount,
  161. u.postedPostsCount AS postedPostsCount,
  162. u.totalViewsCount AS totalViewsCount
  163. ORDER BY postedPostsCount DESC
  164. LIMIT 10
  165. ;
  166. // 最短路径
  167. MATCH path = shortestPath((u1:User{id:"5a8d0c5df265da4e9d223ba5"})-[:follow*1..9]-(u2:User{id:"551d677ee4b0cd5b623f49cb"}))
  168. RETURN path,u1.username,u2.username
  169. LIMIT 1;
  170. // 最短路径
  171. MATCH path = shortestPath((u1:User{id:"5a8d0c5df265da4e9d223ba5"})-[:follow*1..9]-(u2:User{id:"5de0f0a0f265da06113c700f"}))
  172. RETURN path,u1.username,u2.username
  173. LIMIT 1
  174. ;
  175. // 最长路径
  176. MATCH path = (u1:User)-[:follow*6..6]-(u2:User)
  177. WHERE ALL(u IN [u0 IN NODES(path)| size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
  178. RETURN path
  179. LIMIT 1;
  180. // 最长路径
  181. MATCH (u:User) WHERE NOT EXISTS( (u)<-[:follow*1..6]-(u) )
  182. MATCH path = (u)<-[:follow*7..7]-(u)
  183. WHERE ALL(u IN [u0 IN FILTER(n IN NODES(path) WHERE ID(n)<>ID(u)) | size([m IN FILTER(f IN NODES(path) WHERE ID(f)=ID(u0))])] WHERE u<2)
  184. RETURN u.username,path, LENGTH(path)
  185. // ORDER BY LENGTH(path) DESC
  186. LIMIT 1
  187. ;
  188. // 关注他的用户也关注了
  189. MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
  190. MATCH (u0)<-[:follow]-(u1:User)-[:follow]->(g:User)
  191. WHERE NOT ID(u0)=ID(g)
  192. WITH u0, g, COUNT(DISTINCT u1) AS sameFollowNum
  193. RETURN u0.username,u0.id, g.username,g.id, sameFollowNum
  194. ORDER BY sameFollowNum DESC
  195. LIMIT 20
  196. ;
  197. // 和类似的用户也关注了
  198. MATCH(u0:User{id:'5a8d0c5df265da4e9d223ba5'})
  199. MATCH (u0)-[:follow]->(s:User)<-[:follow]-(u1)-[:follow]->(g:User)
  200. WHERE NOT (u0)-[:follow]->(g)
  201. WITH u0, g,COUNT(DISTINCT s) AS sameFolloweeNum, COUNT(DISTINCT u1) AS sameFollowNum
  202. RETURN u0.username, g.username, sameFolloweeNum,sameFollowNum
  203. ORDER BY sameFollowNum DESC
  204. LIMIT 10
  205. ;


作者:saiwaiyanyu
链接:https://juejin.cn/post/6844904020440514574
来源:掘金
分析代码:github.com/saiwaiyanyu…