本次我们的作业都围绕这个数据来展开,请大家在这里下载:
下载地址1:http://106.75.65.117:12345/xly/weibo_freshdata.2020-04-10.7z
下载地址2:http://117.50.101.146:12345/xly/weibo_freshdata.2020-04-10.7z

大家用7z解压,然后head -20 这个文件,可以看到这个文件的数据格式:
image.png
这里面每条记录是一行,每个数据中间\t隔开(tab键),以第一条记录举例
第一个字段是数据id,每条数据都有唯一编号,都是自增1的,没有重复。例如第一条记录是44168870965

第二个字段是微博用户id,例如第一条记录是5789459810

第三个字段是微博用户昵称,例如第一条记录是andforgettheworld

第四个字段是微博文章id,每个用户发出的每一条微博文章都有一个唯一不重复的id,例如第一条记录是4491937342273181

第五个字段是转发数,例如第一条记录是0

第六个字段是评论数,例如第一条记录是0

那么这第一条记录如果用文字描述就是:编号44168870965记录了一个微博用户(andforgettheworld),其用户id是5789459810 ,发布了一条微博文章id是(4491937342273181)的微博,这条微博的转发数是0,评论数是0

进一步理解这个数据,由于一条数据可能多次抓取,不同抓取时间评论量和转发量不完全相同。那么我们会多次记录同一个微博id多次的转发量,评论量,这样就会在数据中出现weiboid的重复。

例如 grep 4491900012823164 weibo_freshdata.2020-04-10 就会发现人民日报发布的这条微博(4491900012823164),被记录了多次。转发量越来越高直到13618,评论量也是越来越高,直到1993。

再进一步理解这个数据,我们这个是一天之内的微博数据,其中可能有用户改了昵称,也就是用户id是不变的,用户昵称可能会变化。
例如 grep 1619787575 weibo_freshdata.2020-04-10 就会发现微博用户(1619787575)在一天之内改了3次昵称,分别是 喵喵王0609 喵喵的月光曲 和 喵喵的小夜曲。

综合以上认知,我们得到以下结论

  1. 微博用户id是保持不变的,但是昵称可能会多次变化

  2. 微博的文章id可能会在一天记录多次

  3. 转发数和评论数在不同时间记录,数值会发生变化

练习1,用awk写出相应的脚本过程,回答下面问题

  1. 找到这有天转发量最大的微博,这条微博是谁发布的?记录id是多少?
    ```shell

    数据清洗,只留下有完整6列的数据

    gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ awk ‘NF==6 {print $0}’ weibo_freshdata.2020-04-10 > cleanWeiboData

    比较清洗前和清洗后的数据行数

    gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ awk ‘END {print NR}’ weibo_freshdata.2020-04-10 33227542 gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ awk ‘END {print NR}’ cleanWeiboData 33227517

使用awk找出当天转发量最大的微博

gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ awk ‘BEGIN { printf(“data_id\t\tuser_id\t\tusername\tweibo_id\t\t转发量\n”) } $5 > maxRepost { maxLine = $0; maxRepost = $5 } END { print maxLine }’ cleanWeiboData data_id user_id username weibo_id 转发量 44198888474 1792951112 X玖少年团肖战DAYTOY 4492118670464947 1000000 1000000

  1. ```sql
  2. -- 为了作为awk结果的对照,将数据导入到MySQL中并计算出结果,来验证awk程序的正确性
  3. -- 创建表
  4. mysql> create table weibo_data(id bigint, user_id varchar(255), nick_name varchar(255), weibo_id bigint, repost_count bigint, comment_count bigint, primary key(id));
  5. Query OK, 0 rows affected (0.10 sec)
  6. -- 获取默认的文件导入的目录
  7. -- 在另一个命令行行文件拷贝到此目录 sudo cp cleanWeiboData /var/lib/mysql-files/
  8. mysql> show variables like '%secure%';
  9. +--------------------------+-----------------------+
  10. | Variable_name | Value |
  11. +--------------------------+-----------------------+
  12. | require_secure_transport | OFF |
  13. | secure_file_priv | /var/lib/mysql-files/ |
  14. +--------------------------+-----------------------+
  15. 2 rows in set (0.01 sec)
  16. mysql> load data infile '/var/lib/mysql-files/cleanWeiboData' into table weibo_data fields terminated by '\t' lines terminated by '\n';
  17. Query OK, 33227517 rows affected (3 min 51.11 sec)
  18. Records: 33227517 Deleted: 0 Skipped: 0 Warnings: 0
  19. mysql> select * from weibo_data limit 0, 10;
  20. +-------------+------------+---------------------+------------------+--------------+---------------+
  21. | id | user_id | nick_name | weibo_id | repost_count | comment_count |
  22. +-------------+------------+---------------------+------------------+--------------+---------------+
  23. | 44168870965 | 5789459810 | andforgettheworld | 4491937342273181 | 0 | 0 |
  24. | 44168870966 | 6043363363 | ZHOU珍珍珍珍珍 | 4491933394876299 | 0 | 0 |
  25. | 44168870967 | 6006345628 | 云深十三载 | 4491813652077725 | 0 | 0 |
  26. | 44168870968 | 1145846142 | mmmmmwu23 | 4491720819138604 | 0 | 0 |
  27. | 44168870969 | 2350209083 | Ms李小坏 | 4491900935920036 | 0 | 0 |
  28. | 44168870970 | 2350209083 | Ms李小坏 | 4491739303975460 | 0 | 0 |
  29. | 44168870971 | 6208135038 | 马路牙子_- | 4491855154937696 | 0 | 0 |
  30. | 44168870972 | 6208135038 | 马路牙子_- | 4491847319578445 | 0 | 4 |
  31. | 44168870973 | 6208135038 | 马路牙子_- | 4491831553243817 | 0 | 0 |
  32. | 44168870974 | 6208135038 | 马路牙子_- | 4491786510648701 | 0 | 0 |
  33. +-------------+------------+---------------------+------------------+--------------+---------------+
  34. 10 rows in set (0.00 sec)
  35. -- 使用sql找出当天转发量最大的微博
  36. mysql> select id, user_id, nick_name,weibo_id,repost_count from weibo_data as t where repost_count=(select max(repost_count) from weibo_data);
  37. +-------------+------------+---------------------------+------------------+--------------+
  38. | id | user_id | nick_name | weibo_id | repost_count |
  39. +-------------+------------+---------------------------+------------------+--------------+
  40. | 44198888474 | 1792951112 | X玖少年团肖战DAYTOY | 4492118670464947 | 1000000 |
  41. +-------------+------------+---------------------------+------------------+--------------+
  42. 1 row in set (17.40 sec)
  1. 找到转发量最大的前10条微博,评论量最大的前10条微博,记录id是多少,都是谁发布的?
    ```shell

    转发量前10

— 未根据weibo_id去重 gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time cat cleanWeiboData | sort -rn -k5 | awk ‘BEGIN { printf(“data_id\t\tuser_id\t\tusername\tweibo_id\t\trepost_count\n”) } NR<=10 {print $1,$2,$3,$4,$5}’ data_id user_id username weibo_id repost_count 44198888474 1792951112 X玖少年团肖战DAYTOY 4492118670464947 1000000 44201114459 2656274875 央视新闻 4491769939216991 839350 44202033672 1776448504 蔡徐坤 4491775148673510 803749 44198031114 5492443184 UNIQ-王一博 4492135753830523 786617 44191057923 1776448504 蔡徐坤 4491775148673510 592565 44190414327 1776448504 蔡徐坤 4491775148673510 587177 44189129227 1776448504 蔡徐坤 4491775148673510 575744 44172765241 2656274875 央视新闻 4491769939216991 574612 44187289148 1776448504 蔡徐坤 4491775148673510 565045 44199701438 1624923463 华晨宇yu 4492235121277302 557271

real 1m25.320s user 1m4.040s sys 0m6.145s

— 根据weibo_id去重 gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk ‘{print $2,$3,$4,$5}’ cleanWeiboData | LC_ALL=C sort -nr -k4 | awk ‘!a[$3]++{print}’ | head -10 1792951112 X玖少年团肖战DAYTOY 4492118670464947 1000000 2656274875 央视新闻 4491769939216991 839350 1776448504 蔡徐坤 4491775148673510 803749 5492443184 UNIQ-王一博 4492135753830523 786617 1624923463 华晨宇yu 4492235121277302 557271 1776448504 蔡徐坤 4491774955539019 529550 2803301701 人民日报 4491769431856516 417781 6161856218 R1SE-周震南 4492122809840115 383303 1645578093 南风窗 4491886414393947 383274 1273610165 ELLE 4491761814571777 337600

real 0m52.814s user 0m32.054s sys 0m6.328s

评论量前10

— 未根据weibo_id去重 gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time cat cleanWeiboData | sort -rn -k6 | awk ‘BEGIN { printf(“data_id\t\tuser_id\t\tusername\tweibo_id\t\tcomment_count\n”) } NR<=10 {print $1,$2,$3,$4,$6}’ data_id user_id username weibo_id comment_count 44190480982 3549916270 中国消防 4491813736120269 2421638 44188014799 3549916270 中国消防 4491813736120269 2421564 44169530571 3549916270 中国消防 4491813736120269 2421454 44198888474 1792951112 X玖少年团肖战DAYTOY 4492118670464947 1000000 44202033672 1776448504 蔡徐坤 4491775148673510 379997 44202033673 1776448504 蔡徐坤 4491774955539019 310893 44191057923 1776448504 蔡徐坤 4491775148673510 265234 44190414327 1776448504 蔡徐坤 4491775148673510 262927 44189129227 1776448504 蔡徐坤 4491775148673510 257990 44187289148 1776448504 蔡徐坤 4491775148673510 251248

real 1m30.281s user 1m3.488s sys 0m7.359s

— 根据weibo_id去重 gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk ‘{print $2,$3,$4,$6}’ cleanWeiboData | LC_ALL=C sort -nr -k4 | awk ‘!a[$3]++{print}’ | head -10 3549916270 中国消防 4491813736120269 2421638 1792951112 X玖少年团肖战DAYTOY 4492118670464947 1000000 1776448504 蔡徐坤 4491775148673510 379997 1776448504 蔡徐坤 4491774955539019 310893 5492443184 UNIQ-王一博 4492135753830523 216299 2057392851 广西消防 4491813706611014 169244 1826513532 炎亞綸_AaronYan 4491818554828901 126104 1624923463 华晨宇yu 4492235121277302 116820 1645578093 南风窗 4491886414393947 81901 1860563805 罗云熙Leo 4492103340000895 75554

real 0m51.370s user 0m33.702s sys 0m6.309s

  1. ```sql
  2. -- 转发量前10
  3. -- 根据weibo_id去重
  4. mysql> select distinct weibo_id from (select id,user_id,nick_name,weibo_id,repost_count from weibo_data order by repost_count desc limit 0,100) as t limit 0,10;
  5. +------------------+
  6. | weibo_id |
  7. +------------------+
  8. | 4492118670464947 |
  9. | 4491769939216991 |
  10. | 4491775148673510 |
  11. | 4492135753830523 |
  12. | 4492235121277302 |
  13. | 4491774955539019 |
  14. | 4491769431856516 |
  15. | 4492122809840115 |
  16. | 4491886414393947 |
  17. | 4491761814571777 |
  18. +------------------+
  19. 10 rows in set (11.30 sec)
  20. -- 未根据weibo_id去重
  21. mysql> select id,user_id,nick_name,weibo_id,repost_count from weibo_data order by repost_count desc limit 0,10;
  22. +-------------+------------+---------------------------+------------------+--------------+
  23. | id | user_id | nick_name | weibo_id | repost_count |
  24. +-------------+------------+---------------------------+------------------+--------------+
  25. | 44198888474 | 1792951112 | X玖少年团肖战DAYTOY | 4492118670464947 | 1000000 |
  26. | 44201114459 | 2656274875 | 央视新闻 | 4491769939216991 | 839350 |
  27. | 44202033672 | 1776448504 | 蔡徐坤 | 4491775148673510 | 803749 |
  28. | 44198031114 | 5492443184 | UNIQ-王一博 | 4492135753830523 | 786617 |
  29. | 44191057923 | 1776448504 | 蔡徐坤 | 4491775148673510 | 592565 |
  30. | 44190414327 | 1776448504 | 蔡徐坤 | 4491775148673510 | 587177 |
  31. | 44189129227 | 1776448504 | 蔡徐坤 | 4491775148673510 | 575744 |
  32. | 44172765241 | 2656274875 | 央视新闻 | 4491769939216991 | 574612 |
  33. | 44187289148 | 1776448504 | 蔡徐坤 | 4491775148673510 | 565045 |
  34. | 44199701438 | 1624923463 | 华晨宇yu | 4492235121277302 | 557271 |
  35. +-------------+------------+---------------------------+------------------+--------------+
  36. 10 rows in set (14.17 sec)
  37. -- 评论量前10
  38. -- 根据weibo_id去重
  39. mysql> select distinct weibo_id from (select id,user_id,nick_name,weibo_id,comment_count from weibo_data order by comment_count desc limit 0,100) as t limit 0,10;
  40. +------------------+
  41. | weibo_id |
  42. +------------------+
  43. | 4491813736120269 |
  44. | 4492118670464947 |
  45. | 4491775148673510 |
  46. | 4491774955539019 |
  47. | 4492135753830523 |
  48. | 4491813706611014 |
  49. | 4491818554828901 |
  50. | 4492235121277302 |
  51. | 4491886414393947 |
  52. | 4492103340000895 |
  53. +------------------+
  54. 10 rows in set (11.43 sec)
  55. -- 未根据weibo_id去重
  56. mysql> select id,user_id,nick_name,weibo_id,comment_count from weibo_data order by comment_count desc limit 0,10;
  57. +-------------+------------+---------------------------+------------------+---------------+
  58. | id | user_id | nick_name | weibo_id | comment_count |
  59. +-------------+------------+---------------------------+------------------+---------------+
  60. | 44190480982 | 3549916270 | 中国消防 | 4491813736120269 | 2421638 |
  61. | 44188014799 | 3549916270 | 中国消防 | 4491813736120269 | 2421564 |
  62. | 44169530571 | 3549916270 | 中国消防 | 4491813736120269 | 2421454 |
  63. | 44198888474 | 1792951112 | X玖少年团肖战DAYTOY | 4492118670464947 | 1000000 |
  64. | 44202033672 | 1776448504 | 蔡徐坤 | 4491775148673510 | 379997 |
  65. | 44202033673 | 1776448504 | 蔡徐坤 | 4491774955539019 | 310893 |
  66. | 44191057923 | 1776448504 | 蔡徐坤 | 4491775148673510 | 265234 |
  67. | 44190414327 | 1776448504 | 蔡徐坤 | 4491775148673510 | 262927 |
  68. | 44189129227 | 1776448504 | 蔡徐坤 | 4491775148673510 | 257990 |
  69. | 44187289148 | 1776448504 | 蔡徐坤 | 4491775148673510 | 251248 |
  70. +-------------+------------+---------------------------+------------------+---------------+
  71. 10 rows in set (13.63 sec)
  1. 数据集中记录的所有发过微博文章的用户有多少个?
    ```shell gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time cat cleanWeiboData | awk ‘{print $2}’ | sort | uniq -c | wc -l 6550411

real 0m56.054s user 0m57.097s sys 0m3.635s

  1. ```sql
  2. mysql> select count(distinct user_id) from weibo_data;
  3. +-------------------------+
  4. | count(distinct user_id) |
  5. +-------------------------+
  6. | 6550411 |
  7. +-------------------------+
  8. 1 row in set (1 min 52.24 sec)
  1. 转发数和评论数的总和超过1000的微博有多少条(要注意对微博id去重)
    ```shell gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time cat cleanWeiboData | awk ‘$5+$6>1000 {print $4}’ | sort | uniq -c | wc -l 10932

real 0m8.616s user 0m5.719s sys 0m1.729s

  1. ```sql
  2. mysql> select count(distinct weibo_id) from weibo_data where repost_count+comment_count>1000;
  3. +--------------------------+
  4. | count(distinct weibo_id) |
  5. +--------------------------+
  6. | 10932 |
  7. +--------------------------+
  8. 1 row in set (6.35 sec)
  9. -- 此语句效率太低
  10. mysql> select count(*) from (select max(repost_count) as repost_max, max(comment_count) as comment_max from weibo_data group by weibo_id having repost_max+comment_max>1000) as t;
  11. +----------+
  12. | count(*) |
  13. +----------+
  14. | 10932 |
  15. +----------+
  16. 1 row in set (6 min 17.77 sec)
  1. 发布转发数和评论数总和超过1000的微博用户有多少个? ```shell gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk ‘$5+$6>1000 {print $2}’ cleanWeiboData | sort -k1 -nr | uniq | wc -l 6283

real 0m7.112s user 0m5.791s sys 0m0.693s

— 不使用sort gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk ‘$5+$6>1000 {print $2}’ cleanWeiboData | awk ‘!a[$0]++ {print}’ | wc -l 6283

real 0m6.662s user 0m5.721s sys 0m0.806s

  1. 6. 所以记录中,微博id如果重复只取最大的转发数和最大的评论数。那么这一天的所有数据中,总的转发量是多大?,总的评论量是多大?
  2. 例如:4491900012823164
  3. 这条人民日报发布的微博,转发量用13618来进行统计;评论量用1993来进行统计
  4. ```shell
  5. # 总转发量
  6. gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '$5>0 {print $0}' cleanWeiboData | LC_ALL=C sort -rnk4 -nrk5 | awk 'BEGIN {tmp=0;sum=0} tmp!=$4 {tmp=$4;sum+=$5} END{print sum}'
  7. 64582729
  8. real 0m9.001s
  9. user 0m9.621s
  10. sys 0m1.925s
  11. # 总评论量
  12. gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '$6>0 {print $0}' cleanWeiboData | LC_ALL=C sort -rnk4 -nrk6 | awk 'BEGIN {tmp=0;sum=0} tmp!=$4 {tmp=$4;sum+=$6} END{print sum}'
  13. 50014190
  14. real 0m13.250s
  15. user 0m14.639s
  16. sys 0m2.952s
  17. # 优化前的几次失败尝试
  18. # 使用数组去重耗时较大
  19. gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '$5>0 {print $0}' cleanWeiboData | LC_ALL=C sort -rnk4 -rnk5 | awk '!a[$4]++ {sum+=$5} END{print sum}'
  20. 64582729
  21. real 0m32.831s
  22. user 0m34.501s
  23. sys 0m2.927s
  24. # 未过滤转发数为0的微博(指定内存)
  25. gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time cat cleanWeiboData | LC_ALL=C sort -S 30% -rnk4 -nrk5 | awk 'BEGIN {tmp=0;sum=0} tmp!=$4 {tmp=$4;sum+=$5} END{print sum}'
  26. 64582729
  27. real 0m46.639s
  28. user 1m31.297s
  29. sys 0m5.187s
  30. # 不指定内存
  31. gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time cat cleanWeiboData | LC_ALL=C sort -rnk4 -nrk5 | awk 'BEGIN {tmp=0;sum=0} tmp!=$4 {tmp=$4;sum+=$5} END{print sum}'
  32. 64582729
  33. real 1m17.305s
  34. user 1m3.241s
  35. sys 0m10.438s
  1. -- 首先验证人民日报那条微博的数据
  2. mysql> select weibo_id, max(repost_count) as repost_max, max(comment_count) as comment_max from weibo_data where weibo_id=4491900012823164 group by weibo_id;
  3. +------------------+------------+-------------+
  4. | weibo_id | repost_max | comment_max |
  5. +------------------+------------+-------------+
  6. | 4491900012823164 | 13618 | 1993 |
  7. +------------------+------------+-------------+
  8. 1 row in set (6.15 sec)
  9. mysql> select sum(repost_max) as repost_sum_all,sum(comment_max) as comment_sum_all from (select weibo_id, max(repost_count) as repost_max, max(comment_count) as comment_max from weibo_data group by weibo_id) as t;
  10. +----------------+-----------------+
  11. | repost_sum_all | comment_sum_all |
  12. +----------------+-----------------+
  13. | 64582729 | 50014190 |
  14. +----------------+-----------------+
  15. 1 row in set (3 min 21.37 sec)
  1. 找到所有微博用户在这一天修改过昵称,并按照修改次数从大到小排列,并列出,都使用过哪些昵称。例如:

第一列是修改次数,第二列是微博用户id,第三列以后是曾经用过的用户昵称,如下图:
image.png

  1. gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '{print $2,$3}' cleanWeiboData | LC_ALL=C sort -rnk1 -k2 | awk 'BEGIN{user_id=0;changes=1;name="";names=""} {if(user_id==$1) {if(name!=$2) {changes+=1;names=names"\t"$2} user_id=$1;name=$2} else {if(changes>1) {print changes,user_id,names} user_id=$1;name=$2;changes=1;names=$2}}' | sort -rnk1 -nk2 | head -10
  2. 3 7279112618 任小希Allen 任嘉伦的灿希儿 Allen-灿希儿
  3. 3 7131367461 樱雾JKjk 体育弟弟 JK樱雾
  4. 3 7021000672 7021000672_444 上海夜场ktv模特领队轩伟 上海夜场ktv模特总监轩伟
  5. 3 6989447609 娱圈芙楠 圈娱芙楠 YZ_度楠
  6. 3 6880318695 肖个别同学 独角兽会摔跤 守护兔鹅瓜的红红
  7. 3 6483135255 清酒玫瑰- 丨清酒玫瑰 -清酒玫瑰-
  8. 3 6299342317 欧气的莫莫尼 欧气满满的锦鲤莫莫尼耶 欧气满满的莫莫尼耶
  9. 3 6282180382 头给你拧掉呢 你永远得不到的爸爸耶 DirtyOO999
  10. 3 6118404813 鸠瑶DAYTOY 鸠瑶-呆桃 鸠瑶
  11. 3 5533178626 大头星人她很酷 Jiiiii咔嘣脆脆鲨 Biiiii咔嘣脆脆鲨
  12. real 0m54.121s
  13. user 0m53.150s
  14. sys 0m6.786s
-- 一说正确结果是1494条,那是因为在数据清洗的时候是把每一行的字段数大于等于6的全部算正确的,其实有4条数据的字段数大于6
mysql> select count(*) from (select count(distinct nick_name) as nick_name_changes, user_id from weibo_data group by user_id having nick_name_changes>1) as t;
+----------+
| count(*) |
+----------+
|     1490 |
+----------+
1 row in set (9 min 59.70 sec)

mysql> select count(distinct nick_name) as nick_name_changes, user_id from weibo_data group by user_id having nick_name_changes>1 order by nick_name_changes desc limit 0,20;
+-------------------+------------+
| nick_name_changes | user_id    |
+-------------------+------------+
|                 3 | 1619787575 |
|                 3 | 1803176071 |
|                 3 | 2149595787 |
|                 3 | 2629769851 |
|                 3 | 3029140585 |
|                 3 | 5533178626 |
|                 3 | 6118404813 |
|                 3 | 6282180382 |
|                 3 | 6299342317 |
|                 3 | 6483135255 |
|                 3 | 6880318695 |
|                 3 | 6989447609 |
|                 3 | 7021000672 |
|                 3 | 7131367461 |
|                 3 | 7279112618 |
|                 2 | 1001732182 |
|                 2 | 1005043822 |
|                 2 | 1012538504 |
|                 2 | 1013091200 |
|                 2 | 1039726134 |
+-------------------+------------+
20 rows in set (7 min 19.62 sec)
  1. 发现用户昵称长度的分布规律,统计不同字数的用户人数。得到如下数据中间\t隔开。用户修改昵称的情况不用考虑,否则会很复杂,多个昵称的用户,任意选择一个昵称来进行统计即可。

格式:字数 \t 用户数,例如
image.png
其中5个字的用户最多有98万多人。

# 如果尝试设置各种编码结果还是不对,安装gawk
gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '{print $2,$3}' cleanWeiboData | LC_ALL=C sort -rnk1 | awk 'BEGIN {tmp=""} tmp!=$1 {tmp=$1; print length($2)}' | awk '{num[$1]++} END{for(k in num) print k"\t"num[k]}' | sort -rnk2 | head -10
5    982286
7    974177
6    934853
4    857149
8    706429
9    509443
10    377528
3    330249
11    258180
12    223903

real    0m50.910s
user    0m47.907s
sys    0m3.512s
  1. 按照发文数量做一个发文用户排行榜,按照发文数量倒叙排列,第一列是用户id,第二列是发文数量。

注意要对weiboid去重。例如一个用户只发了一条微博,但是由于我们数据记录了多次,从而出现了weiboid的重复,需要去掉这种重复后,判断用户真正发文的数量。例如:
数据中用户id是123,456分别发布了weiboid为 1122,2233,3344,5566,7788,这5条,但是数据中由于weiboid的重复记录出现如下情况
123 1122
456 2233
123 3344
456 2233
456 5566
456 7788
123 3344
其中红色和绿色部分为记录中的重复,需要去掉其中的一次。实际上,123这个用户只发布了1122,3344这两条微博,而456,只发布了2233,5566,7788。那么统计的结果应该去重后的结果:
123 2
456 3

而不是
123 3
456 4
(这里如果还没整明白可以找助教和我问)

gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '{print $2,$4}' cleanWeiboData | LC_ALL=C sort -nk1 -nk2 -u | awk 'BEGIN{tmp;count=0} {if(tmp!=0 && $1!=tmp) {print tmp,count;count=0;} tmp=$1;count++} END{print tmp,count}' | sort -rnk2 | head -10
6331476961 32072
6997634787 14631
6550222341 10777
7047271339 10768
7348497948 10000
6497176672 9735
6542428700 8358
6456772530 7663
6516786482 7467
6666331937 7440

real    0m58.484s
user    1m2.821s
sys     0m5.799s
-- 运行时间太长,待优化
mysql> select user_id, weibo_count from (select user_id, count(distinct weibo_id) as weibo_count from weibo_data group by user_id) as t order by weibo_count desc limit 0, 20;
+------------+-------------+
| user_id    | weibo_count |
+------------+-------------+
| 6331476961 |       32072 |
| 6997634787 |       14631 |
| 6550222341 |       10777 |
| 7047271339 |       10768 |
| 7348497948 |       10000 |
| 6497176672 |        9735 |
| 6542428700 |        8358 |
| 6456772530 |        7663 |
| 6516786482 |        7467 |
| 6666331937 |        7440 |
| 6681027896 |        6840 |
| 6499996656 |        6260 |
| 7034362395 |        5655 |
| 7312935480 |        5034 |
| 6207529154 |        5011 |
| 6522040326 |        4808 |
| 7302170015 |        4745 |
| 5057339785 |        4675 |
| 5056769900 |        4617 |
| 6637410219 |        4524 |
+------------+-------------+
20 rows in set (9 min 25.66 sec)
  1. 第四题的输出结果中改成第一列是用户id,第二列是用户昵称,第三列是发文数量。(选作,时间来不及可不做)

注意用户id可能有多个昵称,任意选择一个昵称即可。

gauss@gauss-XPS-15-9570:~/workspace/eightFriends$ time awk '$5+$6>1000 {print $2,$3,$4}' cleanWeiboData | LC_ALL=C sort -nk1 -nk3 -u | awk '{count[$1]++;name[$1]=$2} END{for(i in count){print i,name[i],count[i]}}' | sort -rnk3 | head -10
2803301701 人民日报 84
2482557597 韩国me2day 62
2656274875 央视新闻 55
5583227444 千家无名氏 52
1742566624 思想聚焦 51
1216826604 wu2198 42
1618051664 头条新闻 39
5103645868 星闻揭秘 38
5837848666 华晨宇onepiece战斗组 36
1886903325 芒果娱乐 36

real    0m6.016s
user    0m5.620s
sys     0m0.424s