What is spark-shell

Spark-shell is the interactive command-line shell that Spark provides (similar to Scala's REPL).
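As a quick sketch of what a session looks like (the values below are simply what the REPL prints for these particular inputs; `sc` is the SparkContext that the shell creates for you, as shown in the startup output later):

  scala> 1 + 1
  res0: Int = 2

  scala> sc.parallelize(1 to 5).sum
  res1: Double = 15.0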

Preparing the data

Create a demo folder.
Inside the demo folder, create two txt files.
The data in the files is separated by spaces, not by tabs.

  [root@zjj101 soft]# cat demo1.txt
  zjj AAA
  BBB AAA
  jj ZJJ
  [root@zjj101 soft]# cat demo2.txt
  ZZZ zzz
  xxx yyy
  Zjj
  zjj
  [root@zjj101 soft]#

Running spark-shell

Run the bin/spark-shell command from the Spark root directory.

  [root@zjj101 spark-2.1.1-bin-hadoop2.7]# bin/spark-shell

Console output

  [root@zjj101 spark-2.1.1-bin-hadoop2.7]# bin/spark-shell
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  20/10/29 17:24:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  20/10/29 17:24:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
  20/10/29 17:24:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
  20/10/29 17:24:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
  # Web UI address
  Spark context Web UI available at http://172.16.10.101:4040
  # A SparkContext is created and exposed as 'sc'; master = local[*] means local mode using
  # all CPU cores of the machine (see the launch note after this output for overriding it).
  Spark context available as 'sc' (master = local[*], app id = local-1603963471476).
  Spark session available as 'spark'.
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/ '_/
     /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
        /_/

  Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
  Type in expressions to have them evaluated.
  Type :help for more information.
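By default the shell uses all local CPU cores (master = local[*], as annotated above). If you want to limit that, the master can be set explicitly when launching the shell; the local[2] below is only an illustrative value:

  [root@zjj101 spark-2.1.1-bin-hadoop2.7]# bin/spark-shell --master local[2]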

Visit http://172.16.10.101:4040 in a browser:
(Figure 1: the Spark Web UI)

Running the wordcount program

Execute inside spark-shell:

  scala> sc.textFile("../demo/").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect
  res0: Array[(String, Int)] = Array((zjj,2), (jj,1), (BBB,1), (xxx,1), (zzz,1), (Zjj,1), (ZZZ,1), (ZJJ,1), (yyy,1), (AAA,2))

We can see that the word count has been computed.

Wordcount submission flow

(Figure 2 and Figure 3: wordcount submission flow diagrams)
1. textFile("../demo/"): reads the data files under the local demo folder;
2. flatMap(_.split(" ")): flattening operation; splits each line on the space separator into individual words;
3. map((_, 1)): operates on each element, mapping every word to a tuple (word, 1);
4. reduceByKey(_ + _): aggregates the values by key, adding them up;
5. collect: collects the data to the Driver for display. All of the preceding operations are lazy and do not execute immediately; execution only starts once collect is called (see the sketch below).
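The same pipeline can also be written out with named intermediate RDDs to make the lazy evaluation visible; this is only a sketch, reusing the demo path from above:

  // Each transformation below only builds up the RDD lineage; nothing is read or computed yet.
  val lines  = sc.textFile("../demo/")          // RDD[String], one element per line
  val words  = lines.flatMap(_.split(" "))      // RDD[String], one element per word
  val pairs  = words.map((_, 1))                // RDD[(String, Int)], each word paired with 1
  val counts = pairs.reduceByKey(_ + _)         // RDD[(String, Int)], counts summed per word
  val result = counts.collect()                 // Action: runs the job and returns an Array to the Driver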