Deployment

Host Planning

|        | NN | JN | ZKFC | ZK | DN | RM | NM | hive server | hive cli | spark sql | spark server2 | spark beeline |
|--------|----|----|------|----|----|----|----|-------------|----------|-----------|---------------|---------------|
| node01 | *  | *  | *    |    |    | *  |    |             |          |           | *             |               |
| node02 | *  | *  | *    | *  | *  | *  | *  |             |          | *         |               | *             |
| node03 |    | *  |      | *  | *  |    | *  | *           |          |           |               |               |
| node04 |    |    |      | *  | *  |    | *  |             | *        |           |               |               |

In the original Hadoop cluster the ResourceManagers ran on node03 and node04.

Standalone

master: node01, node02
worker: node02, node03, node04

On node01:

cd /opt
tar xzvf spark-2.3.4-bin-hadoop2.6.tgz
mv spark-2.3.4-bin-hadoop2.6 spark-2.3.4
# configure the environment variables, then:
cd /opt/spark-2.3.4/conf

# configure the workers
mv slaves.template slaves
vi slaves
node02
node03
node04

# environment settings
vi spark-env.sh
export HADOOP_CONF_DIR=/opt/bigdata/hadoop-2.6.5/etc/hadoop
export SPARK_MASTER_HOST=node01    # on each master node, set this to that node's own hostname
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4g

# configure high availability and event logging
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url node02:2181,node03:2181,node04:2181
spark.deploy.zookeeper.dir /bryanspark
spark.eventLog.enabled true
spark.eventLog.dir hdfs://mycluster/spark_log
spark.history.fs.logDirectory hdfs://mycluster/spark_log

# copy the installation to the other hosts
scp -r spark-2.3.4 node02:`pwd`
scp -r spark-2.3.4 node03:`pwd`
scp -r spark-2.3.4 node04:`pwd`

On node01:
start-dfs.sh        # start HDFS only
sbin/start-all.sh   # start the master and all workers

On node02:
sbin/start-master.sh   # start the standby master

Test from node01:
spark-shell --master spark://node01:7077,node02:7077
scala> sc.textFile("hdfs://mycluster/usr/bryan/wc.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
(jim,1)
(bryan,2)
(lily,1)
(peter,2)
(hello,2)
(msb,2)
(pp,1)
(jiam,1)

collect() is required to pull the results back to the driver; without it nothing is printed.
spark-shell can only run in client mode.
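The wc.txt pipeline above is the classic word count: flatMap splits each line into words, map pairs every word with 1, and reduceByKey sums the counts. A pure-Python sketch of the same dataflow, with a few sample lines standing in for the HDFS file:

```python
from collections import Counter

def word_count(lines):
    """flatMap(_.split(" ")) + map((_, 1)) + reduceByKey(_ + _), in plain Python."""
    words = (w for line in lines for w in line.split())  # flatMap: lines -> words
    return Counter(words)                                # map + reduceByKey: word -> count

# sample input standing in for hdfs://mycluster/usr/bryan/wc.txt (contents assumed)
lines = ["hello bryan", "hello msb", "bryan msb"]
for pair in word_count(lines).items():
    print(pair)
```

Spark runs the same logic, but partitioned across executors with a shuffle before the reduce step.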

spark-shell

Checking the roles on each node.
The SparkSubmit process (the driver) runs on whichever node spark-shell was started from.

node01
[root@node01 ~]# jps
4848 SparkSubmit
3569 ResourceManager
4451 Master
3460 DFSZKFailoverController
3289 JournalNode
5066 Jps
3087 NameNode

node02
[root@node02 spark-2.3.4]# jps
2368 QuorumPeerMain
3456 Worker
2626 DataNode
4035 Jps
3572 Master
2839 DFSZKFailoverController
2922 NodeManager
3114 ResourceManager
2555 NameNode
3867 CoarseGrainedExecutorBackend
2717 JournalNode

node03
[root@node03 ~]# jps
2196 QuorumPeerMain
2406 JournalNode
3334 Jps
3143 CoarseGrainedExecutorBackend
2504 NodeManager
2317 DataNode
2877 Worker

node04
[root@node04 ~]# jps
2212 DataNode
2326 NodeManager
2103 QuorumPeerMain
2920 CoarseGrainedExecutorBackend
3131 Jps
2670 Worker

spark-submit client

# create a submit script, using the official example
vi submit.sh
#!/bin/bash
jar=$SPARK_HOME/examples/jars/spark-examples_2.11-2.3.4.jar
master=spark://node01:7077,node02:7077
#master=yarn
mode=client
class=org.apache.spark.examples.SparkPi
$SPARK_HOME/bin/spark-submit \
--master $master \
--deploy-mode $mode \
--class $class \
$jar \
1000

On node03:
./submit.sh
The SparkSubmit process now runs on node03.

spark-submit cluster

mode=cluster

On node03:
./submit.sh
In cluster mode the client submits the application to the cluster and exits; the cluster then starts the driver on one of its nodes and the job keeps running there.

[root@node01 spark-2.3.4]# jps
3569 ResourceManager
10755 Jps
3460 DFSZKFailoverController
3289 JournalNode
7438 Master
3087 NameNode

[root@node02 spark-2.3.4]# jps
2368 QuorumPeerMain
10961 CoarseGrainedExecutorBackend
2626 DataNode
9411 Master
6854 HistoryServer
2839 DFSZKFailoverController
5960 Worker
2922 NodeManager
3114 ResourceManager
2555 NameNode
11004 Jps
2717 JournalNode

[root@node03 ~]# jps
5105 Worker
9522 SparkSubmit   // the client submit process
2196 QuorumPeerMain
2406 JournalNode
9591 Jps
2504 NodeManager
9577 CoarseGrainedExecutorBackend
2317 DataNode

[root@node04 ~]# jps
8192 DriverWrapper   // the driver
2212 DataNode
2326 NodeManager
2103 QuorumPeerMain
4666 Worker
8250 CoarseGrainedExecutorBackend
8266 Jps

[root@node04 ~]# jps
2212 DataNode
2326 NodeManager
2103 QuorumPeerMain
4666 Worker
8367 Jps

Memory and CPU Parameters

--deploy-mode
--driver-memory 1024m     # driver memory; the driver must have enough to pull computed
                          # results back into its own JVM, and the driver JVM also stores
                          # metadata about the executors' intermediate results
--driver-cores            # cluster mode only; default 1
--total-executor-cores    # standalone only
--executor-cores 1        # standalone: by default each node runs one executor that takes
                          # all of the node's cores; setting this makes a node run several
                          # executors. On YARN the default is 1 core per executor.
--executor-memory 1024m
--num-executors NUM       # YARN only; default 2

Example: standalone with 3 workers, assuming enough CPU, requesting 6 cores in total:
--total-executor-cores 6                      -> one executor per node, 2 cores each
--total-executor-cores 6 --executor-cores 1   -> two executors per node, 1 core each
History

As soon as the master restarts, the job history is gone; hence the history server.

vi spark-defaults.conf
hdfs dfs -mkdir -p /spark_log
spark.eventLog.enabled true
spark.eventLog.dir hdfs://mycluster/spark_log             # where logs are written; create the directory beforehand
spark.history.fs.logDirectory hdfs://mycluster/spark_log  # where the history server reads from
a) after changing the configuration, always distribute it and restart the services
b) the compute layer writes its own job logs into HDFS
c) start manually with ./sbin/start-history-server.sh, which serves the stored logs
d) browse to the hostname of the machine running the history server; the default port is 18080

Spark on Yarn

spark-env.sh:
export HADOOP_CONF_DIR=/opt/bigdata/hadoop-2.6.5/etc/hadoop

spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://mycluster/spark_log
spark.history.fs.logDirectory hdfs://mycluster/spark_log
spark.yarn.jars hdfs://mycluster/work/spark_lib/jars/*

# upload Spark's jars to HDFS; otherwise YARN re-uploads them on every submission
hdfs dfs -mkdir -p /work/spark_lib/jars
cd /opt/spark-2.3.4/jars
hdfs dfs -put ./* /work/spark_lib/jars

Without spark.yarn.jars configured:
[root@node01 conf]# hdfs dfs -ls /user/root/.sparkStaging/application_1627546735806_0006
Found 3 items
-rw-r--r-- 2 root supergroup 190850 2021-07-29 16:34 /user/root/.sparkStaging/application_1627546735806_0006/__spark_conf__.zip
-rw-r--r-- 2 root supergroup 232507850 2021-07-29 16:34 /user/root/.sparkStaging/application_1627546735806_0006/__spark_libs__5988972910506987504.zip
-rw-r--r-- 2 root supergroup 1991004 2021-07-29 16:34 /user/root/.sparkStaging/application_1627546735806_0006/spark-examples_2.11-2.3.4.jar

With it configured:
[root@node01 conf]# hdfs dfs -ls /user/root/.sparkStaging/application_1627546735806_0005
Found 2 items
-rw-r--r-- 2 root supergroup 221760 2021-07-29 16:33 /user/root/.sparkStaging/application_1627546735806_0005/__spark_conf__.zip
-rw-r--r-- 2 root supergroup 1991004 2021-07-29 16:33 /user/root/.sparkStaging/application_1627546735806_0005/spark-examples_2.11-2.3.4.jar

client mode:  ExecutorLauncher
cluster mode: ApplicationMaster
spark-shell supports client mode only.
spark-submit for non-REPL jobs can use either client or cluster mode.

yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

client

# create a submit script, using the official example
vi submit.sh
#!/bin/bash
jar=$SPARK_HOME/examples/jars/spark-examples_2.11-2.3.4.jar
#master=spark://node01:7077,node02:7077
master=yarn
mode=client
class=org.apache.spark.examples.SparkPi
$SPARK_HOME/bin/spark-submit \
--master $master \
--deploy-mode $mode \
--class $class \
$jar \
1000

On node03:
./submit.sh

Checking roles

[root@node01 spark-2.3.4]# jps
3460 DFSZKFailoverController
15320 Jps
3289 JournalNode
14109 ResourceManager
3087 NameNode

[root@node02 spark-2.3.4]# jps
2368 QuorumPeerMain
2626 DataNode
6854 HistoryServer
2839 DFSZKFailoverController
16505 Jps
2555 NameNode
2717 JournalNode
15117 NodeManager
15214 ResourceManager

[root@node03 ~]# jps
13970 CoarseGrainedExecutorBackend
2196 QuorumPeerMain
12773 NodeManager
2406 JournalNode
13768 SparkSubmit   // the driver
2317 DataNode
14030 Jps

[root@node04 ~]# jps
2212 DataNode
2103 QuorumPeerMain
12137 CoarseGrainedExecutorBackend
12186 Jps
11148 NodeManager
12095 ExecutorLauncher   // the ApplicationMaster

cluster

Change mode to cluster.

Checking roles

[root@node01 opt]# jps
3460 DFSZKFailoverController
16552 Jps
3289 JournalNode
14109 ResourceManager
3087 NameNode

[root@node02 spark-2.3.4]# jps
2368 QuorumPeerMain
2626 DataNode
6854 HistoryServer
2839 DFSZKFailoverController
18010 Jps
2555 NameNode
2717 JournalNode
15117 NodeManager
15214 ResourceManager

[root@node03 ~]# jps
2196 QuorumPeerMain
12773 NodeManager
15141 SparkSubmit   // the client
2406 JournalNode
15222 Jps
2317 DataNode

[root@node04 ~]# jps
13426 ApplicationMaster   // the driver and the ApplicationMaster share this one process
13474 Jps
2212 DataNode
2103 QuorumPeerMain
11148 NodeManager

SparkSQL

Copy hive-site.xml into Spark's conf directory.
hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node03:9083</value>
</property>

hive metastore

On node03:

hive --service metastore

hive cli

On node04:
hive

spark sql cli

spark-sql --master yarn

spark thrift-server

Equivalent to HiveServer2.
On node01:

$SPARK_HOME/sbin/start-thriftserver.sh --master yarn

beeline
On node02:

$SPARK_HOME/bin/beeline
!connect jdbc:hive2://node01:10000

Partitioning and Bucketing Experiment

Data preparation (fields are tab-separated in the actual files, shown here with spaces)

part1.txt
1 小明1 lol,book,movie beijing:mashibing,shanghai:pudong
2 小明2 lol,book,movie beijing:mashibing,shanghai:pudong
3 小明3 lol,book,movie beijing:mashibing,shanghai:pudong
4 小明4 lol,book,movie beijing:mashibing,shanghai:pudong
5 小明5 lol,movie beijing:mashibing,shanghai:pudong
6 小明6 lol,book,movie beijing:mashibing,shanghai:pudong
7 小明7 lol,book beijing:mashibing,shanghai:pudong
8 小明8 lol,book beijing:mashibing,shanghai:pudong
9 小明9 lol,book,movie beijing:mashibing,shanghai:pudong
10 小明10 lol,book,skill beijing:mashibing,shanghai:pudong

part2.txt
11 小明11 lol,book,movie beijing:mashibing,shanghai:pudong
12 小明12 lol,book,movie beijing:mashibing,shanghai:pudong
13 小明13 lol,book,movie beijing:mashibing,shanghai:pudong
14 小明14 lol,book,movie beijing:mashibing,shanghai:pudong
15 小明15 lol,movie beijing:mashibing,shanghai:pudong
16 小明16 lol,book,movie beijing:mashibing,shanghai:pudong
17 小明17 lol,book beijing:mashibing,shanghai:pudong
18 小明18 lol,book beijing:mashibing,shanghai:pudong
19 小明19 lol,book,movie beijing:mashibing,shanghai:pudong
20 小明20 lol,book,skill beijing:mashibing,shanghai:pudong

part3.txt
21 小明21 lol,book,movie beijing:mashibing,shanghai:pudong
22 小明22 lol,book,movie beijing:mashibing,shanghai:pudong
23 小明23 lol,book,movie beijing:mashibing,shanghai:pudong
24 小明24 lol,book,movie beijing:mashibing,shanghai:pudong
25 小明25 lol,movie beijing:mashibing,shanghai:pudong
26 小明26 lol,book,movie beijing:mashibing,shanghai:pudong
27 小明27 lol,book beijing:mashibing,shanghai:pudong
28 小明28 lol,book beijing:mashibing,shanghai:pudong
29 小明29 lol,book,movie beijing:mashibing,shanghai:pudong
30 小明30 lol,book,skill beijing:mashibing,shanghai:pudong

part4.txt
41 小明41 lol,book,movie beijing:mashibing,shanghai:pudong
42 小明42 lol,book,movie beijing:mashibing,shanghai:pudong
43 小明43 lol,book,movie beijing:mashibing,shanghai:pudong
44 小明44 lol,book,movie beijing:mashibing,shanghai:pudong
45 小明45 lol,movie beijing:mashibing,shanghai:pudong
46 小明46 lol,book,movie beijing:mashibing,shanghai:pudong
47 小明47 lol,book beijing:mashibing,shanghai:pudong
48 小明48 lol,book beijing:mashibing,shanghai:pudong
49 小明49 lol,book,movie beijing:mashibing,shanghai:pudong
50 小明50 lol,book,skill beijing:mashibing,shanghai:pudong

Partitioning only

Managed (internal) table

create table part_in (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
partitioned by (data_dt string, channel int)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

Partition columns are virtual: they do not exist in the data files, but Hive shows them as columns in query results.

Loading from a local path copies the files into HDFS:
load data local inpath '/opt/part1.txt' into table part_in partition (data_dt='20210701',channel=110);
load data local inpath '/opt/part2.txt' into table part_in partition (data_dt='20210701',channel=120);
load data local inpath '/opt/part3.txt' into table part_in partition (data_dt='20210702',channel=110);
load data local inpath '/opt/part4.txt' into table part_in partition (data_dt='20210702',channel=120);

Loading from an HDFS path moves the files into the Hive warehouse:
load data inpath '/data/part1.txt' into table part_in partition (data_dt='20210701',channel=110);

Inserting rows directly with insert into ... values is inefficient: every insert produces another copy file, such as
000000_0_copy_1
000000_0_copy_2

Query

hive> select * from part_in;
OK
1 小明1 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
2 小明2 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
3 小明3 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
4 小明4 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
5 小明5 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
6 小明6 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
7 小明7 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
8 小明8 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
9 小明9 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
10 小明10 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
11 小明11 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
12 小明12 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
13 小明13 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
14 小明14 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
15 小明15 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
16 小明16 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
17 小明17 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
18 小明18 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
19 小明19 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
20 小明20 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
21 小明21 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
22 小明22 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
23 小明23 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
24 小明24 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
25 小明25 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
26 小明26 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
27 小明27 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
28 小明28 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
29 小明29 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
30 小明30 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
41 小明41 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
42 小明42 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
43 小明43 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
44 小明44 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
45 小明45 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
46 小明46 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
47 小明47 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
48 小明48 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
49 小明49 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
50 小明50 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
Time taken: 0.14 seconds, Fetched: 40 row(s)

// filtering on a partition column loads only the matching files, so it is fast
hive> select * from part_in where data_dt=20210701;
OK
1 小明1 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
2 小明2 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
3 小明3 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
4 小明4 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
5 小明5 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
6 小明6 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
7 小明7 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
8 小明8 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
9 小明9 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
10 小明10 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
11 小明11 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
12 小明12 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
13 小明13 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
14 小明14 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
15 小明15 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
16 小明16 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
17 小明17 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
18 小明18 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
19 小明19 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
20 小明20 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
Time taken: 0.784 seconds, Fetched: 20 row(s)

External table

Dropping an external table deletes only the metadata, never the data, which is safer; data warehouses usually use external tables.

create external table part_out (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
partitioned by (data_dt string, channel int)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
location '/var/bigdata/hive_remote/warehouse/part_out';

Loading from a local path copies the files into HDFS ([overwrite] is optional):
load data local inpath '/opt/part1.txt' [overwrite] into table part_out partition (data_dt='20210701',channel=110);
load data local inpath '/opt/part2.txt' [overwrite] into table part_out partition (data_dt='20210701',channel=120);
load data local inpath '/opt/part3.txt' [overwrite] into table part_out partition (data_dt='20210702',channel=110);
load data local inpath '/opt/part4.txt' [overwrite] into table part_out partition (data_dt='20210702',channel=120);

Loading from an HDFS path moves the files:
load data inpath '/data/part1.txt' into table part_out partition (data_dt='20210701',channel=110);

The directory layout is the same as for the managed table.

Partition operations

-- add a value to a partition column
alter table table_name add partition(col_name=col_value);
-- drop a partition value
alter table table_name drop partition(col_name=col_value);

Notes:
1. When adding a partition to a multi-column partitioned table, every partition column must be given a value.
2. When dropping, on both single- and multi-column partitioned tables you can drop a given partition; on a multi-column table, specifying only one of the columns cascades the delete to all matching sub-partitions.
3. When creating an external partitioned table over a directory that already contains data, the partitions must be repaired:
-- repair partitions
msck repair table psn7;
Dynamic partitioning

Bucketing only

create table part_bucket (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
clustered by (id) into 4 buckets
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

// load data cannot be used for bucketed tables
insert into table part_bucket select id,name,likes,address from part_in where data_dt='20210701' and channel=110;
insert into table part_bucket select id,name,likes,address from part_in where data_dt='20210701' and channel=120;


Partitioning plus bucketing

Managed table

create table part_in_bucket (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
partitioned by (data_dt string, channel int)
clustered by (id) into 4 buckets
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

// load data cannot be used for bucketed tables
// do not select *: that would produce 6 columns, including the partition columns; overwrite replaces existing data
insert overwrite table part_in_bucket partition (data_dt='20210701',channel=110)
select id,name,likes,address from part_in where data_dt='20210701' and channel=110;
insert overwrite table part_in_bucket partition (data_dt='20210701',channel=120)
select id,name,likes,address from part_in where data_dt='20210701' and channel=120;
insert overwrite table part_in_bucket partition (data_dt='20210702',channel=110)
select id,name,likes,address from part_in where data_dt='20210702' and channel=110;
insert overwrite table part_in_bucket partition (data_dt='20210702',channel=120)
select id,name,likes,address from part_in where data_dt='20210702' and channel=120;


hive> select * from part_in_bucket;
OK
8 小明8 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
4 小明4 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
9 小明9 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
5 小明5 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
1 小明1 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
10 小明10 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
6 小明6 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
2 小明2 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
7 小明7 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
3 小明3 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 110
20 小明20 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
16 小明16 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
12 小明12 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
17 小明17 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
13 小明13 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
18 小明18 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
14 小明14 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
19 小明19 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
15 小明15 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
11 小明11 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210701 120
28 小明28 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
24 小明24 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
29 小明29 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
25 小明25 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
21 小明21 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
30 小明30 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
26 小明26 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
22 小明22 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
27 小明27 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
23 小明23 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 110
48 小明48 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
44 小明44 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
49 小明49 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
45 小明45 ["lol","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
41 小明41 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
50 小明50 ["lol","book","skill"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
46 小明46 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
42 小明42 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
47 小明47 ["lol","book"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
43 小明43 ["lol","book","movie"] {"beijing":"mashibing","shanghai":"pudong"} 20210702 120
Time taken: 0.128 seconds, Fetched: 40 row(s)

Within each partition the row order is now scrambled: rows were hashed by id into different bucket files.
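The "scrambled" order can be checked: Hive buckets rows by hash(id) % 4, and for an int bucketing column the hash is the value itself, so id % 4 picks the bucket file. A quick sketch reproducing the grouping seen in the 20210701/110 partition above:

```python
def bucket_of(row_id, num_buckets=4):
    """Hive's bucket choice for an int bucketing column: the hash is the value itself."""
    return row_id % num_buckets

buckets = {}
for row_id in range(1, 11):          # ids 1..10 from the 20210701/110 partition
    buckets.setdefault(bucket_of(row_id), []).append(row_id)
print({b: buckets[b] for b in sorted(buckets)})
# {0: [4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7]} -- matches the bucket files read back above
```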

External table

create external table part_out_bucket (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
partitioned by (data_dt string, channel int)
clustered by (id) into 4 buckets
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
location '/var/bigdata/hive_remote/warehouse/part_out_bucket';

// load data cannot be used for bucketed tables
insert overwrite table part_out_bucket partition (data_dt='20210701',channel=110)
  select id,name,likes,address from part_out where data_dt='20210701' and channel=110;
// do not select *: that would produce 6 columns, including the partition columns
insert overwrite table part_out_bucket partition (data_dt='20210701',channel=120)
  select id,name,likes,address  from part_out where data_dt='20210701' and channel=120;

insert overwrite table part_out_bucket partition (data_dt='20210702',channel=110)
  select id,name,likes,address  from part_out where data_dt='20210702' and channel=110;

insert overwrite table part_out_bucket partition (data_dt='20210702',channel=120)
  select id,name,likes,address  from part_out where data_dt='20210702' and channel=120;
hive> select * from part_out_bucket;
OK
8    小明8    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
4    小明4    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
9    小明9    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
5    小明5    ["lol","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
1    小明1    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
10    小明10    ["lol","book","skill"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
6    小明6    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
2    小明2    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
7    小明7    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
3    小明3    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    110
20    小明20    ["lol","book","skill"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
16    小明16    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
12    小明12    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
17    小明17    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
13    小明13    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
18    小明18    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
14    小明14    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
19    小明19    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
15    小明15    ["lol","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
11    小明11    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210701    120
28    小明28    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
24    小明24    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
29    小明29    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
25    小明25    ["lol","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
21    小明21    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
30    小明30    ["lol","book","skill"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
26    小明26    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
22    小明22    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
27    小明27    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
23    小明23    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    110
48    小明48    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
44    小明44    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
49    小明49    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
45    小明45    ["lol","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
41    小明41    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
50    小明50    ["lol","book","skill"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
46    小明46    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
42    小明42    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
47    小明47    ["lol","book"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
43    小明43    ["lol","book","movie"]    {"beijing":"mashibing","shanghai":"pudong"}    20210702    120
Time taken: 0.117 seconds, Fetched: 40 row(s)

spark-sql operations

create table part_in_spark (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
partitioned by (data_dt string, channel int)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';


Partition columns are virtual: they do not exist in the data files, but they appear as columns in query results.

Loading from a local path copies the file into HDFS:
load data local inpath '/opt/part1.txt' into table part_in_spark partition (data_dt='20210701',channel=110);



create external table part_out_spark (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
partitioned by (data_dt string, channel int)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
location '/var/bigdata/hive_remote/warehouse/part_out_spark';


Local path; the file is copied into HDFS:
load data local inpath '/opt/part1.txt' overwrite into table part_out_spark partition (data_dt='20210701',channel=110);




create table part_in_spark_bucket (
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(data_dt string,channel int)
clustered by (id) into 4 buckets 
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';    



// load data cannot be used on a bucketed table
// Error in query: Output Hive table `default`.`part_in_spark_bucket` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
// Spark 2.3.4 does not support inserting into bucketed tables

insert overwrite table part_in_spark_bucket partition (data_dt='20210701',channel=110)
  select id,name,likes,address  from part_in_spark where data_dt='20210701' and channel=110;



create external table part_out_spark_bucket (
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(data_dt string,channel int)
clustered by (id) into 4 buckets 
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
location '/var/bigdata/hive_remote/warehouse/part_out_spark_bucket';



// load data cannot be used on a bucketed table
// Error in query: Output Hive table `default`.`part_in_spark_bucket` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
// Spark 2.3.4 does not support inserting into bucketed tables
insert overwrite table part_out_spark_bucket partition (data_dt='20210701',channel=110)
  select id,name,likes,address  from part_out where data_dt='20210701' and channel=110;



Operating on the original Hive tables

load data local inpath '/opt/part1.txt' into table part_in partition (data_dt='20210701',channel=110);
load data local inpath '/opt/part1.txt' [overwrite] into table part_out partition (data_dt='20210701',channel=110);
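A hypothetical way to verify the load landed in the right partition:

```sql
-- list the partitions that now exist, then sample the freshly loaded one
show partitions part_in;
select * from part_in where data_dt = '20210701' and channel = 110 limit 3;
```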

About compressed file formats

textFile can read gzip- and bzip2-compressed data directly.
One caveat: textFile only handles the compression codec, not archive formats. If the compressed file is a tar archive (e.g. .tar.gz), the tar header of each member file (file name, permissions, owner, the "ustar" magic, padding) is decoded as text and mixed in with the actual content.

For example:

a1  0000644 0000000 0000000 00000000025 14076475501 010006 0  ustar  root  root  a1a11
a2 a22
a3 a33
b1  0000644 0000000 0000000 00000000025 14076475516 010015 0  ustar  root  root  b1 b11
b2 b22
b3 b33

gzip is not splittable: the whole file becomes a single partition, so only one task reads it.
bzip2 is splittable: you can request a higher degree of parallelism / more partitions.
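A minimal spark-shell sketch of the difference (the HDFS paths are hypothetical):

```scala
// gzip is not splittable, so the requested minPartitions is effectively ignored
val gz = sc.textFile("hdfs://mycluster/data/a.txt.gz", 4)
println(gz.getNumPartitions) // 1

// bzip2 is splittable, so the requested parallelism can be honored
val bz = sc.textFile("hdfs://mycluster/data/a.txt.bz2", 4)
println(bz.getNumPartitions) // up to 4, depending on file size and split boundaries
```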

Creating tables with other storage formats


create table pua
(
 id int,
 name string
)
row format delimited
fields terminated by '\t'
stored as PARQUET;
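A hypothetical way to populate and inspect the parquet table from the same session, reusing the text-format table created earlier:

```sql
-- write rows from the text-format partitioned table into the parquet table
insert into table pua
  select id, name from part_in_spark where data_dt = '20210701' and channel = 110;

select * from pua limit 3;
```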


Check the table's current storage format with desc formatted:

hive> desc formatted pua;
OK
# col_name                data_type               comment             

id                      int                                         
name                    string                                      

# Detailed Table Information          
Database:               default                  
Owner:                  root                     
CreateTime:             Sat Jul 31 14:24:40 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://mycluster/var/bigdata/hive_remote/warehouse/pua     
Table Type:             MANAGED_TABLE            
Table Parameters:          
    COLUMN_STATS_ACCURATE    {\"BASIC_STATS\":\"true\"}
    numFiles                0                   
    numRows                 0                   
    rawDataSize             0                   
    totalSize               0                   
    transient_lastDdlTime    1627712680          

# Storage Information          
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe     
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat      // PARQUET
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat      //PARQUET
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:          
    field.delim             \t                  
    serialization.format    \t                  
Time taken: 0.057 seconds, Fetched: 32 row(s)


    ===================================

hive> desc formatted part_in_spark;
OK
# col_name                data_type               comment             

id                      int                                         
name                    string                                      
likes                   array<string>                               
address                 map<string,string>                          

# Partition Information          
# col_name                data_type               comment             

data_dt                 string                                      
channel                 int                                         

# Detailed Table Information          
Database:               default                  
Owner:                  root                     
CreateTime:             Sat Jul 31 13:57:46 CST 2021     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://mycluster/var/bigdata/hive_remote/warehouse/part_in_spark     
Table Type:             MANAGED_TABLE            
Table Parameters:          
    numFiles                1                   
    numPartitions           1                   
    numRows                 0                   
    rawDataSize             0                   
    spark.sql.create.version    2.3.4               
    spark.sql.sources.schema.numPartCols    2                   
    spark.sql.sources.schema.numParts    1                   
    spark.sql.sources.schema.part.0    {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"likes\",\"type\":{\"type\":\"array\",\"elementType\":\"string\",\"containsNull\":true},\"nullable\":true,\"metadata\":{}},{\"name\":\"address\",\"type\":{\"type\":\"map\",\"keyType\":\"string\",\"valueType\":\"string\",\"valueContainsNull\":true},\"nullable\":true,\"metadata\":{}},{\"name\":\"data_dt\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"channel\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
    spark.sql.sources.schema.partCol.0    data_dt             
    spark.sql.sources.schema.partCol.1    channel             
    totalSize               575                 
    transient_lastDdlTime    1627711066          

# Storage Information          
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe     
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        // TEXTFILE
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      //TEXTFILE
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:          
    colelction.delim        ,                   
    field.delim             \t                  
    mapkey.delim            :                   
    serialization.format    \t                  
Time taken: 0.097 seconds, Fetched: 48 row(s)
hive>

SparkStreaming

The code you write here is handed off to other threads for execution.
Scope comes in three levels:
application — runs once, when the application starts
job — runs once per batch, on the driver
rdd: task — runs once per record, on the executors

Plain top-level code:
println("aaaaaaa") // application scope, executes only once

val res: DStream[(String, Int)] = format.transform( // the function is invoked once per job (batch)
  rdd => {
    // this function body runs at job level
    println("bbbbbbbb") // job scope, on the driver
    rdd.map(x => {
      println("cccccc") // task scope, on an executor
      x
    })
  }
)

res.print() // triggers a job for every batch
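The three scope levels can be sketched as one self-contained program (a minimal sketch; the socket source on node01:9999, the 5-second batch interval, and the word-count body are assumptions for illustration, not from the notes above):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ScopeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("scope-demo")
    val ssc  = new StreamingContext(conf, Seconds(5))

    println("application scope: printed once, at startup")

    val lines = ssc.socketTextStream("node01", 9999)
    val res = lines.transform { rdd =>
      println("job scope: printed once per 5s batch, on the driver")
      rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    }
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```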