Overview

Technology Selection

Server Planning

Service          Sub-service             hadoop102   hadoop103   hadoop104
HDFS             NameNode                +
HDFS             DataNode                +           +           +
HDFS             SecondaryNameNode                               +
Yarn             NodeManager             +           +           +
Yarn             ResourceManager                     +
Zookeeper        Zookeeper Server        +           +           +
Flume (log)      Flume                   +           +
Kafka            Kafka                   +           +           +
Flume (consume)  Flume                                           +
Hive             Hive                    +
MySQL            MySQL                   +
Sqoop            Sqoop                   +
Presto           Coordinator             +
Presto           Worker                              +           +
Azkaban          AzkabanWebServer        +
Azkaban          AzkabanExecutorServer   +
Druid            Druid                   +           +           +
Kylin            Kylin                   +
HBase            HMaster                 +
HBase            HRegionServer           +           +           +
Superset         Superset                +
Atlas            Atlas                   +
Solr             Jar                     +
Griffin          Griffin                 +

Preparation

First set up the cluster and make sure every Hadoop component starts correctly.

Installing LZO

Download hadoop-lzo-0.4.20.jar, place it under /opt/module/hadoop-2.8.2/share/hadoop/common, then distribute it to hadoop103 and hadoop104 with the sync script.

  [root@hadoop102 common]# xsync hadoop-lzo-0.4.20.jar

Add LZO compression support in core-site.xml

  [root@hadoop102 hadoop-2.8.2]# cd etc/hadoop/
  [root@hadoop102 hadoop]# vim core-site.xml

  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
      <property>
          <name>io.compression.codecs</name>
          <value>
              org.apache.hadoop.io.compress.GzipCodec,
              org.apache.hadoop.io.compress.DefaultCodec,
              org.apache.hadoop.io.compress.BZip2Codec,
              org.apache.hadoop.io.compress.SnappyCodec,
              com.hadoop.compression.lzo.LzoCodec,
              com.hadoop.compression.lzo.LzopCodec
          </value>
      </property>
      <property>
          <name>io.compression.codec.lzo.class</name>
          <value>com.hadoop.compression.lzo.LzoCodec</value>
      </property>
  </configuration>

Distribute it

  [root@hadoop102 hadoop]# xsync core-site.xml

When actually using LZO files you also need to create an index for each uploaded file (only indexed LZO files can be split across map tasks):

  [root@hadoop102 module]# hadoop jar /opt/module/hadoop-2.8.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/filename.lzo

Here /input is the directory in HDFS that contains the file.

Parameter Tuning

The NameNode maintains a pool of worker threads that handles concurrent heartbeats from the DataNodes as well as concurrent metadata operations from clients. For large clusters, or clusters with many clients, this parameter usually needs to be increased.

dfs.namenode.handler.count defaults to 10; a commonly cited optimum is 20·ln(n), where n is the number of machines in the cluster.
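As an illustrative check of that formula (not part of the deployment itself): for the 3-node cluster here, 20·ln(3) ≈ 21.97, which truncates to the value 21 configured in hdfs-site.xml.

```shell
# Compute the recommended handler count for n = 3 nodes.
python3 -c "import math; print(int(20 * math.log(3)))"   # 21
```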

Edit hdfs-site.xml:

  <property>
      <name>dfs.namenode.handler.count</name>
      <value>21</value>
  </property>

Installing Zookeeper

See the Zookeeper notes for installation details; this section covers the cluster start/stop script, normally kept in /root/bin on hadoop102.

The script:

  #!/bin/bash
  case $1 in
  "start"){
      for i in hadoop102 hadoop103 hadoop104
      do
          ssh $i "/opt/module/zookeeper-3.4.10/bin/zkServer.sh start"
      done
  };;
  "stop"){
      for i in hadoop102 hadoop103 hadoop104
      do
          ssh $i "/opt/module/zookeeper-3.4.10/bin/zkServer.sh stop"
      done
  };;
  "status"){
      for i in hadoop102 hadoop103 hadoop104
      do
          ssh $i "/opt/module/zookeeper-3.4.10/bin/zkServer.sh status"
      done
  };;
  esac

Then make it executable:

  [root@hadoop102 bin]# chmod 777 zk.sh

Log Generation

Copy the log-generator jar log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar to hadoop102 and hadoop103.

Run it with java to write logs to a file:

  [root@hadoop102 module]# java -classpath log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar com.atguigu.appclient.AppMain >/opt/module/test.log

Or discard the output entirely:

  [root@hadoop102 module]# java -classpath log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar com.atguigu.appclient.AppMain 1>/dev/null 2>/dev/null

Here /dev/null is the "black hole": everything written to it disappears.

  • Standard input (fd 0): input from the keyboard, /proc/self/fd/0
  • Standard output (fd 1): output to the screen (console), /proc/self/fd/1
  • Standard error (fd 2): output to the screen (console), /proc/self/fd/2
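A small standalone sketch of these redirections (the paths and messages here are arbitrary):

```shell
#!/bin/bash
# fd 1 (stdout) redirected into the black hole: nothing is printed.
echo "discarded" 1>/dev/null
# fd 2 (stderr) discarded: the 'No such file' error disappears.
ls /no/such/path 2>/dev/null
# Both streams discarded, as in the log-generation command above.
ls /no/such/path >/dev/null 2>&1
echo "the script keeps running"
```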

Afterwards the logs appear under the /tmp/log directory.

Next, write the cluster log-generation script lg.sh, kept in /root/bin:

  #!/bin/bash
  for i in hadoop102 hadoop103
  do
      echo --------------- generating logs on $i ---------------
      ssh $i "java -classpath /opt/module/log-collector-1.0-SNAPSHOT-jar-with-dependencies.jar com.atguigu.appclient.AppMain $1 $2 >/dev/null 2>&1 &"
  done

Make it executable:

  [root@hadoop102 bin]# chmod 777 lg.sh

Time Synchronization

The time-synchronization script dt.sh:

  #!/bin/bash
  for i in hadoop102 hadoop103 hadoop104
  do
      echo --------------- syncing time on $i ---------------
      ssh $i "sudo date -s '$1'"
  done
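dt.sh receives a timestamp such as "2020-11-02 11:11:11" as $1, and the quoting around $1 matters: an unquoted expansion is split on the space. A standalone demonstration:

```shell
#!/bin/bash
# Simulate the script argument (a datetime containing a space).
set -- "2020-11-02 11:11:11"
# Unquoted: word splitting breaks the value into two arguments.
printf '<%s>' $1; echo      # <2020-11-02><11:11:11>
# Quoted: the timestamp survives as a single argument.
printf '<%s>' "$1"; echo    # <2020-11-02 11:11:11>
```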

Make it executable:

  [root@hadoop102 bin]# chmod 777 dt.sh

Linux Environment Variables

Linux shells come in two flavors: login shells and non-login shells (such as the shell started for an SSH remote command). A login shell automatically sources /etc/profile; a non-login shell only sources ~/.bashrc.

Environment variables set in /etc/profile apply to every user on the system; ~/.bashrc applies only to one particular user.

Therefore, append the contents of /etc/profile to ~/.bashrc so that SSH remote commands also see the right environment:

  [root@hadoop102 ~]# cat /etc/profile >> ~/.bashrc
  [root@hadoop103 ~]# cat /etc/profile >> ~/.bashrc
  [root@hadoop104 ~]# cat /etc/profile >> ~/.bashrc
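A quick way to see the difference between the two shell types, using bash's login_shell option:

```shell
# A login shell sources /etc/profile; `ssh host command` starts a non-login shell.
bash -lc 'shopt -q login_shell && echo login || echo non-login'   # login
bash -c  'shopt -q login_shell && echo login || echo non-login'   # non-login
```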

Status Query Script

A script that runs whatever command is passed to it on every node, for querying status in one go:

  #!/bin/bash
  for i in hadoop102 hadoop103 hadoop104
  do
      echo --------------- status of $i ---------------
      ssh $i "$*"
  done

Installing Flume (log collection)

The log-collection Flume agents go on hadoop102 and hadoop103; hadoop104 gets the consuming Flume later.

After unpacking the archive, rename flume-env.sh.template to flume-env.sh in /opt/module/flume-1.7.0/conf:

  [root@hadoop102 conf]# mv flume-env.sh.template flume-env.sh

Then set JAVA_HOME in flume-env.sh.

Distribute the installation to hadoop103 and hadoop104 (104 is included because it will run the consuming Flume later), and the installation is done.

This agent uses only two of Flume's components, a Source and a Channel:

  • Source: candidates include Taildir Source, Exec Source, and Spooling Directory Source. Taildir Source is used here because it supports resumable (checkpointed) reads and multiple directories.
  • Channel: a Kafka Channel is used, which writes straight to Kafka and eliminates the Sink stage, improving efficiency.

The configuration lives in /opt/module/flume-1.7.0/conf. Create a file named file-flume-kafka.conf (any descriptive name will do):

  # Agent components
  a1.sources = r1
  a1.channels = c1 c2

  # Configure the source: TAILDIR supports resumable reads across directories
  a1.sources.r1.type = TAILDIR
  # File recording the current read position in each log
  a1.sources.r1.positionFile = /opt/module/flume-1.7.0/test/log_position.json
  a1.sources.r1.filegroups = f1
  # Location of the logs to tail
  a1.sources.r1.filegroups.f1 = /tmp/logs/app.+
  a1.sources.r1.fileHeader = true
  a1.sources.r1.channels = c1 c2

  # Interceptors: ETL interceptor and log-type interceptor
  a1.sources.r1.interceptors = i1 i2
  a1.sources.r1.interceptors.i1.type = com.hxr.flume.interceptor.LogETLInterceptor$Builder
  a1.sources.r1.interceptors.i2.type = com.hxr.flume.interceptor.LogTypeInterceptor$Builder

  # Route events to channels by log type
  a1.sources.r1.selector.type = multiplexing
  a1.sources.r1.selector.header = topic
  a1.sources.r1.selector.mapping.topic_start = c1
  a1.sources.r1.selector.mapping.topic_event = c2

  # Configure the channels: start logs go to c1, event logs to c2
  a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
  a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
  a1.channels.c1.kafka.topic = topic_start
  a1.channels.c1.parseAsFlumeEvent = false
  a1.channels.c1.kafka.consumer.group.id = flume-consumer

  a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
  a1.channels.c2.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
  a1.channels.c2.kafka.topic = topic_event
  a1.channels.c2.parseAsFlumeEvent = false
  a1.channels.c2.kafka.consumer.group.id = flume-consumer

Distribute the configuration to hadoop103 and hadoop104.

Next, implement the interceptors. After packaging, place the jar in /opt/module/flume-1.7.0/lib and distribute it to hadoop103 and hadoop104.

Start Flume on hadoop102 and hadoop103:

  [root@hadoop102 flume-1.7.0]# bin/flume-ng agent --name a1 --conf-file conf/file-flume-kafka.conf &
  [root@hadoop103 flume-1.7.0]# bin/flume-ng agent --name a1 --conf-file conf/file-flume-kafka.conf &

To start multiple servers quickly, use a script:

  #!/bin/bash
  case $1 in
  "start"){
      for i in hadoop102 hadoop103
      do
          echo --------------- starting collection flume on $i ---------------
          ssh $i "nohup /opt/module/flume-1.7.0/bin/flume-ng agent --name a1 --conf-file /opt/module/flume-1.7.0/conf/file-flume-kafka.conf -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume-1.7.0/test1 2>&1 &"
      done
  };;
  "stop"){
      for i in hadoop102 hadoop103
      do
          echo --------------- stopping collection flume on $i ---------------
          ssh $i "ps -ef | grep file-flume-kafka | grep -v grep | awk '{print \$2}' | xargs kill"
      done
  };;
  esac
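One subtlety in the stop branch: inside a double-quoted ssh command string, $2 is expanded by the local shell (to the script's own second argument) before ssh ever runs, so a literal awk field reference must be written as \$2. A standalone demonstration:

```shell
#!/bin/bash
# Simulate the stop-script's positional parameters.
set -- stop extra
echo "awk '{print $2}'"     # local shell substitutes $2: awk '{print extra}'
echo "awk '{print \$2}'"    # escaped: the remote side sees awk '{print $2}'
```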

Installing Kafka

After unpacking the archive:

Create a logs directory:

  [root@hadoop102 kafka_2.11]# mkdir logs

Edit the configuration file:

  [root@hadoop102 kafka_2.11]# cd config/
  [root@hadoop102 config]# vim server.properties

In the configuration file set broker.id (it must be unique per machine), log.dirs (pointing at the logs directory just created), and the Zookeeper connection string:

  broker.id=0
  # topics can be truly deleted, not just marked for deletion
  delete.topic.enable=true
  log.dirs=/opt/module/kafka_2.11/logs
  zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181

Configure the environment variables by adding the Kafka installation directory to /etc/profile:

  #KAFKA_HOME
  export KAFKA_HOME=/opt/module/kafka_2.11
  export PATH=$PATH:$KAFKA_HOME/bin

Run source to apply it:

  [root@hadoop102 module]# source /etc/profile

Then distribute the installed Kafka and the updated profile.

Then edit broker.id in the configuration file on hadoop103 and hadoop104.
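A hypothetical helper for that step: rewrite broker.id in place with sed. It is demonstrated here against a temporary file; in practice the sed command would run over ssh against /opt/module/kafka_2.11/config/server.properties on hadoop103 and hadoop104.

```shell
#!/bin/bash
# Stand-in for server.properties on hadoop103.
f=$(mktemp)
printf 'broker.id=0\ndelete.topic.enable=true\n' > "$f"
# Give each broker a unique id: 1 on hadoop103, 2 on hadoop104.
sed -i 's/^broker\.id=.*/broker.id=1/' "$f"
grep '^broker.id' "$f"    # broker.id=1
rm -f "$f"
```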

Start Kafka:

  [root@hadoop102 kafka_2.11]# bin/kafka-server-start.sh config/server.properties &
  [root@hadoop103 kafka_2.11]# bin/kafka-server-start.sh config/server.properties &
  [root@hadoop104 kafka_2.11]# bin/kafka-server-start.sh config/server.properties &

Stop Kafka:

  bin/kafka-server-stop.sh

Or use a script:

  #!/bin/bash
  case $1 in
  "start"){
      for i in hadoop102 hadoop103 hadoop104
      do
          echo --------------- starting kafka on $i ---------------
          ssh $i "/opt/module/kafka_2.11/bin/kafka-server-start.sh -daemon /opt/module/kafka_2.11/config/server.properties"
      done
  };;
  "stop"){
      for i in hadoop102 hadoop103 hadoop104
      do
          echo --------------- stopping kafka on $i ---------------
          ssh $i "/opt/module/kafka_2.11/bin/kafka-server-stop.sh"
      done
  };;
  esac

Remember to make the script executable.

Next, verify the Kafka pipeline end to end.

Create the Kafka topics:

  [root@hadoop102 kafka_2.11]# bin/kafka-topics.sh --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181 --create --replication-factor 1 --partitions 1 --topic topic_start
  [root@hadoop102 kafka_2.11]# bin/kafka-topics.sh --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181 --create --replication-factor 1 --partitions 1 --topic topic_event

List the topics:

  [root@hadoop102 kafka_2.11]# bin/kafka-topics.sh --zookeeper hadoop102:2181 --list

If needed, topics can be deleted with:

  [root@hadoop102 kafka_2.11]# bin/kafka-topics.sh --delete --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181 --topic topic_start
  [root@hadoop102 kafka_2.11]# bin/kafka-topics.sh --delete --zookeeper hadoop102:2181,hadoop103:2181,hadoop104:2181 --topic topic_event

Produce messages:

  [root@hadoop102 kafka_2.11]# bin/kafka-console-producer.sh --broker-list hadoop102:9092 --topic topic_start

Consume messages:

  [root@hadoop102 kafka_2.11]# bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --from-beginning --topic topic_start

Kafka producer stress test:

  [root@hadoop102 kafka_2.11]# bin/kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 100000 --throughput -1 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092

Kafka consumer stress test:

  [root@hadoop102 kafka_2.11]# bin/kafka-consumer-perf-test.sh --zookeeper hadoop102:2181 --topic test --fetch-size 10000 --messages 10000000 --threads 1

Flume Consumer

On hadoop104, create kafka-flume-hdfs.conf in /opt/module/flume-1.7.0/conf with the following content:

  ## Components
  a1.sources = r1 r2
  a1.channels = c1 c2
  a1.sinks = k1 k2

  ## source1
  a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
  a1.sources.r1.batchSize = 5000
  a1.sources.r1.batchDurationMillis = 2000
  a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
  a1.sources.r1.kafka.topics = topic_start

  ## source2
  a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
  a1.sources.r2.batchSize = 5000
  a1.sources.r2.batchDurationMillis = 2000
  a1.sources.r2.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
  a1.sources.r2.kafka.topics = topic_event

  ## channel1
  a1.channels.c1.type = file
  a1.channels.c1.checkpointDir = /opt/module/flume-1.7.0/checkpoint/behavior1
  a1.channels.c1.dataDirs = /opt/module/flume-1.7.0/data/behavior1/
  a1.channels.c1.keep-alive = 6

  ## channel2
  a1.channels.c2.type = file
  a1.channels.c2.checkpointDir = /opt/module/flume-1.7.0/checkpoint/behavior2
  a1.channels.c2.dataDirs = /opt/module/flume-1.7.0/data/behavior2/
  a1.channels.c2.keep-alive = 6

  ## sink1
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
  a1.sinks.k1.hdfs.filePrefix = logstart-

  ## sink2
  a1.sinks.k2.type = hdfs
  a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
  a1.sinks.k2.hdfs.filePrefix = logevent-

  ## Avoid producing lots of small files; in production set rollInterval to 3600
  a1.sinks.k1.hdfs.rollInterval = 10
  a1.sinks.k1.hdfs.rollSize = 134217728
  a1.sinks.k1.hdfs.rollCount = 0
  a1.sinks.k2.hdfs.rollInterval = 10
  a1.sinks.k2.hdfs.rollSize = 134217728
  a1.sinks.k2.hdfs.rollCount = 0

  ## Write output files as LZO-compressed streams
  a1.sinks.k1.hdfs.fileType = CompressedStream
  a1.sinks.k2.hdfs.fileType = CompressedStream
  a1.sinks.k1.hdfs.codeC = lzop
  a1.sinks.k2.hdfs.codeC = lzop

  ## Wiring
  a1.sources.r1.channels = c1
  a1.sinks.k1.channel = c1
  a1.sources.r2.channels = c2
  a1.sinks.k2.channel = c2
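As a side note, the rollSize in the configuration above is simply 128 MB (a common HDFS block size) expressed in bytes:

```shell
echo $((128 * 1024 * 1024))   # 134217728
```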

Write the startup script f2.sh:

  #!/bin/bash
  case $1 in
  "start"){
      for i in hadoop104
      do
          echo --------------- starting consuming flume on $i ---------------
          ssh $i "nohup /opt/module/flume-1.7.0/bin/flume-ng agent --conf-file /opt/module/flume-1.7.0/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume-1.7.0/log.txt 2>&1 &"
      done
  };;
  "stop"){
      for i in hadoop104
      do
          echo --------------- stopping consuming flume on $i ---------------
          ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep | awk '{print \$2}' | xargs kill"
      done
  };;
  esac

Make the script executable with chmod 777 f2.sh.

Cluster Startup and Shutdown

Write a cluster.sh script that starts and stops everything with one command: HDFS, Yarn, Zookeeper, Flume, and Kafka.

  #!/bin/bash
  case $1 in
  "start"){
      echo --------------- starting cluster ---------------
      /opt/module/hadoop-2.8.2/sbin/start-dfs.sh
      ssh hadoop103 "/opt/module/hadoop-2.8.2/sbin/start-yarn.sh"
      zk.sh start
      sleep 4s
      f1.sh start
      kf.sh start
      sleep 6s
      f2.sh start
  };;
  "stop"){
      echo --------------- stopping cluster ---------------
      f2.sh stop
      kf.sh stop
      f1.sh stop
      zk.sh stop
      ssh hadoop103 "/opt/module/hadoop-2.8.2/sbin/stop-yarn.sh"
      /opt/module/hadoop-2.8.2/sbin/stop-dfs.sh
  };;
  esac

Installing MySQL

Upload the installation package, then unzip it:

  [root@hadoop102 software]# unzip mysql-libs.zip

Unzipping yields the client, the server, and the connector; the connector is the main piece used later.

Extracting the connector is a single step, covered below. Next install the client and the server; this requires root privileges.

Check whether MySQL is already installed on hadoop102:

  [root@hadoop102 mysql-libs]# rpm -qa|grep mysql

If so, remove it first:

  [root@hadoop102 mysql-libs]# rpm -e --nodeps <mysql package name>

One pitfall: CentOS 7 ships with a mariadb-libs package that conflicts with the MySQL server package, so remove the bundled mariadb-libs first:

  [root@hadoop102 mysql-libs]# yum remove mariadb-libs-5.5.65-1.el7.x86_64

There is one more issue after that: the autoconf library needs to be installed:

  [root@hadoop102 mysql-libs]# yum -y install autoconf

Once the machine is clean of MySQL, install the server:

  [root@hadoop102 mysql-libs]# rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm

Install the client:

  [root@hadoop102 mysql-libs]# rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm

Extract the connector:

  [root@hadoop102 mysql-libs]# tar -zxvf mysql-connector-java-5.1.27.tar.gz

Start the MySQL server:

  [root@hadoop102 mysql-libs]# service mysql start

Check its status:

  [root@hadoop102 mysql-libs]# service mysql status

Check the process and port 3306:

  [root@hadoop102 mysql-libs]# ps -ef|grep mysql
  [root@hadoop102 mysql-libs]# netstat -anop|grep 3306

At this point you should be able to log in without a password; once in, set a password, then log back in:

  [root@hadoop102 mysql-libs]# mysql
  mysql> set password=password('123456');
  [root@hadoop102 mysql-libs]# mysql -uroot -p123456

Once logged in, you can list the databases:

  mysql> show databases;

Switch to the mysql database and inspect the user table:

  mysql> use mysql;
  mysql> select user,host,password from user;

Allow access from any host:

  mysql> update user set host='%' where host='localhost';

Then delete the other, now redundant rows, keeping only the % entry:

  mysql> delete from user where Host='hadoop102';
  mysql> delete from user where Host='127.0.0.1';
  mysql> delete from user where Host='::1';

Exit:

  mysql> quit;

MySQL is now ready to use.

Installing Sqoop

After unpacking the archive, rename the configuration template in /opt/module/sqoop-1.4.6/conf:

  [root@hadoop102 conf]# mv sqoop-env-template.sh sqoop-env.sh

Add the following to sqoop-env.sh:

  export HADOOP_COMMON_HOME=/opt/module/hadoop-2.8.2
  export HADOOP_MAPRED_HOME=/opt/module/hadoop-2.8.2

Add whichever environments you need; here only Hadoop is added.

Copy in the JDBC driver, i.e. the connector from the MySQL installation:

  [root@hadoop102 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop-1.4.6/lib/

Verify the configuration:

  [root@hadoop102 sqoop-1.4.6]# bin/sqoop help

Test the database connection:

  [root@hadoop102 sqoop-1.4.6]# bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 123456

Synchronization Strategies

  • Full synchronization
  • Incremental synchronization
  • New-and-changed synchronization
  • Special strategy

Transfer data from MySQL to HDFS with the mysqlTohdfs.sh script:

  #!/bin/bash
  sqoop=/opt/module/sqoop-1.4.6/bin/sqoop
  if [ -n "$2" ]; then
      do_date=$2
  else
      do_date=`date -d '-1 day' +%F`
  fi

  # Generic import command: the first argument is the table name and the second
  # is the query; imported data is stored LZO-compressed, and an index is then
  # built for the compressed file.
  import_data(){
  $sqoop import \
  --connect jdbc:mysql://hadoop102:3306/gmall \
  --username root \
  --password 123456 \
  --target-dir /origin_data/gmall/db/$1/$do_date \
  --delete-target-dir \
  --query "$2 and \$CONDITIONS" \
  --num-mappers 1 \
  --fields-terminated-by '\t' \
  --compress \
  --compression-codec lzop \
  --null-string '\\N' \
  --null-non-string '\\N'

  hadoop jar /opt/module/hadoop-2.8.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/gmall/db/$1/$do_date
  }

  # Table-specific wrappers around the generic import: fully imported tables use
  # the condition "where 1=1"; incrementally imported tables filter on the
  # current date.
  import_order_info(){
  import_data order_info "select
  id,
  final_total_amount,
  order_status,
  user_id,
  out_trade_no,
  create_time,
  operate_time,
  province_id,
  benefit_reduce_amount,
  original_total_amount,
  feight_fee
  from order_info
  where (date_format(create_time,'%Y-%m-%d')='$do_date'
  or date_format(operate_time,'%Y-%m-%d')='$do_date')"
  }
  import_coupon_use(){
  import_data coupon_use "select
  id,
  coupon_id,
  user_id,
  order_id,
  coupon_status,
  get_time,
  using_time,
  used_time
  from coupon_use
  where (date_format(get_time,'%Y-%m-%d')='$do_date'
  or date_format(using_time,'%Y-%m-%d')='$do_date'
  or date_format(used_time,'%Y-%m-%d')='$do_date')"
  }
  import_order_status_log(){
  import_data order_status_log "select
  id,
  order_id,
  order_status,
  operate_time
  from order_status_log
  where date_format(operate_time,'%Y-%m-%d')='$do_date'"
  }
  import_activity_order(){
  import_data activity_order "select
  id,
  activity_id,
  order_id,
  create_time
  from activity_order
  where date_format(create_time,'%Y-%m-%d')='$do_date'"
  }
  import_user_info(){
  import_data "user_info" "select
  id,
  name,
  birthday,
  gender,
  email,
  user_level,
  create_time,
  operate_time
  from user_info
  where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'
  or DATE_FORMAT(operate_time,'%Y-%m-%d')='$do_date')"
  }
  import_order_detail(){
  import_data order_detail "select
  od.id,
  order_id,
  user_id,
  sku_id,
  sku_name,
  order_price,
  sku_num,
  od.create_time
  from order_detail od
  join order_info oi
  on od.order_id=oi.id
  where DATE_FORMAT(od.create_time,'%Y-%m-%d')='$do_date'"
  }
  import_payment_info(){
  import_data "payment_info" "select
  id,
  out_trade_no,
  order_id,
  user_id,
  alipay_trade_no,
  total_amount,
  subject,
  payment_type,
  payment_time
  from payment_info
  where DATE_FORMAT(payment_time,'%Y-%m-%d')='$do_date'"
  }
  import_comment_info(){
  import_data comment_info "select
  id,
  user_id,
  sku_id,
  spu_id,
  order_id,
  appraise,
  comment_txt,
  create_time
  from comment_info
  where date_format(create_time,'%Y-%m-%d')='$do_date'"
  }
  import_order_refund_info(){
  import_data order_refund_info "select
  id,
  user_id,
  order_id,
  sku_id,
  refund_type,
  refund_num,
  refund_amount,
  refund_reason_type,
  create_time
  from order_refund_info
  where date_format(create_time,'%Y-%m-%d')='$do_date'"
  }
  import_sku_info(){
  import_data sku_info "select
  id,
  spu_id,
  price,
  sku_name,
  sku_desc,
  weight,
  tm_id,
  category3_id,
  create_time
  from sku_info where 1=1"
  }
  import_base_category1(){
  import_data "base_category1" "select
  id,
  name
  from base_category1 where 1=1"
  }
  import_base_category2(){
  import_data "base_category2" "select
  id,
  name,
  category1_id
  from base_category2 where 1=1"
  }
  import_base_category3(){
  import_data "base_category3" "select
  id,
  name,
  category2_id
  from base_category3 where 1=1"
  }
  import_base_province(){
  import_data base_province "select
  id,
  name,
  region_id,
  area_code,
  iso_code
  from base_province
  where 1=1"
  }
  import_base_region(){
  import_data base_region "select
  id,
  region_name
  from base_region
  where 1=1"
  }
  import_base_trademark(){
  import_data base_trademark "select
  tm_id,
  tm_name
  from base_trademark
  where 1=1"
  }
  import_spu_info(){
  import_data spu_info "select
  id,
  spu_name,
  category3_id,
  tm_id
  from spu_info
  where 1=1"
  }
  import_favor_info(){
  import_data favor_info "select
  id,
  user_id,
  sku_id,
  spu_id,
  is_cancel,
  create_time,
  cancel_time
  from favor_info
  where 1=1"
  }
  import_cart_info(){
  import_data cart_info "select
  id,
  user_id,
  sku_id,
  cart_price,
  sku_num,
  sku_name,
  create_time,
  operate_time,
  is_ordered,
  order_time
  from cart_info
  where 1=1"
  }
  import_coupon_info(){
  import_data coupon_info "select
  id,
  coupon_name,
  coupon_type,
  condition_amount,
  condition_num,
  activity_id,
  benefit_amount,
  benefit_discount,
  create_time,
  range_type,
  spu_id,
  tm_id,
  category3_id,
  limit_num,
  operate_time,
  expire_time
  from coupon_info
  where 1=1"
  }
  import_activity_info(){
  import_data activity_info "select
  id,
  activity_name,
  activity_type,
  start_time,
  end_time,
  create_time
  from activity_info
  where 1=1"
  }
  import_activity_rule(){
  import_data activity_rule "select
  id,
  activity_id,
  condition_amount,
  condition_num,
  benefit_amount,
  benefit_discount,
  benefit_level
  from activity_rule
  where 1=1"
  }
  import_base_dic(){
  import_data base_dic "select
  dic_code,
  dic_name,
  parent_code,
  create_time,
  operate_time
  from base_dic
  where 1=1"
  }
  import_activity_sku(){
  import_data activity_sku "select
  id,
  activity_id,
  sku_id,
  create_time
  from activity_sku
  where 1=1"
  }

  # Decide which tables to import from the first argument. Passing "first"
  # means the initial run and imports every table; passing "all" imports
  # everything except the province/region tables.
  case $1 in
  "order_info")
      import_order_info
  ;;
  "base_category1")
      import_base_category1
  ;;
  "base_category2")
      import_base_category2
  ;;
  "base_category3")
      import_base_category3
  ;;
  "order_detail")
      import_order_detail
  ;;
  "sku_info")
      import_sku_info
  ;;
  "user_info")
      import_user_info
  ;;
  "payment_info")
      import_payment_info
  ;;
  "base_province")
      import_base_province
  ;;
  "base_region")
      import_base_region
  ;;
  "base_trademark")
      import_base_trademark
  ;;
  "activity_info")
      import_activity_info
  ;;
  "activity_order")
      import_activity_order
  ;;
  "cart_info")
      import_cart_info
  ;;
  "comment_info")
      import_comment_info
  ;;
  "coupon_info")
      import_coupon_info
  ;;
  "coupon_use")
      import_coupon_use
  ;;
  "favor_info")
      import_favor_info
  ;;
  "order_refund_info")
      import_order_refund_info
  ;;
  "order_status_log")
      import_order_status_log
  ;;
  "spu_info")
      import_spu_info
  ;;
  "activity_rule")
      import_activity_rule
  ;;
  "base_dic")
      import_base_dic
  ;;
  "activity_sku")
      import_activity_sku
  ;;
  "first")
      import_base_category1
      import_base_category2
      import_base_category3
      import_order_info
      import_order_detail
      import_sku_info
      import_user_info
      import_payment_info
      import_base_province
      import_base_region
      import_base_trademark
      import_activity_info
      import_activity_order
      import_cart_info
      import_comment_info
      import_coupon_use
      import_coupon_info
      import_favor_info
      import_order_refund_info
      import_order_status_log
      import_spu_info
      import_activity_rule
      import_base_dic
      import_activity_sku
  ;;
  "all")
      import_base_category1
      import_base_category2
      import_base_category3
      import_order_info
      import_order_detail
      import_sku_info
      import_user_info
      import_payment_info
      import_base_trademark
      import_activity_info
      import_activity_order
      import_cart_info
      import_comment_info
      import_coupon_use
      import_coupon_info
      import_favor_info
      import_order_refund_info
      import_order_status_log
      import_spu_info
      import_activity_rule
      import_base_dic
      import_activity_sku
  ;;
  esac

Remember to make the script executable.
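The do_date fallback in the script uses GNU date arithmetic to default to the day before the run date; with an explicit reference date it behaves like this:

```shell
# Default import date is the day before the given date.
date -d '2020-11-02 -1 day' +%F   # 2020-11-01
```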

Installing Hive

After unpacking the archive, copy the MySQL connector into Hive's lib directory:

  [root@hadoop102 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/module/hive-3.1.2/lib/

Create hive-site.xml in Hive's conf directory:

  [root@hadoop102 hive-3.1.2]# pwd
  /opt/module/hive-3.1.2
  [root@hadoop102 hive-3.1.2]# cd conf/
  [root@hadoop102 conf]# vim hive-site.xml

Add the following content:

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
      <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://hadoop102:3306/hadoop102?createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
      </property>
      <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>root</value>
          <description>username to use against the metastore database</description>
      </property>
      <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>123456</value>
          <description>password to use against the metastore database</description>
      </property>
      <property>
          <name>hive.metastore.warehouse.dir</name>
          <value>/user/hive/warehouse</value>
          <description>location of default database for the warehouse</description>
      </property>
      <property>
          <name>hive.cli.print.header</name>
          <value>true</value>
      </property>
      <property>
          <name>hive.cli.print.current.db</name>
          <value>true</value>
      </property>
      <property>
          <name>hive.metastore.schema.verification</name>
          <value>false</value>
      </property>
      <property>
          <name>datanucleus.schema.autoCreateAll</name>
          <value>true</value>
      </property>
      <property>
          <name>hive.metastore.uris</name>
          <value>thrift://hadoop102:9083</value>
      </property>
  </configuration>

Start the metastore service (the second form additionally discards its output):

  [root@hadoop102 hive-3.1.2]# nohup bin/hive --service metastore &
  [root@hadoop102 hive-3.1.2]# nohup bin/hive --service metastore >/dev/null 2>&1 &

Then start Hive as usual:

  [root@hadoop102 hive-3.1.2]# bin/hive

Integrating the Tez Engine

Upload the Tez archive to HDFS and also extract a copy locally:

  [root@hadoop102 software]# hadoop fs -mkdir /tez
  [root@hadoop102 software]# hadoop fs -put apache-tez-0.9.1-bin.tar.gz /tez/
  [root@hadoop102 software]# tar -zxvf apache-tez-0.9.1-bin.tar.gz -C /opt/module/

In Hive's configuration directory /opt/module/hive-3.1.2/conf, create a tez-site.xml file:

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
      <property>
          <name>tez.lib.uris</name>
          <value>${fs.defaultFS}/tez/apache-tez-0.9.1-bin.tar.gz</value>
      </property>
      <property>
          <name>tez.use.cluster.hadoop-libs</name>
          <value>true</value>
      </property>
      <property>
          <name>tez.history.logging.service.class</name>
          <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
      </property>
  </configuration>

Rename the environment template:

  [root@hadoop102 conf]# mv hive-env.sh.template hive-env.sh

Edit hive-env.sh and add the following:

  # TEZ_HOME points at your Tez extraction directory
  export TEZ_HOME=/opt/module/tez-0.9.1
  export TEZ_JARS=""
  for jar in `ls $TEZ_HOME |grep jar`; do
      export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
  done
  for jar in `ls $TEZ_HOME/lib`; do
      export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
  done
  export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.8.2/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS
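The two loops build a colon-separated jar list; because each iteration appends a leading colon, concatenating the result directly after the LZO jar yields a valid classpath. A standalone sketch with placeholder jar names:

```shell
#!/bin/bash
# Placeholder jar names; the real loops glob $TEZ_HOME and $TEZ_HOME/lib.
TEZ_JARS=""
for jar in tez-api.jar tez-dag.jar; do
    TEZ_JARS=$TEZ_JARS:$jar
done
echo "hadoop-lzo-0.4.20.jar$TEZ_JARS"   # hadoop-lzo-0.4.20.jar:tez-api.jar:tez-dag.jar
```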

In /opt/module/hive-3.1.2/conf, add the following to hive-site.xml to switch the execution engine:

  <property>
      <name>hive.execution.engine</name>
      <value>tez</value>
  </property>

In theory everything works at this point, but in practice Tez jobs are often killed by the NodeManager for exceeding their (virtual) memory allowance. The fix is to disable the virtual-memory check in yarn-site.xml.
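The YARN virtual-memory check is controlled by the standard property yarn.nodemanager.vmem-check-enabled; disabling it in yarn-site.xml looks like this:

```xml
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
```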

Distribute the change to the other machines and restart the cluster.