- Resource Planning
- 1. Package Preparation
- 2. Obtain the Base Image
- 3. Cluster Configuration Files
- 4. Cluster Startup Scripts
- 5. Image Build
- 6. Container Scripts
- 7. Livy Installation (Optional)
- 8. Run and Test
- 9. Backup
- References
# Resource Planning
- Container layout
There are 7 nodes in total, i.e. 7 containers are started. The "hadoop-master", "hadoop-node1" and "hadoop-node2" containers host the Hadoop and Spark cluster, the "hive" container hosts Hive, the "mysql" container hosts the MySQL database, the "zeppelin" container hosts Zeppelin, and the "azkaban" container hosts the Azkaban job scheduler. The fixed IP layout these containers use is sketched after the table below.
(Cluster topology diagram available on ProcessOn.)
Note: every container's hostname starts with hadoop-*. We want key-based SSH login between the containers, and instead of distributing a separate id_rsa.pub per container, an SSH filter rule (Host hadoop-*) is enough to let the containers reach each other.
Container | Hostname | Components |
---|---|---|
hadoop-master | hadoop-master | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、Hadoop-2.7.3(NameNode)、Hadoop-2.7.3(SecondaryNameNode)、Hadoop-2.7.3(ResourceManager)、Spark-2.3.0(Master) |
hadoop-node1 | hadoop-node1 | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、Hadoop-2.7.3(DataNode)、Hadoop-2.7.3(NodeManager)、Spark-2.3.0(Worker) |
hadoop-node2 | hadoop-node2 | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、Hadoop-2.7.3(DataNode)、Hadoop-2.7.3(NodeManager)、Spark-2.3.0(Worker) |
hive | hadoop-hive | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、Hive-2.3.2 |
mysql | hadoop-mysql | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、MySQL-5.5.45 |
zeppelin | hadoop-zeppelin | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、Zeppelin-0.8.2 |
azkaban | hadoop-azkaban | ubuntu:18.04、JDK-1.8.221、Scala-2.11.8、Azkaban-3.90.0 |
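For reference, the container scripts in section 6 place all of these containers on a dedicated Docker bridge network with fixed addresses; the layout they assume looks like this:
```bash
# Network "bigdata" (172.16.0.0/16), as created by scripts/build_network.sh
# hadoop-master   172.16.0.2
# hadoop-node1    172.16.0.3
# hadoop-node2    172.16.0.4
# hadoop-hive     172.16.0.5
# hadoop-mysql    172.16.0.6
# hadoop-zeppelin 172.16.0.7
# hadoop-azkaban  172.16.0.8
docker network create --subnet=172.16.0.0/16 bigdata
```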
- Directory layout
Create the directory tree:
```bash
mkdir -p /root/docker/bigdata && cd /root/docker/bigdata
mkdir -p /root/docker/bigdata/config
mkdir -p /root/docker/bigdata/software
mkdir -p /root/docker/bigdata/scripts
```
Directory / File | Description |
---|---|
/root/docker/bigdata | Root directory |
/root/docker/bigdata/config | Configuration directory |
/root/docker/bigdata/config/ssh_config | SSH config for passwordless login between containers |
/root/docker/bigdata/config/profile | System environment variables |
/root/docker/bigdata/config/master | Hadoop master node declaration file |
/root/docker/bigdata/config/slaves | Hadoop slave node declaration file |
/root/docker/bigdata/config/zeppelin-env.sh | Zeppelin environment variables |
/root/docker/bigdata/config/zeppelin-site.xml | Zeppelin configuration (HTTP port changed from the default 8080 to 18080) |
/root/docker/bigdata/config/hadoop-env.sh | Environment variables required by Hadoop |
/root/docker/bigdata/config/hdfs-site.xml | NameNode and DataNode settings. The NameNode stores metadata; the DataNodes store the data files. |
/root/docker/bigdata/config/core-site.xml | Master node address and proxy-user access settings |
/root/docker/bigdata/config/yarn-site.xml | YARN settings |
/root/docker/bigdata/config/mapred-site.xml | MapReduce settings |
/root/docker/bigdata/config/start-hadoop.sh | Hadoop startup script (formats HDFS, then starts HDFS, YARN and Spark) |
/root/docker/bigdata/config/restart-hadoop.sh | Hadoop restart script (no HDFS formatting) |
/root/docker/bigdata/config/init_mysql.sh | MySQL initialization script |
/root/docker/bigdata/config/hive-site.xml | Hive configuration |
/root/docker/bigdata/config/init_hive.sh | Hive initialization script |
/root/docker/bigdata/config/spark-defaults.conf | Spark default configuration |
/root/docker/bigdata/config/spark-env.sh | Environment variables required by Spark |
/root/docker/bigdata/config/pip.conf | Douban pip mirror, to speed up pip for hosts in mainland China |
/root/docker/bigdata/software | Package directory |
/root/docker/bigdata/software/jdk-8u221-linux-x64.tar.gz | JDK package |
/root/docker/bigdata/software/scala-2.11.8.tgz | Scala package |
/root/docker/bigdata/software/mysql-5.5.45-linux2.6-x86_64.tar.gz | MySQL package |
/root/docker/bigdata/software/mysql-connector-java-5.1.37.jar | MySQL JDBC connector |
/root/docker/bigdata/software/hadoop-2.7.3.tar.gz | Hadoop package |
/root/docker/bigdata/software/apache-hive-2.3.2-bin.tar.gz | Hive package |
/root/docker/bigdata/software/spark-2.3.0-bin-hadoop2.7.tgz | Spark package |
/root/docker/bigdata/software/zeppelin-0.8.2-bin-all.tgz | Zeppelin package |
/root/docker/bigdata/software/azkaban-3.90.0-solo-server.fix.build.tar | Azkaban package (built from source yourself) |
/root/docker/bigdata/software/derby.jar | Derby database driver (used by Azkaban) |
/root/docker/bigdata/Dockerfile | Dockerfile |
/root/docker/bigdata/build.sh | Image build script |
/root/docker/bigdata/sources.list | Ubuntu APT sources |
/root/docker/bigdata/scripts | Helper script directory |
/root/docker/bigdata/scripts/build_network.sh | Container network creation script |
/root/docker/bigdata/scripts/start_container.sh | Container start script |
/root/docker/bigdata/scripts/restart_container.sh | Container restart script |
/root/docker/bigdata/scripts/stop_container.sh | Container stop-and-remove script |
/root/docker/bigdata/download.sh | Package download script |
Attachment: docker_bigdata-without-software_20201218.zip
# 1. Package Preparation
- Download the packages
```bash
rm -rf /root/docker/bigdata/download.sh
vi /root/docker/bigdata/download.sh
chmod a+x /root/docker/bigdata/download.sh
cd /root/docker/bigdata/
./download.sh
```
Content of download.sh:
```bash
#!/usr/bin/env bash
hadoop_v=2.7.3
spark_v=2.3.0
scala_v=2.11.8
hive_v=2.3.2
zeppelin_v=0.8.2
mysql_v=5.5.45
#mysql_v=5.7.30
mysql_connector_v=5.1.37
#mysql_connector_v=5.1.40
#zookeeper_v=3.5.6
#kafka_v=2.3.1
#tez_v=0.9.2
#flink_v=1.10.0

# font color
RED="\033[0;31m"
GREEN="\033[0;32m"
BLUE="\033[0;34m"
# No Color
NC="\033[0m"

#mirror_prefix="http://mirrors.tuna.tsinghua.edu.cn"
#mirror_prefix="https://mirror.bit.edu.cn"
#hadoop_url=${mirror_prefix}/apache/hadoop/common/hadoop-${hadoop_v}/hadoop-${hadoop_v}.tar.gz
#spark_url=${mirror_prefix}/apache/spark/spark-${spark_v}/spark-${spark_v}-bin-hadoop2.7.tgz
#zookeeper_url=${mirror_prefix}/apache/zookeeper/zookeeper-${zookeeper_v}/apache-zookeeper-${zookeeper_v}-bin.tar.gz
#kafka_url=${mirror_prefix}/apache/kafka/${kafka_v}/kafka_${scala_v}-${kafka_v}.tgz
#hive_url=${mirror_prefix}/apache/hive/hive-${hive_v}/apache-hive-${hive_v}-bin.tar.gz
#tez_url=${mirror_prefix}/apache/tez/${tez_v}/apache-tez-${tez_v}-bin.tar.gz
#flink_url=${mirror_prefix}/apache/flink/flink-${flink_v}/flink-${flink_v}-bin-scala_${scala_v}.tgz
#flink_required_jar="https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-7.0/flink-shaded-hadoop-2-uber-2.8.3-7.0.jar"

# DIY url
hadoop_url=http://archive.apache.org/dist/hadoop/core/hadoop-${hadoop_v}/hadoop-${hadoop_v}.tar.gz
spark_url=http://archive.apache.org/dist/spark/spark-${spark_v}/spark-${spark_v}-bin-hadoop2.7.tgz
hive_url=http://archive.apache.org/dist/hive/hive-${hive_v}/apache-hive-${hive_v}-bin.tar.gz
zeppelin_url=http://archive.apache.org/dist/zeppelin/zeppelin-${zeppelin_v}/zeppelin-${zeppelin_v}-bin-all.tgz
scala_url=https://downloads.lightbend.com/scala/${scala_v}/scala-${scala_v}.tgz
#mysql_url=https://cdn.mysql.com/archives/mysql-5.7/mysql-${mysql_v}-linux-glibc2.12-x86_64.tar.gz
mysql_connector_url=https://repo1.maven.org/maven2/mysql/mysql-connector-java/${mysql_connector_v}/mysql-connector-java-${mysql_connector_v}.jar
mysql_url=https://cdn.mysql.com/archives/mysql-5.5/mysql-${mysql_v}-linux2.6-x86_64.tar.gz

colorful_echo() {
    echo -e "${1}${2}${NC}"
}

download() {
    if [ -f "software/$2" ]; then
        colorful_echo ${RED} "$2 already exists, skip it."
    else
        colorful_echo ${GREEN} "Start downloading $2 from $1 ..."
        wget $1 -O "software/$2" > /dev/null 2>&1
        colorful_echo ${BLUE} "$2 downloaded successfully."
    fi
}

download "${hadoop_url}" "hadoop-${hadoop_v}.tar.gz"
download "${spark_url}" "spark-${spark_v}-bin-hadoop2.7.tgz"
download "${scala_url}" "scala-${scala_v}.tgz"
download "${hive_url}" "apache-hive-${hive_v}-bin.tar.gz"
download "${zeppelin_url}" "zeppelin-${zeppelin_v}-bin-all.tgz"
#download "${mysql_url}" "mysql-${mysql_v}-linux-glibc2.12-x86_64.tar.gz"
download "${mysql_url}" "mysql-${mysql_v}-linux2.6-x86_64.tar.gz"
download "${mysql_connector_url}" "mysql-connector-java-${mysql_connector_v}.jar"

colorful_echo ${GREEN} 'Download the successful completion!'
```
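The download function above skips any file that already exists in software/, so a truncated download from an earlier run would go unnoticed. A small sanity check you can run afterwards (not part of the original script) is to confirm that every archive is readable:
```bash
#!/usr/bin/env bash
# Verify that each downloaded archive is non-empty and can be listed by tar.
cd /root/docker/bigdata/software
for f in *.tar.gz *.tgz; do
    [ -e "$f" ] || continue
    if tar -tzf "$f" > /dev/null 2>&1; then
        echo "OK   $f ($(du -h "$f" | cut -f1))"
    else
        echo "FAIL $f (truncated or corrupt download?)"
    fi
done
```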
<a name="f3PUq"></a>
# 2. 获取镜像
<a name="f0kIX"></a>
## 1. 获取官方镜像
```bash
# Check the available stable versions
docker search ubuntu
# Version: ubuntu:18.04; image size: 63.2 MB
docker pull ubuntu:18.04
docker image ls |grep ubuntu
```

## 2. Update the APT Sources (Optional)
- Option 1: modify the sources inside a container, then commit a new image
```bash
# Start a container and enter it
docker run --name ubuntu -it -d ubuntu:18.04
docker exec -it ubuntu /bin/bash
# Check the OS version
cat /etc/issue
# Note DISTRIB_CODENAME, which is needed when configuring the APT sources
# Ubuntu 12.04 (LTS) codename: "precise"
# Ubuntu 14.04 (LTS) codename: "trusty"
# Ubuntu 15.04 codename: "vivid"
# Ubuntu 15.10 codename: "wily"
# Ubuntu 16.04 (LTS) codename: "xenial"
# Ubuntu 18.04 (LTS) codename: "bionic"
cat /etc/lsb-release
cd /etc/apt
# Back up the original file (the sources must match the OS release, which can be
# determined with the commands above or from the existing sources.list)
mv sources.list sources.list.bak
# Write the new sources
echo '# aliyun' > sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse >> sources.list
echo deb http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse >> sources.list
echo deb-src http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse >> sources.list

echo '# ubuntu' >> sources.list
echo deb http://archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse >> sources.list
echo deb http://archive.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse >> sources.list
echo deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe multiverse >> sources.list
echo deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main restricted universe multiverse >> sources.list
echo deb http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse >> sources.list
echo deb-src http://archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse >> sources.list
echo deb-src http://archive.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse >> sources.list
echo deb-src http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe multiverse >> sources.list
echo deb-src http://archive.ubuntu.com/ubuntu/ bionic-proposed main restricted universe multiverse >> sources.list
echo deb-src http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse >> sources.list

echo '# ubuntu other' >> sources.list
echo deb http://archive.canonical.com/ubuntu/ bionic partner >> sources.list
```
Commit the image:
```bash
exit
docker commit -a "polaris<450733605@qq.com>" -m "Basic Image ubuntu:18.04" ubuntu my-ubuntu:v18.04
```
- Option 2: use the Dockerfile ADD instruction to copy a prepared sources list over the one in the base image
```bash
rm -rf /root/docker/bigdata/sources.list
vi /root/docker/bigdata/sources.list
```
Content:
```
# aliyun
deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse

# ubuntu
deb http://archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse

# ubuntu other
deb http://archive.canonical.com/ubuntu/ bionic partner
```
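Before baking this sources.list into the image, it can be worth checking that the mirrors are reachable from your network. A minimal sketch (not part of the original workflow) that bind-mounts the file into a throwaway container and refreshes the package index:
```bash
docker run --rm \
  -v /root/docker/bigdata/sources.list:/etc/apt/sources.list:ro \
  ubuntu:18.04 bash -c "apt-get update"
```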
<a name="i1wuf"></a>
# 3. 集群配置文件准备
<a name="hXVQz"></a>
## 1. Hadoop配置
<a name="fsecK"></a>
### 1. hadoop-env.sh
声明Hadoop需要的环境变量。
```bash
vi /root/docker/bigdata/config/hadoop-env.sh
内容如下:
export JAVA_HOME=/usr/local/jdk1.8.0_221
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_IDENT_STRING=$USER
```
### 2. hdfs-site.xml
Configures the Hadoop NameNode and DataNode. The NameNode stores metadata; the DataNodes store the data files.
```bash
vi /root/docker/bigdata/config/hdfs-site.xml
```
Content:
```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop2.7/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop2.7/dfs/data</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
```
### 3. core-site.xml
Sets the master node address (hadoop-master) and the proxy-user access permissions. Note: the hostname given to the master container at startup must match this value.
```bash
vi /root/docker/bigdata/config/core-site.xml
```
Content:
```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>httpfs.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>httpfs.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.livy.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.livy.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
</configuration>
```
### 4. yarn-site.xml
```bash
vi /root/docker/bigdata/config/yarn-site.xml
```
Content:
```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8088</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>22528</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>16384</value>
</property>
</configuration>
```
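A quick sanity check on the memory figures above (they only make sense if the Docker host can actually grant each NodeManager this much RAM):
```bash
# yarn.nodemanager.resource.memory-mb  = 22528 MB per NodeManager
# yarn.scheduler.minimum-allocation-mb =  4096 MB -> at most 5 minimum-sized containers per node (22528 / 4096 = 5.5)
# yarn.scheduler.maximum-allocation-mb = 16384 MB -> a single container may claim up to 16 GB of a node
```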
### 5. mapred-site.xml
```bash
vi /root/docker/bigdata/config/mapred-site.xml
```
Content:
```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-master:10020</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/stage</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
</configuration>
```
### 6. master
Master node declaration file.
```bash
vi /root/docker/bigdata/config/master
```
Content:
```
hadoop-master
```
## 2. Hive Configuration
Main purposes:
- Set "hive.server2.transport.mode" to binary so that HiveServer2 accepts JDBC connections.
- Point Hive's metastore database at MySQL.
```bash
vi /root/docker/bigdata/config/hive-site.xml
```
Content:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop-hive:9083</value>
  </property>
  <property>
    <name>hive.server2.transport.mode</name>
    <value>binary</value>
  </property>
  <property>
    <name>hive.server2.thrift.http.port</name>
    <value>10001</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop-mysql:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
</configuration>
```
## 3. Spark Configuration
### 1. spark-env.sh
Declares the environment variables Spark needs.
```bash
vi /root/docker/bigdata/config/spark-env.sh
```
Content:
```bash
SPARK_MASTER_WEBUI_PORT=8888
export SPARK_HOME=$SPARK_HOME
export HADOOP_HOME=$HADOOP_HOME
export MASTER=spark://hadoop-master:7077
export SCALA_HOME=$SCALA_HOME
export SPARK_MASTER_HOST=hadoop-master
export JAVA_HOME=/usr/local/jdk1.8.0_221
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```
### 2. spark-defaults.conf
Spark's default configuration.
```bash
vi /root/docker/bigdata/config/spark-defaults.conf
```
Content:
```
spark.executor.memory=2G
spark.driver.memory=2G
spark.executor.cores=2
#spark.sql.codegen.wholeStage=false
#spark.memory.offHeap.enabled=true
#spark.memory.offHeap.size=4G
#spark.memory.fraction=0.9
#spark.memory.storageFraction=0.01
#spark.kryoserializer.buffer.max=64m
#spark.shuffle.manager=sort
#spark.sql.shuffle.partitions=600
spark.speculation=true
spark.speculation.interval=5000
spark.speculation.quantile=0.9
spark.speculation.multiplier=2
spark.default.parallelism=1000
spark.driver.maxResultSize=1g
#spark.rdd.compress=false
spark.task.maxFailures=8
spark.network.timeout=300
spark.yarn.max.executor.failures=200
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=4
spark.dynamicAllocation.maxExecutors=8
spark.dynamicAllocation.executorIdleTimeout=60
#spark.serializer=org.apache.spark.serializer.JavaSerializer
#spark.sql.adaptive.enabled=true
#spark.sql.adaptive.shuffle.targetPostShuffleInputSize=100000000
#spark.sql.adaptive.minNumPostShufflePartitions=1
# for spark2.0
#spark.sql.hive.verifyPartitionPath=true
#spark.sql.warehouse.dir
spark.sql.warehouse.dir=/spark/warehouse
```
### 3. slaves
Slave (worker) node declaration file.
```bash
vi /root/docker/bigdata/config/slaves
```
Content:
```
hadoop-node1
hadoop-node2
```
## 4. Zeppelin Configuration
### 1. zeppelin-env.sh
```bash
vi /root/docker/bigdata/config/zeppelin-env.sh
```
Content:
```bash
export JAVA_HOME=/usr/local/jdk1.8.0_221
export MASTER=spark://hadoop-master:7077
export SPARK_HOME=$SPARK_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```
### 2. zeppelin-site.xml
The HTTP port is changed from the default 8080 to 18080. To make it easier to load third-party packages, the Maven repository (mvnRepo) is also switched to the Aliyun mirror.
```bash
vi /root/docker/bigdata/config/zeppelin-site.xml
```
Content:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>zeppelin.server.addr</name><value>0.0.0.0</value></property>
  <property><name>zeppelin.server.port</name><value>18080</value></property>
  <property><name>zeppelin.server.ssl.port</name><value>18443</value></property>
  <property><name>zeppelin.server.context.path</name><value>/</value></property>
  <property><name>zeppelin.war.tempdir</name><value>webapps</value></property>
  <property><name>zeppelin.notebook.dir</name><value>notebook</value></property>
  <property><name>zeppelin.notebook.homescreen</name><value></value></property>
  <property><name>zeppelin.notebook.homescreen.hide</name><value>false</value></property>
  <property><name>zeppelin.notebook.storage</name><value>org.apache.zeppelin.notebook.repo.GitNotebookRepo</value></property>
  <property><name>zeppelin.notebook.one.way.sync</name><value>false</value></property>
  <property><name>zeppelin.interpreter.dir</name><value>interpreter</value></property>
  <property><name>zeppelin.interpreter.localRepo</name><value>local-repo</value></property>
  <property><name>zeppelin.interpreter.dep.mvnRepo</name><value>http://maven.aliyun.com/nexus/content/groups/public/</value></property>
  <property><name>zeppelin.dep.localrepo</name><value>local-repo</value></property>
  <property><name>zeppelin.helium.node.installer.url</name><value>https://nodejs.org/dist/</value></property>
  <property><name>zeppelin.helium.npm.installer.url</name><value>http://registry.npmjs.org/</value></property>
  <property><name>zeppelin.helium.yarnpkg.installer.url</name><value>https://github.com/yarnpkg/yarn/releases/download/</value></property>
  <property><name>zeppelin.interpreters</name><value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.rinterpreter.RRepl,org.apache.zeppelin.rinterpreter.KnitR,org.apache.zeppelin.spark.SparkRInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.file.HDFSFileInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,org.apache.zeppelin.python.PythonInterpreter,org.apache.zeppelin.python.PythonInterpreterPandasSql,org.apache.zeppelin.python.PythonCondaInterpreter,org.apache.zeppelin.python.PythonDockerInterpreter,org.apache.zeppelin.lens.LensInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter,org.apache.zeppelin.cassandra.CassandraInterpreter,org.apache.zeppelin.geode.GeodeOqlInterpreter,org.apache.zeppelin.jdbc.JDBCInterpreter,org.apache.zeppelin.kylin.KylinInterpreter,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter,org.apache.zeppelin.scalding.ScaldingInterpreter,org.apache.zeppelin.alluxio.AlluxioInterpreter,org.apache.zeppelin.hbase.HbaseInterpreter,org.apache.zeppelin.livy.LivySparkInterpreter,org.apache.zeppelin.livy.LivyPySparkInterpreter,org.apache.zeppelin.livy.LivyPySpark3Interpreter,org.apache.zeppelin.livy.LivySparkRInterpreter,org.apache.zeppelin.livy.LivySparkSQLInterpreter,org.apache.zeppelin.bigquery.BigQueryInterpreter,org.apache.zeppelin.beam.BeamInterpreter,org.apache.zeppelin.pig.PigInterpreter,org.apache.zeppelin.pig.PigQueryInterpreter,org.apache.zeppelin.scio.ScioInterpreter,org.apache.zeppelin.groovy.GroovyInterpreter</value></property>
  <property><name>zeppelin.interpreter.group.order</name><value>spark,md,angular,sh,livy,alluxio,file,psql,flink,python,ignite,lens,cassandra,geode,kylin,elasticsearch,scalding,jdbc,hbase,bigquery,beam,groovy</value></property>
  <property><name>zeppelin.interpreter.connect.timeout</name><value>30000</value></property>
  <property><name>zeppelin.interpreter.output.limit</name><value>102400</value></property>
  <property><name>zeppelin.ssl</name><value>false</value></property>
  <property><name>zeppelin.ssl.client.auth</name><value>false</value></property>
  <property><name>zeppelin.ssl.keystore.path</name><value>keystore</value></property>
  <property><name>zeppelin.ssl.keystore.type</name><value>JKS</value></property>
  <property><name>zeppelin.ssl.keystore.password</name><value>change me</value></property>
  <property><name>zeppelin.ssl.truststore.path</name><value>truststore</value></property>
  <property><name>zeppelin.ssl.truststore.type</name><value>JKS</value></property>
  <property><name>zeppelin.server.allowed.origins</name><value>*</value></property>
  <property><name>zeppelin.anonymous.allowed</name><value>true</value></property>
  <property><name>zeppelin.username.force.lowercase</name><value>false</value></property>
  <property><name>zeppelin.notebook.default.owner.username</name><value></value></property>
  <property><name>zeppelin.notebook.public</name><value>true</value></property>
  <property><name>zeppelin.websocket.max.text.message.size</name><value>1024000</value></property>
  <property><name>zeppelin.server.default.dir.allowed</name><value>false</value></property>
</configuration>
```
## 5. pip Configuration (Optional)
Configures the Douban pip mirror to speed up pip for hosts in mainland China.
```bash
vi /root/docker/bigdata/config/pip.conf
```
Content:
```
[global]
index-url = http://pypi.douban.com/simple
trusted-host = pypi.douban.com
```
## 6. Azkaban Driver Configuration
Many JDK releases after jdk-8u121-linux-x64.tar.gz no longer include db\lib\derby.jar, so running the Azkaban project throws the following exception: Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.AutoloadedDriver40
Workaround (already handled in the Dockerfile):
Extract jdk-8u121-linux-x64.tar.gz, locate derby.jar, copy it into ${AZKABAN_HOME}/lib, and restart Azkaban.
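For reference, the manual workaround looks roughly like the sketch below (paths and script names are illustrative; the Dockerfile in this post instead ADDs a pre-extracted derby.jar from the software directory):
```bash
# Pull derby.jar out of an older JDK tarball that still ships JavaDB,
# drop it into Azkaban's lib directory, then restart the solo server.
tar -xzf jdk-8u121-linux-x64.tar.gz jdk1.8.0_121/db/lib/derby.jar
cp jdk1.8.0_121/db/lib/derby.jar ${AZKABAN_HOME}/lib/
${AZKABAN_HOME}/bin/shutdown-solo.sh && ${AZKABAN_HOME}/bin/start-solo.sh
```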
# 4. Cluster Startup Scripts
## 1. Environment Variables
```bash
vi /root/docker/bigdata/config/profile
```
Content:
```bash
export JAVA_HOME=/usr/local/jdk1.8.0_221
export SCALA_HOME=/usr/local/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SPARK_HOME=/usr/local/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.2-bin
export MYSQL_HOME=/usr/local/mysql
export PATH=$HIVE_HOME/bin:$MYSQL_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
```
<a name="uRN5y"></a>
## 2. SSH配置
各个容器需要通过网络端口连接在一起,为方便连接访问,使用SSH无验证登录。
```bash
vi /root/docker/bigdata/config/ssh_config
```
Content:
```
Host localhost
StrictHostKeyChecking no
Host 0.0.0.0
StrictHostKeyChecking no
Host hadoop-*
StrictHostKeyChecking no
```
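Once the image is built and the containers are running, the effect of this file can be checked with a quick loop from inside hadoop-master (a sketch; it assumes the containers from section 6 are already up):
```bash
# Each hop should print the remote hostname without a password or host-key prompt.
for h in hadoop-node1 hadoop-node2 hadoop-hive hadoop-mysql; do
    ssh "$h" hostname
done
```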
## 3. Hadoop Cluster Scripts
### 1. Start Script
```bash
vi /root/docker/bigdata/config/start-hadoop.sh
```
Content:
```bash
#!/bin/bash
hdfs namenode -format -force
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$SPARK_HOME/sbin/start-all.sh
hdfs dfs -mkdir /mr-history
hdfs dfs -mkdir /stage
```
### 2. Restart Script
```bash
vi /root/docker/bigdata/config/restart-hadoop.sh
```
Content:
```bash
#!/bin/bash
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$SPARK_HOME/sbin/start-all.sh
hdfs dfs -mkdir /mr-history
hdfs dfs -mkdir /stage
```
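On a restart these two mkdir calls will complain that the directories already exist; the warning is harmless, but if you prefer a quiet log the idempotent form would be (optional tweak, not in the original script):
```bash
hdfs dfs -mkdir -p /mr-history /stage
```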
## 4. MySQL Script
MySQL initialization script.
```bash
vi /root/docker/bigdata/config/init_mysql.sh
```
Content:
```bash
#!/bin/bash
cd /usr/local/mysql/
echo ..........mysql_install_db --user=root.................
nohup ./scripts/mysql_install_db --user=root &
sleep 3
echo ..........mysqld_safe --user=root.................
nohup ./bin/mysqld_safe --user=root &
sleep 3
echo ..........mysqladmin -u root password 'root'.................
nohup ./bin/mysqladmin -u root password 'root' &
sleep 3
echo ..........mysqladmin -uroot -proot shutdown.................
nohup ./bin/mysqladmin -uroot -proot shutdown &
sleep 3
echo ..........mysqld_safe.................
nohup ./bin/mysqld_safe --user=root &
sleep 3
echo ...........................
nohup ./bin/mysql -uroot -proot -e "grant all privileges on *.* to root@'%' identified by 'root' with grant option;"
sleep 3
echo ........grant all privileges on *.* to root@'%' identified by 'root' with grant option...............
```
## 5. Hive Script
Hive initialization script.
```bash
vi /root/docker/bigdata/config/init_hive.sh
```
Content:
```bash
#!/bin/bash
cd /usr/local/apache-hive-2.3.2-bin/bin
# Any run after the first one reports an error here; it is harmless, so the restart flow is unaffected
nohup ./schematool -initSchema -dbType mysql &
sleep 3
nohup ./hive --service metastore &
sleep 3
nohup ./hive --service hiveserver2 &
sleep 5
echo Hive has been initialized!
```
# 5. Image Build
## 1. Write the Dockerfile
```bash
rm -rf /root/docker/bigdata/Dockerfile
vi /root/docker/bigdata/Dockerfile
```
Content:
```dockerfile
FROM ubuntu:18.04
MAINTAINER "polaris<450733605@qq.com>"
# Environment variables (Java, Hadoop, Scala, Spark, Zeppelin, MySQL, Hive)
ENV BUILD_ON 2020-11-19
ENV JAVA_HOME /usr/local/jdk1.8.0_221
ENV HADOOP_HOME /usr/local/hadoop-2.7.3
ENV SCALA_HOME /usr/local/scala-2.11.8
ENV SPARK_HOME /usr/local/spark-2.3.0-bin-hadoop2.7
ENV ZEPPELIN_HOME /usr/local/zeppelin-0.8.2-bin-all
ENV MYSQL_HOME /usr/local/mysql
ENV HIVE_HOME /usr/local/apache-hive-2.3.2-bin
ENV AZKABAN_HOME /usr/local/azkaban-solo-server
# Add the environment variables to the system PATH
ENV PATH $HIVE_HOME/bin:$MYSQL_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$ZEPPELIN_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$AZKABAN_HOME/bin:$PATH
# Update the APT sources
ADD sources.list /etc/apt/sources.list
# Copy the local config files into the image
COPY config /tmp
# System update
RUN mkdir -p ~/.pip/ && \
mv /tmp/pip.conf ~/.pip/pip.conf && \
apt-get update && \
apt-get install -y netcat-traditional vim wget net-tools iputils-ping openssh-server libaio-dev apt-utils && \
apt-get clean all
# Python environment installation (optional, commented out)
#RUN apt-get -qqy install python-pip && \
# pip install pandas numpy matplotlib sklearn seaborn scipy tensorflow gensim
# Add JDK, Hadoop, Scala, Spark, Zeppelin, MySQL, Hive, mysql-connector-java-*-bin.jar, Azkaban
ADD ./software/jdk-8u221-linux-x64.tar.gz /usr/local/
ADD ./software/hadoop-2.7.3.tar.gz /usr/local/
ADD ./software/scala-2.11.8.tgz /usr/local/
ADD ./software/spark-2.3.0-bin-hadoop2.7.tgz /usr/local/
ADD ./software/zeppelin-0.8.2-bin-all.tgz /usr/local/
ADD ./software/mysql-5.5.45-linux2.6-x86_64.tar.gz /usr/local/
ADD ./software/apache-hive-2.3.2-bin.tar.gz /usr/local/
ADD ./software/mysql-connector-java-5.1.37.jar /usr/local/apache-hive-2.3.2-bin/lib/
ADD ./software/azkaban-3.90.0-solo-server.fix.build.tar /usr/local/
ADD ./software/derby.jar /usr/local/azkaban-solo-server/lib/
# Hadoop & Hive & Spark & MySQL integration
RUN echo "HADOOP_HOME=/usr/local/hadoop-2.7.3" | cat >> /usr/local/apache-hive-2.3.2-bin/conf/hive-env.sh && \
mv /usr/local/mysql-5.5.45-linux2.6-x86_64 /usr/local/mysql && \
cp /usr/local/apache-hive-2.3.2-bin/lib/mysql-connector-java-5.1.37.jar /usr/local/spark-2.3.0-bin-hadoop2.7/jars
# Generate the SSH key pair
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 600 ~/.ssh/authorized_keys
# Move the config files into place
RUN mv /tmp/ssh_config ~/.ssh/config && \
mv /tmp/profile /etc/profile && \
cp /tmp/slaves $SPARK_HOME/conf/ && \
mv /tmp/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf && \
mv /tmp/spark-env.sh $SPARK_HOME/conf/spark-env.sh && \
mv /tmp/zeppelin-env.sh $ZEPPELIN_HOME/conf/zeppelin-env.sh && \
mv /tmp/zeppelin-site.xml $ZEPPELIN_HOME/conf/zeppelin-site.xml && \
cp /tmp/hive-site.xml $SPARK_HOME/conf/hive-site.xml && \
mv /tmp/hive-site.xml $HIVE_HOME/conf/hive-site.xml && \
mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
mv /tmp/master $HADOOP_HOME/etc/hadoop/master && \
mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
mkdir -p /usr/local/hadoop2.7/dfs/data && \
mkdir -p /usr/local/hadoop2.7/dfs/name && \
mv /tmp/init_mysql.sh ~/init_mysql.sh && chmod 700 ~/init_mysql.sh && \
mv /tmp/init_hive.sh ~/init_hive.sh && chmod 700 ~/init_hive.sh && \
chmod 700 $AZKABAN_HOME/bin/*.sh && \
mv /tmp/restart-hadoop.sh ~/restart-hadoop.sh && chmod 700 ~/restart-hadoop.sh
# Set the working directory
WORKDIR /root
# Expose ports
EXPOSE 8443 8081 5005
# 1. Create the directories Zeppelin needs (configured in zeppelin-env.sh)
# 2. Start the sshd service
# 3. Set permissions on the Hadoop start script
# 4. Change the root (superuser) password
RUN mkdir /var/log/zeppelin && \
mkdir /var/run/zeppelin && \
mkdir /var/tmp/zeppelin && \
/etc/init.d/ssh start && \
chmod 700 start-hadoop.sh && \
echo "root:123456" | chpasswd
CMD ["/bin/bash"]
## 2. Write the Image Build Script
```bash
vi /root/docker/bigdata/build.sh
```
Content:
```bash
#!/usr/bin/env bash
echo build Bigdata All-in-One images
docker build -t bigdata-all-in-one:v1.0.0 .
```
## 3. Build the Image
```bash
cd /root/docker/bigdata/
chmod a+x build.sh
./build.sh
# Check the image; it is about 4.21 GB
docker images bigdata-all-in-one:v1.0.0
```
# 6. Container Scripts
## 1. Create the Network
All containers communicate over an internal network; here we create a dedicated subnet as a Docker network named bigdata.
```bash
vi /root/docker/bigdata/scripts/build_network.sh
```
Content:
```bash
echo create network
docker network create --subnet=172.16.0.0/16 bigdata
echo create success
docker network ls |grep bigdata
```
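After the containers have been started (section 8), you can confirm which containers sit on this network and with which addresses; a sketch using Docker's built-in template output:
```bash
docker network inspect -f '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}' bigdata
```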
## 2. Container Start Script
```bash
rm -rf /root/docker/bigdata/scripts/start_container.sh
vi /root/docker/bigdata/scripts/start_container.sh
```
Content:
```bash
echo "start hadoop-mysql container ..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.6 --privileged --name mysql --hostname hadoop-mysql --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-hive:172.16.0.5 --add-host hadoop-master:172.16.0.2 --add-host zeppelin:172.16.0.7 bigdata-all-in-one:v1.0.0 /bin/bash
echo "start hadoop-hive container..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.5 --privileged --name hive --hostname hadoop-hive --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-master:172.16.0.2 --add-host zeppelin:172.16.0.7 -v /data/cib/:/data/cib/ -v /usr/local/include/CIB_POC/:/usr/local/include/CIB_POC/ bigdata-all-in-one:v1.0.0 /bin/bash
echo "start hadoop-master container ..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.2 --privileged -p 18032:8032 -p 28080:18080 -p 29888:19888 -p 17077:7077 -p 51070:50070 -p 18888:8888 -p 19000:9000 -p 11100:11000 -p 51030:50030 -p 18050:8050 -p 18081:8081 -p 18900:8900 -p 18088:8088 --name hadoop-master --hostname hadoop-master --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-hive:172.16.0.5 --add-host hadoop-mysql:172.16.0.6 --add-host zeppelin:172.16.0.7 -v /data/cib/:/data/cib/ -v /usr/local/include/CIB_POC/:/usr/local/include/CIB_POC/ bigdata-all-in-one:v1.0.0 /bin/bash
echo "start hadoop-node1 container..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.3 --privileged -p 18042:8042 -p 51010:50010 -p 51020:50020 --name hadoop-node1 --hostname hadoop-node1 --add-host hadoop-hive:172.16.0.5 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-master:172.16.0.2 --add-host hadoop-node2:172.16.0.4 --add-host zeppelin:172.16.0.7 -v /data/cib/:/data/cib/ -v /usr/local/include/CIB_POC/:/usr/local/include/CIB_POC/ bigdata-all-in-one:v1.0.0 /bin/bash
echo "start hadoop-node2 container..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.4 --privileged -p 18043:8042 -p 51011:50011 -p 51021:50021 --name hadoop-node2 --hostname hadoop-node2 --add-host hadoop-master:172.16.0.2 --add-host hadoop-node1:172.16.0.3 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-hive:172.16.0.5 --add-host zeppelin:172.16.0.7 -v /data/cib/:/data/cib/ -v /usr/local/include/CIB_POC/:/usr/local/include/CIB_POC/ bigdata-all-in-one:v1.0.0 /bin/bash
echo "start Zeppeline container..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.7 --privileged -p 38080:18080 -p 38443:18443 --name zeppelin --hostname hadoop-zeppelin --add-host hadoop-master:172.16.0.2 --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-mysql:172.16.0.6 --add-host hadoop-hive:172.16.0.5 bigdata-all-in-one:v1.0.0 /bin/bash
echo "start Azkaban container..."
docker run -itd --restart=always --net bigdata --ip 172.16.0.8 --privileged -p 8081:8081 --name azkaban --hostname hadoop-azkaban --add-host hadoop-master:172.16.0.2 --add-host hadoop-node1:172.16.0.3 --add-host hadoop-node2:172.16.0.4 --add-host hadoop-hive:172.16.0.5 bigdata-all-in-one:v1.0.0 /bin/bash
echo start sshd...
docker exec -it hadoop-master /etc/init.d/ssh start
docker exec -it hadoop-node1 /etc/init.d/ssh start
docker exec -it hadoop-node2 /etc/init.d/ssh start
docker exec -it hive /etc/init.d/ssh start
docker exec -it mysql /etc/init.d/ssh start
docker exec -it zeppelin /etc/init.d/ssh start
docker exec -it azkaban /etc/init.d/ssh start
echo start service...
docker exec -it mysql bash -c "sh ~/init_mysql.sh"
docker exec -it hadoop-master bash -c "sh ~/start-hadoop.sh"
docker exec -it hive bash -c "sh ~/init_hive.sh"
docker exec -it zeppelin bash -c "/usr/local/zeppelin-0.8.2-bin-all/bin/zeppelin-daemon.sh start"
docker exec -itd azkaban bash -c "source /etc/profile && cd /usr/local/azkaban-solo-server/ && ./bin/start-solo.sh && tail -f /dev/null"
echo finished
docker ps |grep -E 'hadoop-master|hadoop-node1|hadoop-node2|hive|mysql|zeppelin|azkaban'
```
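After the script finishes it is worth checking that the services inside the containers actually came up, not just that the containers are running; two quick spot checks (a sketch):
```bash
# HDFS: the report should list 2 live DataNodes
docker exec -it hadoop-master bash -c "hdfs dfsadmin -report | grep -A 1 'Live datanodes'"
# YARN: both NodeManagers should show up as RUNNING
docker exec -it hadoop-master bash -c "yarn node -list"
```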
## 3. Container Restart Script
```bash
rm -rf /root/docker/bigdata/scripts/restart_container.sh
vi /root/docker/bigdata/scripts/restart_container.sh
```
Content:
```bash
echo restart container...
docker restart hadoop-master
docker restart hadoop-node1
docker restart hadoop-node2
docker restart hive
docker restart mysql
docker restart zeppelin
docker restart azkaban
echo restart sshd...
docker exec -it hadoop-master /etc/init.d/ssh restart
docker exec -it hadoop-node1 /etc/init.d/ssh restart
docker exec -it hadoop-node2 /etc/init.d/ssh restart
docker exec -it hive /etc/init.d/ssh restart
docker exec -it mysql /etc/init.d/ssh restart
docker exec -it zeppelin /etc/init.d/ssh restart
docker exec -it azkaban /etc/init.d/ssh restart
echo restart service...
docker exec -it mysql bash -c "sh ~/init_mysql.sh"
docker exec -it hadoop-master bash -c "sh ~/restart-hadoop.sh"
docker exec -it hive bash -c "sh ~/init_hive.sh"
docker exec -it zeppelin bash -c "/usr/local/zeppelin-0.8.2-bin-all/bin/zeppelin-daemon.sh start"
docker exec -itd azkaban bash -c "source /etc/profile && cd /usr/local/azkaban-solo-server/ && ./bin/start-solo.sh && tail -f /dev/null"
echo finished
docker ps |grep -E 'hadoop-master|hadoop-node1|hadoop-node2|hive|mysql|zeppelin|azkaban'
```
## 4. Stop and Remove the Containers
```bash
rm -rf /root/docker/bigdata/scripts/stop_container.sh
vi /root/docker/bigdata/scripts/stop_container.sh
```
Content:
```bash
docker stop hadoop-master
docker stop hadoop-node1
docker stop hadoop-node2
docker stop hive
docker stop mysql
docker stop zeppelin
docker stop azkaban
echo stop containers
docker rm hadoop-master
docker rm hadoop-node1
docker rm hadoop-node2
docker rm hive
docker rm mysql
docker rm zeppelin
docker rm azkaban
echo rm containers
docker ps -a |grep -E 'hadoop-master|hadoop-node1|hadoop-node2|hive|mysql|zeppelin|azkaban'
```
# 7. Livy Installation (Optional)
```bash
docker exec -it hadoop-master /bin/bash
cd /opt
```
## 1. Install
### 1. Build from Source (Option 1)
```bash
git clone https://github.com/cloudera/livy.git
cd livy
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m -XX:MaxPermSize=512M"
mvn package
mvn -Dmaven.test.skip clean package
```
### 2. Install a Prebuilt Release (Option 2)
```bash
wget https://mirrors.bfsu.edu.cn/apache/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip
unzip apache-livy-0.7.0-incubating-bin.zip
mv apache-livy-0.7.0-incubating-bin livy
```
## 2. Configuration
- Environment variables
```bash
vi /opt/livy/conf/livy-env.sh
```
Content:
```bash
export SPARK_HOME=/usr/local/spark-2.3.0-bin-hadoop2.7
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop/
export JAVA_HOME=/usr/local/jdk1.8.0_221
```
- Livy configuration
```bash
vi /opt/livy/conf/livy.conf
```
Content:
```
livy.server.port = 8998
livy.spark.master = yarn
livy.spark.deploy-mode = client
livy.server.session.timeout = 1h
livy.impersonation.enabled = true
```
## 3. Start
```bash
cd /opt/livy
# Run in the foreground
./bin/livy-server
# Or manage it as a background service
./bin/livy-server start
./bin/livy-server stop
./bin/livy-server status
```
## 4. Verify
- **Interactive sessions**
```bash
# List the sessions
curl 172.16.0.2:8998/sessions
# Create a session
curl -X POST --data '{"kind": "spark"}' -H "Content-Type:application/json" 172.16.0.2:8998/sessions | python -m json.tool
# Check statement results
curl 172.16.0.2:8998/sessions/0/statements/{id}
curl 172.16.0.2:8998/sessions/0/statements
# Delete a session
curl 172.16.0.2:8998/sessions/0 -X DELETE
curl 172.16.0.2:8998/sessions/0 -X DELETE -H 'Content-Type: application/json' | python -m json.tool
# Run a snippet of code in a given session
curl 172.16.0.2:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sc.parallelize(1 to 2).count()"}' | python -m json.tool
```
- **Batch sessions**
```bash
# Upload the jar (run inside the hadoop-master container)
hdfs dfs -mkdir -p /tmp/jars
hdfs dfs -put /usr/local/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar /tmp/jars/
# Submit the jar
curl -X POST -H "Content-Type: application/json" -d '{"file":"hdfs:///tmp/jars/spark-examples_2.11-2.3.0.jar","className":"org.apache.spark.examples.SparkPi" , "args": ["10"]}' 172.16.0.2:8998/batches | python -m json.tool
# List the batch jobs
curl 172.16.0.2:8998/batches
curl 172.16.0.2:8998/batches/0 | python -m json.tool
# View the job log
curl 172.16.0.2:8998/batches/0/log | python -m json.tool
```
# 8. Run and Test
## 1. Create the Subnet
```bash
cd /root/docker/bigdata/scripts/
chmod a+x build_network.sh
./build_network.sh
docker network ls |grep bigdata
```

## 2. Start the Containers
```bash
cd /root/docker/bigdata/scripts/
# Start
chmod a+x start_container.sh
./start_container.sh
# Restart
chmod a+x restart_container.sh
./restart_container.sh
# Stop
chmod a+x stop_container.sh
./stop_container.sh
# List the containers
sudo docker ps -a |grep -E 'hadoop-*|hive|mysql|zeppelin|azkaban'
```

## 3. Verify the Master Node
```bash
docker exec -it hadoop-master /bin/bash
# Check the running JVM processes
jps
```

## 4. Verify the Worker Nodes
```bash
# Verify SSH: from inside the master container, hop into a cluster worker node
ssh hadoop-node1
# Check the running JVM processes
jps
```

## 5. Web UI
- NameNode WebUI:http://LTSR003:51070
- ResourceManager WebUI:http://LTSR003:18088
- Spark Web UI:http://LTSR003:18888
- Zeppelin Web UI:http://LTSR003:38080
- Azkaban Web UI:http://LTSR003:8081
## 6. MySQL Test
### 1. Prepare the MySQL Node
To make testing easier, add some data on the MySQL node.
```bash
# Enter the master node
docker exec -it hadoop-master /bin/bash
# Connect to MySQL remotely over TCP (-h: <ip>/<hostname>)
mysql -uroot -h mysql -proot
# Or enter the database node directly
ssh hadoop-mysql
# Check the MySQL port
netstat -tunp | grep 3306
mysql -uroot -proot
```
Create the database and table:
```sql
-- Create the database
create database zeppelin_test;

-- Switch to it
use zeppelin_test;

-- Create the table (auto-increment primary key)
create table user_info(id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,name VARCHAR(16),age INT);

-- Insert some sample data
insert into user_info(name,age) values("aaa",10);
insert into user_info(name,age) values("bbb",20);
insert into user_info(name,age) values("ccc",30);

-- Query the data
select * from user_info;
```
<a name="5OmpK"></a>
### 2. Zeppelin配置
- **上传MySQL驱动**
```bash
# Copy the dependency
docker cp /root/docker/bigdata/software/mysql-connector-java-5.1.37.jar zeppelin:/usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
# Restart Zeppelin
docker exec -it zeppelin bash
/usr/local/zeppelin-0.8.2-bin-all/bin/zeppelin-daemon.sh restart
```

- **Configure the JDBC Interpreter**
```
Interpreter Name : jdbc (the default one; edit it in place, no need to create a new interpreter)
Interpreter group: jdbc

default.driver   ====> com.mysql.jdbc.Driver
default.url      ====> jdbc:mysql://hadoop-mysql:3306/zeppelin_test
default.user     ====> root
default.password ====> root

Dependencies
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/mysql-connector-java-5.1.37.jar
```
<a name="bJBM2"></a>
### 3. 测试mysql查询
```sql
%jdbc
select * from zeppelin_test.user_info;
```
## 7. Hive Test
### 1. Zeppelin Configuration
- **Copy the dependencies**
```bash
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/hive-jdbc-2.3.2.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/hive-service-rpc-2.3.2.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/hive-cli-2.3.2.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/hive-service-2.3.2.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/hive-common-2.3.2.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/hive-serde-2.3.2.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
cp -rf /usr/local/apache-hive-2.3.2-bin/lib/guava-14.0.1.jar /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/
```
- **Restart Zeppelin**
```bash
/usr/local/zeppelin-0.8.2-bin-all/bin/zeppelin-daemon.sh restart
```
- **Configure the JDBC Interpreter**
Add a Hive interpreter based on the jdbc group and set the following:
```
Interpreter Name : hive (newly created)
Interpreter group: jdbc

default.driver   ====> org.apache.hive.jdbc.HiveDriver
default.url      ====> jdbc:hive2://hadoop-hive:10000
default.user     ====> root
default.password ====> root

Dependencies
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/hive-jdbc-2.3.2.jar
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/hive-service-rpc-2.3.2.jar
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/hive-cli-2.3.2.jar
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/hive-service-2.3.2.jar
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/hive-common-2.3.2.jar
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/hive-serde-2.3.2.jar
artifact ====> /usr/local/zeppelin-0.8.2-bin-all/interpreter/jdbc/guava-14.0.1.jar
```
_**Note: the required dependencies differ between Hive versions. For example, Hive 2.3.4 only needs a single dependency, "hive-jdbc-2.3.4-standalone.jar".**_
<a name="KadbY"></a>
### 2. 测试
- **创建数据库**
```sql
%hive
CREATE SCHEMA user_hive
```
- **Switch to the database**
```sql
%hive use user_hive
```
- **Create a table**
```sql
%hive create table if not exists user_hive.employee(id int ,name string ,age int)
```
- **Insert data**
```sql
%hive insert into user_hive.employee(id,name,age) values(1,"aaa",10)
```
- **Query the table**
```sql
%hive select * from user_hive.employee
```
- **Check the metadata**
The metadata of the newly created database can be seen in the "hive.DBS" table in MySQL.
```sql
%jdbc
select * from hive.DBS;
```
- Web UI
- HDFS
```bash
# Works from both the master and the worker nodes
docker exec -it hadoop-master /bin/bash
docker exec -it hadoop-node1 /bin/bash
hdfs dfs -ls /home/hive/warehouse/user_hive.db/employee
docker exec -it hadoop-master hdfs dfs -cat /home/hive/warehouse/user_hive.db/employee/*
```
## 8. Spark Test
Open the Zeppelin UI and create a new notebook that uses Spark as the default interpreter (Spark Interpreter).
Example 1: insert rows into the table
Write two rows into the user_hive.employee table.
```scala
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val employeeRDD = sc.parallelize(Array("6 rc 26","7 gh 27")).map(_.split(" "))
val schema = StructType(List(StructField("id", IntegerType, true),StructField("name", StringType, true),StructField("age", IntegerType, true)))
val rowRDD = employeeRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))
val employeeDataFrame = hiveCtx.createDataFrame(rowRDD, schema)
employeeDataFrame.registerTempTable("tempTable")

hiveCtx.sql("insert into user_hive.employee select * from tempTable")
```
Check the data (Hive Interpreter):
```sql
%hive
select * from user_hive.employee
```
Example 2: load a CSV file
```scala
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
val bankText = sc.parallelize(
IOUtils.toString(
new URL("http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/bank.csv"),
Charset.forName("utf8")).split("\n"))
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
s => Bank(s(0).toInt,
s(1).replaceAll("\"", ""),
s(2).replaceAll("\"", ""),
s(3).replaceAll("\"", ""),
s(5).replaceAll("\"", "").toInt
)
).toDF()
bank.registerTempTable("bank")
```
Check the data (Spark SQL Interpreter):
```sql
%sql
select age,count(1) value
from bank
where age < ${maxAge=30}
group by age
order by age
```
# 9. Backup
```bash
# Export the image (image: 4.15 GB, compressed tarball: 3.9 GB)
docker save -o docker-bigdata-all-in-one-1.0.0.save.tar bigdata-all-in-one:v1.0.0
docker save > docker-bigdata-all-in-one-1.0.0.save.tar bigdata-all-in-one:v1.0.0
# Load the image
docker load -i docker-bigdata-all-in-one-1.0.0.save.tar
docker load < docker-bigdata-all-in-one-1.0.0.save.tar
# To run the containers, see section "8. Run and Test"
```
# References
- CSDN: 基于Docker的Spark-Hadoop分布式集群之一: 环境搭建 (https://blog.csdn.net/wangxw1803/article/details/90481363)
- 博客园: 基于Docker的Spark-Hadoop分布式集群之一: 环境搭建 (https://www.cnblogs.com/Fordestiny/p/9401161.html)
- 博客园: 基于Docker的Spark-Hadoop分布式集群之二: 环境测试 (https://www.cnblogs.com/Fordestiny/p/9487303.html)
- CSDN: 基于Docker的Spark环境搭建理论部分 (https://blog.csdn.net/zhaohaibo_/article/details/83663047)