HBase
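HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system.
Data model
Namespace
A namespace is comparable to a database in a relational database system; each namespace contains many tables. HBase ships with two namespaces, default and hbase: hbase holds HBase's built-in tables, and default is the namespace users work in by default.
Table
Similar to a table in a relational database. When defining a table in HBase, only the column families need to be declared, not the individual columns. A table is split horizontally by RowKey range into Regions.
Row
Each row consists of one RowKey and multiple columns. Rows are stored sorted by the lexicographic order of their RowKeys, and data can be looked up efficiently only by RowKey.
Column
Each column is identified by a column family and a column qualifier, for example info:name and info:age. Only the column families are specified when the table is created.
Timestamp
Identifies the different versions of a piece of data.
Cell
The unit uniquely determined by {RowKey, column family:column qualifier, timestamp}. The data in a cell is untyped and is stored as a byte array.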
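Because rows are ordered by the byte-wise (lexicographic) order of their RowKeys, keys that look numeric do not sort numerically. The following shell sketch only imitates that ordering with `sort` in the C locale; it is not HBase code, and the row keys are made-up examples.

```shell
# Imitate HBase's byte-wise RowKey ordering with sort in the C locale.
# The row keys below are hypothetical examples.
unpadded=$(printf '%s\n' row1 row2 row10 | LC_ALL=C sort | tr '\n' ' ')
echo "unpadded: $unpadded"

# Zero-padding the numeric part makes byte order agree with numeric order.
padded=$(for n in 1 2 10; do printf 'row%03d\n' "$n"; done | LC_ALL=C sort | tr '\n' ' ')
echo "padded:   $padded"
```

With unpadded keys, row10 lands between row1 and row2; with zero-padded keys the order is row001, row002, row010. This is one reason RowKey design usually pads or otherwise fixes the width of numeric components.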
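HBase system architecture
HBase storage mechanism
HBase is a column-oriented database in which rows are kept sorted within a table. The table schema defines only column families, and data is stored as key-value pairs. A table has multiple column families, and a column family has multiple columns:
- A table is a collection of rows
- A row is a collection of column families
- A column family is a collection of columns
- A column is a collection of key-value pairs
HBase system architecture diagram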
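HBase is a distributed storage system made up of HMaster and HRegionServer processes.
Client: uses the HBase RPC mechanism to communicate with HMaster and HRegionServer. The client talks to HMaster for administrative operations and to HRegionServer for data reads and writes.
HMaster: several HMaster nodes can be started; ZooKeeper's master election mechanism ensures that one active master is always running.
HMaster is mainly responsible for managing Tables and Regions:
- Manages users' create, delete, alter, and query operations on tables (a data update is a put that writes a new version, and is served by an HRegionServer)
- Manages load balancing across HRegionServers and adjusts the Region distribution
- Assigns the new Regions after a Region split
- Migrates the Regions of a failed HRegionServer after it goes down
ZooKeeper: the ZooKeeper cluster stores the address of the -ROOT- table (replaced by the location of the hbase:meta table since HBase 0.96) and the address of the HMaster. Each HRegionServer registers itself in ZooKeeper as an ephemeral node, so HMaster can detect the health of every HRegionServer at any time.
HRegionServer: the core module of HBase, mainly responsible for serving user I/O requests and reading and writing data in the HDFS file system.
As the architecture diagram shows, an HRegionServer manages many HRegion objects. Clients access data in HBase without involving the master; the master only maintains the metadata of tables and Regions.
Each HRegion corresponds to one Region of a Table and consists of multiple HStores.
An HRegion has one Store per column family of its table. An HRegionServer hosts multiple HRegions and one HLog.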
HRegion:
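Common HBase commands
- Enter the shell: hbase shell
- Exit: exit
- Show HBase status: status
- Create a table: create 'table_name','family1','family2','familyN'
- List all tables: list
- Describe a table: describe 'table_name'
- Check whether a table exists: exists 'table_name'
- Check whether a table is enabled or disabled: is_enabled 'table_name' / is_disabled 'table_name'
- Add a record: put 'table_name','rowkey','family:column','value'
- Show all data under a rowkey: get 'table_name','rowkey'
- Show all records: scan 'table_name'
- Count the records in a table: count 'table_name'
- Get one column of a column family: get 'table_name','rowkey','family:column'
- Delete a cell: delete 'table_name','rowkey','family:column'
- Delete an entire row: deleteall 'table_name','rowkey'
- Drop a table (disable it first, then drop it): disable 'table_name' followed by drop 'table_name'
- Empty a table: truncate 'table_name'
- Show all data in one column of a table: scan 'table_name',{COLUMNS=>'family:column'}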
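The commands above can be strung together into one short session. The sketch below writes such a session to a temporary file and, only if the `hbase` launcher is found on the PATH, feeds it to `hbase shell -n` (non-interactive mode). The table name `student`, the column family `info`, and all values are hypothetical.

```shell
# Write a sample HBase shell session to a temporary file.
# Table 'student', family 'info', and all values are made-up examples.
script=$(mktemp)
cat > "$script" <<'EOF'
create 'student', 'info'
put 'student', '1001', 'info:name', 'Tom'
put 'student', '1001', 'info:age', '18'
get 'student', '1001'
scan 'student', {COLUMNS => 'info:name'}
count 'student'
deleteall 'student', '1001'
disable 'student'
drop 'student'
EOF

# Run the session only when an HBase installation is available on this machine.
if command -v hbase >/dev/null 2>&1; then
  hbase shell -n < "$script"
else
  echo "hbase not found; commands were written to $script"
fi
```

The session drops the table at the end, so it leaves nothing behind on a test cluster.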
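HBase cluster installation
Prerequisites
- zookeeper-3.4.14.tar.gz package
- hbase-2.2.1-bin.tar.gz package
- Hadoop-3.1.2.tar.gz package
- 3 virtual machines
Install Hadoop
See Hadoop.md for the steps to deploy a distributed Hadoop cluster.
Install ZooKeeper
See zookeeper.md, which provides the steps to deploy a distributed ZooKeeper cluster.
Install HBase
Upload hbase-2.2.1-bin.tar.gz to each of the virtual machines hadoop01, hadoop02, and hadoop03.
Extract the HBase package (in /opt, the location that HBASE_HOME points to)
tar -zxvf hbase-2.2.1-bin.tar.gz
- Configure environment variables (on all nodes)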
HBASE_HOME=/opt/hbase-2.2.1
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/bin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin
export PATH CLASSPATH JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME HBASE_HOME
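After reloading the profile, it may help to confirm that each `*_HOME` variable points at a real directory before going further. A minimal check, assuming a POSIX shell; the function name `check_home` is hypothetical.

```shell
# Report whether a variable holds the path of an existing directory.
# Usage: check_home NAME VALUE
check_home() {
  if [ -n "$2" ] && [ -d "$2" ]; then
    echo "$1 ok: $2"
  else
    echo "$1 is unset or not a directory: '$2'"
    return 1
  fi
}

# Check the variables exported above; a failed check only prints a message.
check_home JAVA_HOME      "${JAVA_HOME:-}"      || true
check_home HADOOP_HOME    "${HADOOP_HOME:-}"    || true
check_home ZOOKEEPER_HOME "${ZOOKEEPER_HOME:-}" || true
check_home HBASE_HOME     "${HBASE_HOME:-}"     || true
```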
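- Set up cluster time synchronization
```shell
yum -y install ntp ntpdate    # install the NTP daemon and the ntpdate sync tool
sudo systemctl start ntpd     # start the time synchronization service
sudo systemctl enable ntpd    # start the service automatically at boot
```
Use hadoop01 as the time server; the other nodes synchronize their time with hadoop01.
Edit /etc/ntp.conf on hadoop01 and add:
server 127.127.1.0 # use the local clock as the time source (127.127.1.0 is ntpd's local clock driver address)
restrict 192.168.0.0 mask 255.255.0.0 nomodify notrap # allow hosts in this network to synchronize
Edit /etc/ntp.conf on the other nodes and add:
server 192.168.127.128 # IP address of hadoop01
Enable time synchronization on all nodes, then check the system time:
sudo timedatectl set-ntp yes
timedatectl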
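Time synchronization matters because the HBase master rejects a RegionServer whose clock differs from its own by more than hbase.master.maxclockskew (30000 ms by default; the hbase-site.xml in this guide sets 180000). The sketch below shows one way to check a skew against such a limit; the function name `skew_within` is hypothetical, and the commented ssh usage assumes passwordless ssh between the nodes.

```shell
# Succeed if two epoch timestamps (in seconds) differ by at most MAX_MS milliseconds.
# Usage: skew_within LOCAL_SECONDS REMOTE_SECONDS MAX_MS
skew_within() {
  diff_ms=$(( ($1 - $2) * 1000 ))
  if [ "$diff_ms" -lt 0 ]; then
    diff_ms=$(( -diff_ms ))
  fi
  [ "$diff_ms" -le "$3" ]
}

# Example on a real cluster (not run here):
# local_s=$(date +%s); remote_s=$(ssh hadoop02 date +%s)
# skew_within "$local_s" "$remote_s" 30000 && echo "clock skew OK"
```

Second-level resolution from `date +%s` is coarse, but it is enough to catch the multi-second drift that the limit is meant to guard against.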
- Edit the HBase configuration files
- Edit the hbase-env.sh file
```shell
#!/usr/bin/env bash
#
#/**
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.8+ required.
# export JAVA_HOME=/usr/java/jdk1.8.0/
export JAVA_HOME=/usr/java/jdk1.8.0_192-amd64
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
export HBASE_CLASSPATH=/opt/hadoop-3.1.2/etc/hadoop
# The maximum amount of heap to use. Default is left to JVM default.
# export HBASE_HEAPSIZE=1G
# Uncomment below if you intend to use off heap cache. For example, to allocate 8G of
# offheap, set the value to "8G".
# export HBASE_OFFHEAPSIZE=1G
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://hbase.apache.org/book.html#performance
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"
# Uncomment one of the below three options to enable java garbage collection logging for the server-side processes.
# This enables basic gc logging to the .out file.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"
# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"
# Uncomment one of the below three options to enable java garbage collection logging for the client processes.
# This enables basic gc logging to the .out file.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"
# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"
# See the package documentation for org.apache.hadoop.hbase.io.hfile for other configurations
# needed setting up off-heap block caching.
# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
# NOTE: HBase provides an alternative JMX implementation to fix the random ports issue, please see JMX
# section in HBase Reference Guide for instructions.
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
# export HBASE_REST_OPTS="$HBASE_REST_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10105"
# File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
# Uncomment and adjust to keep all the Region Server pages mapped to be memory resident
#HBASE_REGIONSERVER_MLOCK=true
#HBASE_REGIONSERVER_UID="hbase"
# File naming hosts on which backup HMaster will run. $HBASE_HOME/conf/backup-masters by default.
# export HBASE_BACKUP_MASTERS=${HBASE_HOME}/conf/backup-masters
# Extra ssh options. Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"
# Where log files are stored. $HBASE_HOME/logs by default.
# export HBASE_LOG_DIR=${HBASE_HOME}/logs
# Enable remote JDWP debugging of major HBase processes. Meant for Core Developers
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8070"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8071"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8072"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8073"
# export HBASE_REST_OPTS="$HBASE_REST_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8074"
# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HBASE_NICENESS=10
# The directory where pid files are stored. /tmp by default.
export HBASE_PID_DIR=/opt/hbase-2.2.1/pids
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1
# Tell HBase whether it should manage it's own instance of ZooKeeper or not.
# Set to false to use your own external ZooKeeper ensemble; set to true to let HBase manage its own ZooKeeper instance.
# export HBASE_MANAGES_ZK=true
export HBASE_MANAGES_ZK=false
# The default log rolling policy is RFA, where the log file is rolled as per the size defined for the
# RFA appender. Please refer to the log4j.properties file to see more details on this appender.
# In case one needs to do log rolling on a date change, one should set the environment property
# HBASE_ROOT_LOGGER to "<DESIRED_LOG LEVEL>,DRFA".
# For example:
# HBASE_ROOT_LOGGER=INFO,DRFA
# The reason for changing default to RFA is to avoid the boundary case of filling out disk space as
# DRFA doesn't put any cap on the log size. Please refer to HBase-5655 for more context.
# Tell HBase whether it should include Hadoop's lib when start up,
# the default value is false,means that includes Hadoop's lib.
# export HBASE_DISABLE_HADOOP_CLASSPATH_LOOKUP="true"
```
- Edit the hbase-site.xml file
<configuration>
<!-- Host and port of the HBase master -->
<property>
<name>hbase.master</name>
<value>hadoop01:60000</value>
</property>
<!--<property>
<name>hbase.master.info.port</name>
<value>60010</value>
</property>
-->
<!-- Maximum clock skew allowed between nodes, in milliseconds -->
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
</property>
<!-- HBase root directory on shared storage, where HBase data is persisted -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop01:9000/hbase</value>
</property>
<!-- Whether to run in distributed mode; false means standalone -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!-- ZooKeeper quorum hosts -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop01,hadoop02,hadoop03</value>
</property>
<!-- Directory where ZooKeeper stores its snapshot data -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hbase/tmp/zookeeper</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
</configuration>
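To double-check which value a property actually has in the file, the following sketch reads it back with awk. It assumes the one-tag-per-line layout shown above and skips commented-out blocks; it is not a general XML parser, and the function name `get_hbase_prop` is hypothetical.

```shell
# Print the <value> of a property from an hbase-site.xml laid out one tag per line.
# Usage: get_hbase_prop FILE PROPERTY_NAME
get_hbase_prop() {
  awk -v want="$2" '
    /<!--/     { in_comment = 1 }                      # a comment starts on this line
    in_comment { if ($0 ~ /-->/) in_comment = 0; next } # skip everything inside comments
    /<name>/   { gsub(/.*<name>|<\/name>.*/, ""); name = $0; next }
    /<value>/  { gsub(/.*<value>|<\/value>.*/, ""); if (name == want) print $0 }
  ' "$1"
}

# Read the ZooKeeper quorum from this guide's install path, if the file exists.
conf=/opt/hbase-2.2.1/conf/hbase-site.xml
if [ -f "$conf" ]; then
  get_hbase_prop "$conf" hbase.zookeeper.quorum
fi
```

For a property inside a commented-out block, such as hbase.master.info.port above, the function prints nothing.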
- Edit the regionservers file
This file lists the HBase slave (RegionServer) nodes, one hostname per line:
hadoop02
hadoop03
- Copy Hadoop's two configuration files, core-site.xml and hdfs-site.xml, into HBase's conf directory
cp /opt/hadoop-3.1.2/etc/hadoop/core-site.xml /opt/hbase-2.2.1/conf
cp /opt/hadoop-3.1.2/etc/hadoop/hdfs-site.xml /opt/hbase-2.2.1/conf
- Distribute /opt/hbase-2.2.1/conf/* from hadoop01 to the hadoop02 and hadoop03 nodes
scp /opt/hbase-2.2.1/conf/* hadoop02:/opt/hbase-2.2.1/conf/
scp /opt/hbase-2.2.1/conf/* hadoop03:/opt/hbase-2.2.1/conf/
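With more slave nodes, the two scp commands above generalize to a loop. The sketch below defaults to a dry run that only prints the commands; the `DRY_RUN` variable and the function name `distribute_conf` are hypothetical, while the hosts and paths are the ones used in this guide.

```shell
# Print (dry run) or execute the scp commands that push the conf directory to each host.
# Set DRY_RUN=0 to actually copy.
DRY_RUN=${DRY_RUN:-1}
CONF_DIR=/opt/hbase-2.2.1/conf

distribute_conf() {
  for host in "$@"; do
    if [ "$DRY_RUN" = 1 ]; then
      echo "scp $CONF_DIR/* $host:$CONF_DIR/"
    else
      scp "$CONF_DIR"/* "$host:$CONF_DIR/"
    fi
  done
}

distribute_conf hadoop02 hadoop03
```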
- Start and stop HBase
start-hbase.sh
stop-hbase.sh
- Check the running HBase services
# Of the HBase processes, only HMaster runs on hadoop01 (the master node); hadoop02 and hadoop03, as slave nodes, run only HRegionServer
jps
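The jps check can be scripted so that each node confirms its expected role. A small sketch; the function name `has_process` is hypothetical, and the sample output (including the PIDs) is made up to illustrate a master node that also runs ZooKeeper and the NameNode.

```shell
# Succeed if the given jps output lists a process with exactly the given name.
# Usage: has_process "JPS_OUTPUT" NAME
has_process() {
  printf '%s\n' "$1" | awk -v name="$2" '$2 == name { found = 1 } END { exit !found }'
}

# Hypothetical jps output from the master node (PIDs are made up).
sample='2101 QuorumPeerMain
2350 NameNode
3021 HMaster
3342 Jps'

has_process "$sample" HMaster && echo "HMaster is running"
has_process "$sample" HRegionServer || echo "HRegionServer is not running here"
```

On a real node, pass the live output instead of the sample, for example `has_process "$(jps)" HMaster`.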
- Enter the HBase shell
hbase shell