HBase

  1. HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system.

Data Model

  1. Namespace
    A namespace is the equivalent of a database in a relational DBMS; each namespace holds many tables. HBase ships with two namespaces, default and hbase: the hbase namespace contains HBase's internal system tables, while default is the namespace user tables go into when none is specified.

  2. Table
    Similar to a table in a relational database, except that when defining an HBase table you only declare its column families, not the individual columns.

  3. Row
    Each row of data consists of a RowKey and one or more columns (Column). Rows are stored sorted by the lexicographic order of the RowKey, and lookups can only be performed by RowKey.

  4. Column
    Every column is identified by a column family (Column Family) plus a column qualifier (Column Qualifier), e.g. info:name, info:age. Only the column families need to be specified at table-creation time.

  5. Timestamp
    Identifies different versions of the same piece of data.

  6. Cell
    The unit uniquely addressed by {rowkey, column family:column qualifier, timestamp}. The data in a cell has no type; it is stored as raw bytes.

HBase System Architecture

  1. HBase's storage model
    HBase is a column-oriented database. Rows in a table are kept sorted; the table schema defines only column families, and data is stored as key-value pairs. A table has multiple column families, and each column family has multiple columns:

    • a table is a collection of rows
    • a row is a collection of column families
    • a column family is a collection of columns
    • a column is a collection of key-value pairs
  2. HBase architecture diagram

[HBase - Figure 1: overall system architecture]

HBase is a distributed storage system built around an HMaster and multiple HRegionServers:

  • Client: communicates with HMaster and HRegionServer via the HBase RPC mechanism; it talks to HMaster for administrative operations and to HRegionServer for data reads and writes.

  • HMaster: a cluster can run several HBase Master processes; ZooKeeper's master-election mechanism guarantees that exactly one master node is active at any time.
    HMaster is mainly responsible for managing Tables and Regions:

    1. handling users' create/delete/modify/query operations on tables (in HBase a modify is simply a put that adds a new version of the data)
    2. load-balancing the HRegionServers by adjusting the Region distribution
    3. assigning the new Regions produced by a Region split
    4. migrating the Regions of a failed HRegionServer after it goes down
  • ZooKeeper: the ZooKeeper ensemble stores the address of the -ROOT- table and of the HMaster. Each HRegionServer registers itself in ZooKeeper as an ephemeral node, so the HMaster can track the health of every HRegionServer at all times.

  • HRegionServer: the core module of HBase. It serves user I/O requests and reads and writes data against the HDFS file system.

[HBase - Figure 2: HRegionServer internals]

    As the figure shows, an HRegionServer manages many HRegion objects:

    • clients do not need the master to access data in HBase; the master only maintains the metadata for tables and regions

    • each HRegion corresponds to one Region of a Table, and an HRegion is composed of multiple HStores

    • an HRegion has as many Stores as its table has column families; an HRegionServer hosts multiple HRegions but only one HLog

  • HRegion: the basic unit of distribution and load balancing; a table is split into HRegions by RowKey range, and each HRegion is served by exactly one HRegionServer.
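The containment rules above can be summarized numerically (a trivial sketch, not HBase code):

```python
# One HStore per column family per HRegion; one HLog per HRegionServer.
def hstore_count(num_regions, num_column_families):
    return num_regions * num_column_families

# A table with 2 column families split into 3 regions is backed by 6 HStores.
print(hstore_count(3, 2))  # 6
```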

Common HBase Commands

  1. Enter the shell: hbase shell
  2. Exit the shell: exit
  3. Check cluster status: status
  4. Create a table: create 'table_name','cf1','cf2','cfN'
  5. List all tables: list
  6. Describe a table: describe 'table_name'
  7. Check whether a table exists: exists 'table_name'
  8. Check whether a table is enabled/disabled: is_enabled 'table_name' / is_disabled 'table_name'
  9. Insert a record: put 'table_name','rowkey','cf:column','value'
  10. Get all data under a rowkey: get 'table_name','rowkey'
  11. Scan all records: scan 'table_name'
  12. Count the rows in a table: count 'table_name'
  13. Get one column of one row: get 'table_name','rowkey','cf:column'
  14. Delete one cell: delete 'table_name','rowkey','cf:column'
  15. Delete a whole row: deleteall 'table_name','rowkey'
  16. Drop a table (it must be disabled first): disable 'table_name' then drop 'table_name'
  17. Truncate a table: truncate 'table_name'
  18. Scan one column across the whole table: scan 'table_name',{COLUMNS=>'cf:column'}
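To make the semantics of put/get/scan/count/deleteall concrete, here is a toy in-memory imitation in Python (illustrative only; the real commands run inside `hbase shell` against a live cluster):

```python
table = {}   # rowkey -> {"family:qualifier": value}

def put(rowkey, column, value):
    table.setdefault(rowkey, {})[column] = value   # put also serves as update

def get(rowkey):
    return table.get(rowkey, {})

def scan():
    # scan returns rows in RowKey order, mirroring HBase's sorted storage
    return [(rk, table[rk]) for rk in sorted(table)]

def count():
    return len(table)

def deleteall(rowkey):
    table.pop(rowkey, None)

put("1001", "info:name", "alice")
put("1002", "info:name", "bob")
put("1001", "info:age", "20")

print(get("1001"))                # {'info:name': 'alice', 'info:age': '20'}
print(count())                    # 2
deleteall("1002")
print([rk for rk, _ in scan()])   # ['1001']
```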

HBase Cluster Installation

  1. Prerequisites

    • zookeeper-3.4.14.tar.gz installation package
    • hbase-2.2.1-bin.tar.gz installation package
    • hadoop-3.1.2.tar.gz installation package
    • 3 virtual machines
  2. Install Hadoop
    See Hadoop.md for the Hadoop distributed-cluster deployment steps.

  3. Install ZooKeeper
    See zookeeper.md for the ZooKeeper distributed-cluster deployment steps.

  4. Install HBase

    1. Upload hbase-2.2.1-bin.tar.gz to the virtual machines hadoop01, hadoop02, and hadoop03

    2. Extract the archive

```shell
tar -zxvf hbase-2.2.1-bin.tar.gz
```

    3. Configure the environment variables (on every node)

```shell
HBASE_HOME=/opt/hbase-2.2.1
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/bin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin
export PATH CLASSPATH JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME HBASE_HOME
```

    4. Set up cluster time synchronization

```shell
yum -y install ntp ntpdate   # install the ntp/ntpdate time-sync tools
sudo systemctl start ntpd    # start the time-sync daemon
sudo systemctl enable ntpd   # start the daemon automatically on boot
```

      Use hadoop01 as the time-sync server; the other nodes sync against it.

      On hadoop01, add to /etc/ntp.conf:

```shell
server 127.0.0.1             # act as the time-sync server
restrict 192.168.0.0
```

      On the other nodes, add to /etc/ntp.conf:

```shell
server 192.168.127.128
```

      On all nodes, enable synchronization and verify:

```shell
sudo timedatectl set-ntp yes   # enable time synchronization
timedatectl                    # check the system time
```
    5. Edit the HBase configuration files

      Edit hbase-env.sh. The shipped template (Apache license header and commented-out defaults) can be left untouched; the lines actually changed for this cluster are:

```shell
# The java implementation to use. Java 1.8+ required.
export JAVA_HOME=/usr/java/jdk1.8.0_192-amd64

# Extra Java CLASSPATH elements: point HBase at the Hadoop configuration.
export HBASE_CLASSPATH=/opt/hadoop-3.1.2/etc/hadoop

# Extra Java runtime options.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"

# The directory where pid files are stored. /tmp by default.
export HBASE_PID_DIR=/opt/hbase-2.2.1/pids

# false = use our external ZooKeeper cluster; true = use HBase's bundled ZooKeeper.
export HBASE_MANAGES_ZK=false
```
      Edit hbase-site.xml:

```xml
<configuration>
  <!-- host and port of the HBase master -->
  <property>
    <name>hbase.master</name>
    <value>hadoop01:60000</value>
  </property>
  <!--
  <property>
    <name>hbase.master.info.port</name>
    <value>60010</value>
  </property>
  -->
  <!-- maximum clock skew tolerated between nodes (ms) -->
  <property>
    <name>hbase.master.maxclockskew</name>
    <value>180000</value>
  </property>
  <!-- shared HBase root directory on HDFS, where HBase data is persisted -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop01:9000/hbase</value>
  </property>
  <!-- whether to run distributed; false means standalone -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- ZooKeeper quorum addresses -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop01,hadoop02,hadoop03</value>
  </property>
  <!-- where ZooKeeper keeps its configuration snapshots -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hbase/tmp/zookeeper</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>
```
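The hbase.master.maxclockskew setting above (180000 ms) is why the NTP setup earlier matters: the master refuses to accept a RegionServer whose clock drifts past that bound. A minimal illustration of the check (illustrative Python, not HBase source code):

```python
MAX_CLOCK_SKEW_MS = 180000   # matches hbase.master.maxclockskew above

def clock_ok(master_time_ms, server_time_ms, max_skew_ms=MAX_CLOCK_SKEW_MS):
    # A RegionServer may join the cluster only if its clock is within the allowed skew.
    return abs(master_time_ms - server_time_ms) <= max_skew_ms

print(clock_ok(1_700_000_000_000, 1_700_000_120_000))  # True  (120 s drift, within bound)
print(clock_ok(1_700_000_000_000, 1_700_000_200_000))  # False (200 s drift, rejected)
```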
      Edit the regionservers file (it lists the HBase slave nodes):

```shell
hadoop02
hadoop03
```

Copy Hadoop's two configuration files, core-site.xml and hdfs-site.xml, into HBase's conf directory:

```shell
cp /opt/hadoop-3.1.2/etc/hadoop/core-site.xml /opt/hbase-2.2.1/conf
cp /opt/hadoop-3.1.2/etc/hadoop/hdfs-site.xml /opt/hbase-2.2.1/conf
```

Distribute /opt/hbase-2.2.1/conf/* from hadoop01 to the hadoop02 and hadoop03 nodes:

```shell
scp /opt/hbase-2.2.1/conf/* hadoop02:/opt/hbase-2.2.1/conf/
scp /opt/hbase-2.2.1/conf/* hadoop03:/opt/hbase-2.2.1/conf/
```

Start and stop HBase:

```shell
start-hbase.sh
stop-hbase.sh
```

Check the running HBase processes:

```shell
# On hadoop01 only HMaster runs (the master node); on the slave nodes
# hadoop02 and hadoop03 only HRegionServer runs.
jps
```

Enter the HBase shell:

```shell
hbase shell
```