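1 Introduction to HBase

HBase is a column-oriented distributed database (DBMS)
Scalable (data is stored in HDFS, and RegionServers can be added to scale out)
Highly available and reliable
High performance under concurrent access
Column-oriented: the underlying data is stored as key-value pairs, and the table structure is organized by column family
Sparse: rows may have different sets of attributes (semi-structured data)
Multi-version: the number of versions kept per cell is configured per column family, e.g. cf1 VERSIONS=3
Schema-less: beyond the column families there is no fixed schema (columns are not declared in advance)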
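
The sparse, multi-version properties listed above can be pictured as nested sorted maps. The following is only a toy in-memory sketch of the logical model, not how HBase stores data; the class name `SparseVersionedTable` and its methods are hypothetical and exist purely for illustration.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy picture of the logical data model (an illustration, not HBase internals):
// rowkey -> "family:qualifier" -> timestamp -> value
class SparseVersionedTable {
    // Plays the role of VERSIONS on a column family
    private final int maxVersions;
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    SparseVersionedTable(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Store a value; only the newest maxVersions timestamps are kept per cell.
    void put(String rowKey, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<String, NavigableMap<Long, String>>();
            rows.put(rowKey, row);
        }
        NavigableMap<Long, String> versions = row.get(column);
        if (versions == null) {
            versions = new TreeMap<Long, String>();
            row.put(column, versions);
        }
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry(); // drop the oldest version
        }
    }

    // Latest value of a cell, or null when the row has no such column (sparsity).
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).lastEntry().getValue();
    }

    // Number of versions currently retained for a cell.
    int versionCount(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return 0;
        }
        return row.get(column).size();
    }
}
```

With `maxVersions` set to 3, writing a fourth value to the same cell discards the oldest one, and a row that never received a given column simply has no entry for it.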

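Use cases

Tables with tens of millions of rows
Concurrent access
HBase does not support SQL natively, so it is not suited to reporting
It behaves like a HashMap: fetching data by key (e.g. mid) is fast, but fast lookups work along a single dimension only. For example:

"Loyal fans" --> user data

Today's users --> date

Aged 20~30 -->

Tag data / user profiles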

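Dependencies

ZooKeeper
HDFS (underlying storage)
JDK
Time synchronization across nodes

HBase architecture

HMaster
HRegionServer
Region: the data for a range of rows in a table
HDFS stores the data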

Installation and deployment

Start the cluster with start-hbase.sh

http://linux01:16010

HBase shell client

hbase shell
help
version whoami status
create put scan alter
list desc create_namespace

Where the data is stored on HDFS

/hbase/data/default/table_name/region_name/column_family/hfile

hbase hfile -p -f path_to_hfile

Each cell is printed as K: (key) and V: (value)

2 SHELL-CMD

2.1 DDL

alter, alter_async, alter_status, clone_table_schema, create, describe,

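disable: disables a table

disable  'table_name'  # disable the table; data in a disabled table cannot be read or written
# a disabled table can have its schema altered
alter  'tb_a1' , 'delete'=>'cf3'

enable: re-enables a disabled table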

enable  'tb_a1'

disable_all,

enable_all,

hbase(main):021:0> disable_all  't.*'
tb_a1                                                                                                                
tb_a2                                                                                                                
tb_a3                                                                                                                
tb_a4                                                                                                                
tb_aa                                                                                                                
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled

hbase(main):022:0> enable_all  't.*'
tb_a1                                                                                                                
tb_a2                                                                                                                
tb_a3                                                                                                                
tb_a4                                                                                                                
tb_aa                                                                                                                
Enable the above 5 tables (y/n)?
y
5 tables successfully enabled
Took 8.3687 seconds

drop: deletes a table (the table must be disabled first)

disable  'tb_a2'
drop 'tb_a2'

drop_all: drops all tables whose names match a regex

disable_all  't.*'
drop_all  't.*'

exists: checks whether a table exists

hbase(main):036:0> exists  'tb_a1'
Table tb_a1 does exist                                                                                               
Took 0.0096 seconds                                                                                                  
=> true

get_table: returns a reference to a table, which works much like an alias

t_user = get_table 'tb_ods_app_wx_user'

is_disabled: checks whether a table is disabled

is_enabled: checks whether a table is enabled

hbase(main):014:0> is_disabled  'tb_a1'
true
Took 0.0108 seconds
hbase(main):016:0> is_enabled  'tb_a2'
true

list: lists all tables in the system

list 'doit24:.*'  # list the tables in the given namespace

list_regions: shows information about all regions of a table


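How does a table get multiple regions?

When a region grows past a size threshold (hbase.hregion.max.filesize, subject to the split policy), it is split automatically
A split can also be forced manually at a specified split point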


split  'tb_name' , 'rk003'
list_regions  'tb_name'

locate_region: shows which region holds a given rowkey and where that region is hosted

locate_region  'tb_a1' , 'rk003'
HOST                                  REGION                                                                                                    
 linux03:16020                        {ENCODED => c1a08f08a64dd91da77c5bf97e4f6dd3, NAME => 'tb_a1,rk003,1625192839094.c1a08f08a64dd91da77c5bf97
                                      e4f6dd3.', STARTKEY => 'rk003', ENDKEY => ''}                                                             
1 row(s)
c1a08f08a64dd91da77c5bf97e4f6dd3  # the encoded name, which is also the region's directory name on HDFS
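
To make the lookup above concrete, here is a minimal sketch of how a rowkey maps to a region once split points exist: each region covers the range from its start key (inclusive) to the next region's start key (exclusive). The class `RegionLocator` and the `region-N` labels are hypothetical, and the sketch compares keys with `String.compareTo`, which matches HBase's byte-wise ordering only under the assumption that the keys are plain ASCII.

```java
import java.util.Arrays;
import java.util.TreeMap;

// Illustrative only: maps a rowkey to the region whose start key is the
// greatest start key less than or equal to the rowkey.
class RegionLocator {
    // start key -> region label; the first region starts at the empty key
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<String, String>();

    RegionLocator(String... splitPoints) {
        String[] sorted = splitPoints.clone();
        Arrays.sort(sorted);
        regionsByStartKey.put("", "region-0");
        for (int i = 0; i < sorted.length; i++) {
            regionsByStartKey.put(sorted[i], "region-" + (i + 1));
        }
    }

    // Every string is >= "", so floorEntry never returns null here.
    String locate(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    int regionCount() {
        return regionsByStartKey.size();
    }
}
```

With the single split point `rk003` from the transcript, `rk001` falls in the first region, while `rk003` itself and everything after it fall in the second, which agrees with the `STARTKEY => 'rk003', ENDKEY => ''` output.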

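show_filters: lists the available filters. Filters allow WHERE-style conditional queries, but performance is poor

scan 'tb_a1', {FILTER => "ValueFilter (=, 'binary:zss')"}  # cells whose value is zss
scan 'tb_a1', {ROWPREFIXFILTER => 'rk00'}  # rows whose rowkey starts with rk00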
DependentColumnFilter                                                                                                                           
KeyOnlyFilter                                                                                                                                   
ColumnCountGetFilter                                                                                                                            
SingleColumnValueFilter                                                                                                                         
PrefixFilter                                                                                                                                    
SingleColumnValueExcludeFilter                                                                                                                  
FirstKeyOnlyFilter                                                                                                                              
ColumnRangeFilter                                                                                                                               
ColumnValueFilter                                                                                                                               
TimestampsFilter                                                                                                                                
FamilyFilter                                                                                                                                    
QualifierFilter         (column qualifier)
ColumnPrefixFilter
RowFilter     (rowkey)
MultipleColumnPrefixFilter
InclusiveStopFilter
PageFilter
ValueFilter    (value)

2.2 DML

append: appends a string to the end of an existing cell's value

append  'tb_a1' , 'a0001' , 'cf1:name' , '_abc'
 a0001     column=cf1:name, timestamp=1625195924514, value=zss        # before
 a0001     column=cf1:name, timestamp=1625195924514, value=zss_abc    # after
 # if the cell does not exist, a new cell is created, similar to put
 hbase(main):012:0> append  'tb_a1' , 'a0002' , 'cf1:name' , '_abc'
CURRENT VALUE = _abc
Took 0.0134 seconds

count: counts the rows in a table

hbase(main):014:0> count 'tb_a1'
6 row(s)
Took 0.0675 seconds                                                                                                                             
=> 6

delete: deletes a single cell only

delete 'tb_a1' , 'a0001' , 'cf1:name'

deleteall: deletes a whole row, or a single cell

deleteall 'tb_a1' , 'a0001'                # delete the row
deleteall 'tb_a1' , 'a0001' , 'cf1:name'   # delete the cell

get: fetches data one row at a time by rowkey; used frequently in development

get 'tb_a1' , 'a0001'                # one row
get  'tb_a1' , 'a0001' ,'cf1:age'    # one cell
hbase(main):044:0> get  'tb_a1' , 'a0001' ,'cf1:age' , 'cf2:job'    # several cells
COLUMN                                CELL
 cf1:age                              timestamp=1625198011401, value=23
 cf2:job                              timestamp=1625198030308, value=coder
 get 'tb_a1' , 'rk001' , 'cf2:account'
 get 'tb_a1' , 'rk001' , {COLUMN=>'cf2:account' , VERSIONS=>2}    # up to 2 versions of the cell

incr: increments a counter column

incr 'tb_name' , 'rk' , 'cf:col' ,  20    # the first call sets the counter to 20
incr 'tb_name' , 'rk' , 'cf:col' ,  2     # adds 2
incr 'tb_name' , 'rk' , 'cf:col' ,  1     # adds 1

get_counter: reads the value of a counter column

get_counter 'tb_a1' , 'rk006' , 'cf1:age'
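
A counter cell holds its value as an 8-byte big-endian long rather than as text, which is why a counter is read with get_counter instead of a plain get. The sketch below imitates that encoding with the JDK's `ByteBuffer`; the class `CounterCell` and its methods are hypothetical helpers for illustration, not part of the HBase API.

```java
import java.nio.ByteBuffer;

// Illustrative only: a counter value kept as 8 big-endian bytes.
class CounterCell {
    // Encode a long as 8 bytes (ByteBuffer is big-endian by default).
    static byte[] toBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decode 8 big-endian bytes back into a long.
    static long toLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Add delta to the stored value; a missing cell (null) counts as 0.
    static byte[] incr(byte[] current, long delta) {
        long base = (current == null) ? 0L : toLong(current);
        return toBytes(base + delta);
    }
}
```

Replaying the three incr calls above (20, then 2, then 1) on an initially missing cell leaves the 8 bytes that decode to 23.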

get_splits: shows a table's split points

hbase(main):026:0> get_splits  'tb_a1'
Total number of splits = 2
rk003

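put: writes data to a cell
Each put issues its own RPC (network) request, so writing many cells one at a time means many round trips and is inefficient

Buffering a batch and writing it in one go reduces the number of requests
HBase is a distributed database built to store large volumes of data. Suppose a large amount of static data already exists in HDFS:

           [files]  ---> MR ---> written in HFile format and loaded into the HBase table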
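
The effect of buffering a batch of puts can be shown with a small simulation that only counts requests. This is a sketch under stated assumptions: `PutBuffer` is a hypothetical class, each flush stands in for one network round trip, and no HBase call is made.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: collects writes and "sends" them in batches,
// counting one simulated request per flush.
class PutBuffer {
    private final int batchSize;
    private final List<String> pending = new ArrayList<String>();
    private int requestsSent = 0;
    private int cellsSent = 0;

    PutBuffer(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
    }

    // Queue one write; flush automatically once the batch is full.
    void put(String cell) {
        pending.add(cell);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is queued as a single simulated request.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        requestsSent++;
        cellsSent += pending.size();
        pending.clear();
    }

    int requestsSent() {
        return requestsSent;
    }

    int cellsSent() {
        return cellsSent;
    }
}
```

Writing 250 cells with a batch size of 100 takes 3 simulated requests after the final flush, against 250 with a batch size of 1. In the Java client, `Table.put(List<Put>)` and `BufferedMutator` serve this purpose.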

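scan: scans the table; a full-table scan is heavy and is generally avoided on large tables

truncate: deletes the data and discards the existing split points and region information, keeping only the table's basic structure (its column families)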

truncate   'tb_a1'

2 JAVA-API

2.1 Adding dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>doit24_hbase</artifactId>
    <version>1.0-SNAPSHOT</version>
    <!-- JDK version -->
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <!--zookeeper-->
        <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
            <version>3.4.6</version>
        </dependency>
        <!--hadoop-->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.1</version>
        </dependency>
        <!-- HBASE依赖 -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.2.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.2.5</version>
        </dependency>
        <!-- Run MapReduce jobs against HBase, e.g. to import data -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.2.5</version>
        </dependency>
        <!-- JSON parsing -->
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.5</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <!-- get all project dependencies -->
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <!-- MainClass in the manifest makes an executable jar -->
                    <archive>
                        <manifest>
                            <!--<mainClass>util.Microseer</mainClass> -->
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <!-- bind to the packaging phase -->
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

2.2 Getting started: example 1

package com._51doit.hbase.day02;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import java.io.IOException;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
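 * Working with HBase from Java
 *     Connect to ZooKeeper to obtain the HBase cluster information
 *
 *  1 Obtain a Connection object
 *  2 Get a Table object from the connection
 *  3 Table handles DML
 *       get  scan  put  delete incr  append
 *  4 Admin handles DDL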
 *         namespace
 *         tools
 */
public class Hbase01_Demo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // ZooKeeper quorum address
        conf.set("hbase.zookeeper.quorum","linux01:2181,linux02:2181,linux03:2181");
        // Obtain the HBase connection
        Connection conn = ConnectionFactory.createConnection(conf);
        // Table name
        TableName tableName = TableName.valueOf("tb_a1");
        // Table object (DML)
        Table table = conn.getTable(tableName);
        // Admin object (DDL)
        Admin admin = conn.getAdmin();
        // Release resources
        table.close();
        admin.close();
        conn.close();
    }
}

Create a utility class for obtaining the HBase connection

package doit.day02;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import java.io.IOException;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
 */
public class HbaseUtils {
    /**
     * Returns an HBase connection.
     * @return a new Connection; the caller is responsible for closing it
     * @throws Exception if the connection cannot be created
     */
    public  static Connection getConnection() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // ZooKeeper quorum address
        conf.set("hbase.zookeeper.quorum","linux01:2181,linux02:2181,linux03:2181");
        // Create the HBase connection
        Connection conn = ConnectionFactory.createConnection(conf);
        return  conn ;
    }
}

2.3 Listing all namespaces and all tables

package com._51doit.hbase.day02;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
 * 1 Equivalent of the list command:
 *    get the names of all tables in the system
 *  2 Get all namespaces
 */
public class Hbase02_Demo2 {
    public static void main(String[] args) throws  Exception{
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum","linux01:2181,linux02:2181,linux03:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin();
        // Get all namespaces
        NamespaceDescriptor[] namespaceDescriptors = admin.listNamespaceDescriptors();
        for (NamespaceDescriptor namespaceDescriptor : namespaceDescriptors) {
            String name = namespaceDescriptor.getName();
            System.out.println(name);
        }
        // Get all tables through Admin
        TableName[] tableNames = admin.listTableNames();
        // Iterate over the table names
        for (TableName tableName : tableNames) {
            // Table name as a byte array
            byte[] name = tableName.getName();
            // Wrap it in a String and print it
            System.out.println(new String(name));
        }
        admin.close();
        conn.close();
    }
}

2.4 Creating a table (one column family)

package doit.day02;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
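 * Creating a table requires a table name and at least one column family
 *  create 'tb_name' , 'cf'
 *  Java call: createTable(tableDescriptor);
 *  Argument: the table descriptor
 *    builder  -->  build
 *    1 table descriptor builder
 *      --- add the column family descriptor
 *    2 build the table descriptor
 *    1 column family descriptor builder
 *    2 build the column family descriptor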
 */
public class Hbase03_CreateTable01 {
    public static void main(String[] args) throws Exception {
        Connection conn = HbaseUtils.getConnection();
        Admin admin = conn.getAdmin();
        /**
         * The table descriptor builder
         *   builds the table descriptor (TableDescriptor)
         * The column family descriptor builder
         *   builds the column family descriptor
         */
        TableDescriptorBuilder tableDescriptorBuilder = TableDescriptorBuilder.newBuilder(TableName.valueOf("tb_user"));
        // Column family descriptor builder
        ColumnFamilyDescriptorBuilder columnFamilyDescriptorBuilder = ColumnFamilyDescriptorBuilder.newBuilder("cf".getBytes());
        ColumnFamilyDescriptor familyDescriptor = columnFamilyDescriptorBuilder.build();
        // Add the column family descriptor to the table
        tableDescriptorBuilder.setColumnFamily(familyDescriptor) ;
        // Build the table descriptor
        TableDescriptor tableDescriptor = tableDescriptorBuilder.build();
        // Create the table
        admin.createTable(tableDescriptor);
        admin.close();
        conn.close();
    }
}

2.5 Creating a table (multiple column families)

package doit.day02;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import java.util.ArrayList;
import java.util.List;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
 * Create a table with multiple column families
 */
public class Hbase04_CreateTable02 {
    public static void main(String[] args) throws  Exception {
        Connection conn = HbaseUtils.getConnection();
        Admin admin = conn.getAdmin();
        // Table descriptor builder
        TableDescriptorBuilder tbbuild = TableDescriptorBuilder.newBuilder(TableName.valueOf("tb_teacher"));
        // Column family descriptor builder for cf1
        ColumnFamilyDescriptorBuilder cf1Build = ColumnFamilyDescriptorBuilder.newBuilder("cf1".getBytes());
        // Build the column family descriptor
        ColumnFamilyDescriptor cf1 = cf1Build.build();
        // Column family descriptor builder for cf2
        ColumnFamilyDescriptorBuilder cf2Build = ColumnFamilyDescriptorBuilder.newBuilder("cf2".getBytes());
        // Build the column family descriptor
        ColumnFamilyDescriptor cf2 = cf2Build.build();
        // Collect the column family descriptors in a list
        List<ColumnFamilyDescriptor> cfs = new ArrayList<>() ;
        cfs.add(cf1);
        cfs.add(cf2);
        tbbuild.setColumnFamilies(cfs);
        // Build the table descriptor
        TableDescriptor tableDescriptor = tbbuild.build();
        admin.createTable(tableDescriptor);
        admin.close();
        conn.close();
    }
}

2.6 Creating a table (multiple column families, with column family properties)

package doit.day02;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import java.util.ArrayList;
import java.util.List;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
 * Create a table with multiple column families and set their properties
 */
public class Hbase05_CreateTable03 {
    public static void main(String[] args) throws  Exception {
        Connection conn = HbaseUtils.getConnection();
        Admin admin = conn.getAdmin();
        // Table descriptor builder
        TableDescriptorBuilder tbbuild = TableDescriptorBuilder.newBuilder(TableName.valueOf("tb_teacher2"));
        // Column family descriptor builder for cf1
        ColumnFamilyDescriptorBuilder cf1Build = ColumnFamilyDescriptorBuilder.newBuilder("cf1".getBytes());
        /**
         * Set this column family's properties:
         * keep up to 3 versions of each cell
         */
        cf1Build.setMaxVersions(3) ;
        // Build the column family descriptor
        ColumnFamilyDescriptor cf1 = cf1Build.build();
        // Column family descriptor builder for cf2
        ColumnFamilyDescriptorBuilder cf2Build = ColumnFamilyDescriptorBuilder.newBuilder("cf2".getBytes());
        /**
         * Set this column family's properties:
         * time to live of the data in this column family, in seconds
         */
        cf2Build.setTimeToLive(240) ;
        // Build the column family descriptor
        ColumnFamilyDescriptor cf2 = cf2Build.build();
        // Register both column families
        List<ColumnFamilyDescriptor> cfs = new ArrayList<>() ;
        cfs.add(cf1);
        cfs.add(cf2);
        tbbuild.setColumnFamilies(cfs);
        // Build the table descriptor
        TableDescriptor tableDescriptor = tbbuild.build();
        admin.createTable(tableDescriptor);
        admin.close();
        conn.close();
    }
}

2.7 Creating a pre-split table

package doit.day02;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import java.util.ArrayList;
import java.util.List;
/**
 * Author:   Hang.Z
 * Date:     21/07/02
 * Description:
 * Create a table with pre-split regions
 */
public class Hbase06_CreatePreRegionTable04 {
    public static void main(String[] args) throws  Exception {
        Connection conn = HbaseUtils.getConnection();
        Admin admin = conn.getAdmin();
        // Table descriptor builder
        TableDescriptorBuilder tbbuild = TableDescriptorBuilder.newBuilder(TableName.valueOf("tb_pre_region2"));
        // Column family descriptor builder for cf1
        ColumnFamilyDescriptorBuilder cf1Build = ColumnFamilyDescriptorBuilder.newBuilder("cf1".getBytes());
        /**
         * Set this column family's properties:
         * keep up to 3 versions of each cell
         */
        cf1Build.setMaxVersions(3) ;
        // Build the column family descriptor
        ColumnFamilyDescriptor cf1 = cf1Build.build();
        // Column family descriptor builder for cf2
        ColumnFamilyDescriptorBuilder cf2Build = ColumnFamilyDescriptorBuilder.newBuilder("cf2".getBytes());
        /**
         * Set this column family's properties:
         * time to live of the data in this column family, in seconds
         */
        cf2Build.setTimeToLive(240) ;
        // Build the column family descriptor
        ColumnFamilyDescriptor cf2 = cf2Build.build();
        // Register both column families
        List<ColumnFamilyDescriptor> cfs = new ArrayList<>() ;
        cfs.add(cf1);
        cfs.add(cf2);
        tbbuild.setColumnFamilies(cfs);
        // Build the table descriptor
        TableDescriptor tableDescriptor = tbbuild.build();
        /**
         * Argument 1: the table descriptor
         * Argument 2: the split points; all data in HBase is byte arrays
         *  "a".getBytes()
         *   ["a".getBytes(), "o".getBytes()]
         *   with the split points below, rowkey a0001 falls in the first region
         *   and z001 in the last
         */
        byte[][] keys = new byte[][]{"f".getBytes(),"o".getBytes(),"x".getBytes()} ;
        admin.createTable(tableDescriptor , keys);
        admin.close();
        conn.close();
    }
}

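3 Hotspot problem

Problem: when a table holds little data, HBase by default gives it a single region. As the data grows, more and more of it is stored on the same machine rather than being spread across several machines, so both queries and inserts slow down. This is the hotspot problem.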


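Solution:

Specify split points when creating the table, so that the table is divided into multiple regions up front and the data is distributed sensibly across different nodes according to the split points.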


In the shell

create  'tb_pre_region' , 'cf' , SPLITS=>['f','o','x']

In Java

byte[][] keys = new byte[][]{"f".getBytes(),"o".getBytes(),"x".getBytes()};
admin.createTable(tableDescriptor , keys);
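
Pre-splitting only spreads the load if the rowkeys themselves fall into different split ranges. The sketch below counts how many keys land in each of the four regions produced by the split points f, o and x. The class `PreSplitDistribution` and the sample rowkeys are made up for illustration, and keys are compared with `String.compareTo`, which agrees with HBase's byte ordering only under the assumption of plain ASCII keys.

```java
import java.util.Arrays;

// Illustrative only: counts rowkeys per region for a pre-split table.
class PreSplitDistribution {
    // Region index = number of split points that are <= the rowkey.
    static int regionIndex(String[] sortedSplits, String rowKey) {
        int index = 0;
        for (String split : sortedSplits) {
            if (rowKey.compareTo(split) >= 0) {
                index++;
            }
        }
        return index;
    }

    // One counter per region; n split points give n + 1 regions.
    static int[] countPerRegion(String[] sortedSplits, String[] rowKeys) {
        int[] counts = new int[sortedSplits.length + 1];
        for (String rowKey : rowKeys) {
            counts[regionIndex(sortedSplits, rowKey)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] splits = {"f", "o", "x"};
        // Keys sharing one prefix all land in the same region
        String[] samePrefix = {"rk001", "rk002", "rk003", "rk004"};
        // Keys with varied leading characters spread out
        String[] spread = {"a001", "g002", "p003", "y004"};
        System.out.println(Arrays.toString(countPerRegion(splits, samePrefix))); // [0, 0, 4, 0]
        System.out.println(Arrays.toString(countPerRegion(splits, spread)));     // [1, 1, 1, 1]
    }
}
```

Keys that all start with rk sit between o and x, so every one of them goes to the third region and the hotspot remains. The split points therefore need to be chosen to match the actual rowkey distribution.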