HBase - HBase Data Migration - Big Data Notes


HBase layer

The CopyTable approach

Copying data between clusters

Usage

Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range.  Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName. 
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells
Args:
 tablename    Name of the table to copy
Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable 
For performance consider the following general options:
-Dhbase.client.scanner.caching=100
-Dmapred.map.tasks.speculative.execution=false

Method 1

create 'table_test',{NAME=>"i"}   # in the HBase shell on the destination cluster, first create a table with the same schema as the source table
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=zk-addr1,zk-addr2,zk-addr3:2181:/hbase \
  table_test
For example:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=192.168.1.253:2181:/hbase test1
(--peer.adr takes the ZooKeeper address of the destination cluster.)

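Method 2: CopyTable also supports time ranges, row ranges, renaming the table, renaming column families, and optionally copying deleted data. For example: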

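hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1265875194289 \
  --endtime=1265878794289 \
  --peer.adr=dstClusterZK:2181:/hbase \
  --families=myOldCf:myNewCf,cf2,cf3 \
  TestTable
--starttime: start of the time range (Unix time in milliseconds)
--endtime: end of the time range
--peer.adr: ZooKeeper address of the destination cluster
--families: column families to copy (myOldCf is written as myNewCf; cf2 and cf3 keep their names)
TestTable: name of the source table to copy
To improve performance, consider adding the following general options:
// Scanner cache size; a larger value needs more memory
-Dhbase.client.scanner.caching=100
// Set to false so that speculative map tasks do not write the same data twice
-Dmapreduce.map.speculative=false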
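Putting the pieces together, the following is a minimal dry-run sketch that only prints the assembled command rather than running it. The function name `build_copytable_cmd` is made up for illustration. It places the `-D` properties directly after the class name, which follows the `[general options]` position shown in the usage text.

```shell
#!/bin/sh
# Dry run: assemble a CopyTable command line and print it (nothing is executed).
# build_copytable_cmd is a hypothetical helper, not part of HBase.
# Arguments: START_MS END_MS PEER_ADR TABLE
build_copytable_cmd() (
  start=$1; end=$2; peer=$3; table=$4
  # The -D properties are "general options"; the usage text lists them
  # before the CopyTable-specific flags, so they follow the class name.
  echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable" \
       "-Dhbase.client.scanner.caching=100" \
       "-Dmapreduce.map.speculative=false" \
       "--starttime=$start --endtime=$end" \
       "--peer.adr=$peer" \
       "$table"
)

# Values taken from the example above.
build_copytable_cmd 1265875194289 1265878794289 dstClusterZK:2181:/hbase TestTable
```

Once the printed line looks right, it can be pasted into a shell on the source cluster.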

A general example

 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
     --starttime=1265875194289 \
     --endtime=1265878794289 \
     --peer.adr=server1,server2,server3:2181:/hbase \
     --families=myOldCf:myNewCf,cf2,cf3 \
     TestTable

Exporting an HBase table to the local file system

hbase org.apache.hadoop.hbase.mapreduce.Export emp file:///Users/a6/Applications/experiment_data/hbase_data/bak
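The exported files can later be loaded with the matching Import job. The sketch below is a dry run that only prints the two commands. The helper name `print_export_import` and the import directory are assumptions for illustration. Import writes into an existing table, so the destination table needs to be created with the same column families beforehand.

```shell
#!/bin/sh
# Dry run: print an Export command and the matching Import command.
# print_export_import is a hypothetical helper, not part of HBase.
# Arguments: TABLE EXPORT_DIR IMPORT_DIR
print_export_import() (
  table=$1; export_dir=$2; import_dir=$3
  # Step 1 (source cluster): scan the table and write it out as files.
  echo "hbase org.apache.hadoop.hbase.mapreduce.Export $table $export_dir"
  # Step 2 (destination cluster): load the transferred files into a table
  # that already exists with the same column families.
  echo "hbase org.apache.hadoop.hbase.mapreduce.Import $table $import_dir"
)

# The import directory below is a placeholder for wherever the files end up.
print_export_import emp \
  file:///Users/a6/Applications/experiment_data/hbase_data/bak \
  hdfs:///tmp/emp_bak
```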

Performance tuning

Copy the data in batches, one time range at a time.
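One way to do this is to split a large time range into consecutive windows and run one CopyTable job per window. The sketch below is a dry run that prints the commands instead of executing them. The helper `print_window_commands` and the `PEER_ADR` value are placeholders for illustration. It assumes that the time range is start-inclusive and end-exclusive, so adjacent windows can share a boundary timestamp without copying a cell twice.

```shell
#!/bin/sh
# Dry run: print one CopyTable command per time window (nothing is executed).
# PEER_ADR is a placeholder, not the address of a real cluster.
PEER_ADR="zk-addr1,zk-addr2,zk-addr3:2181:/hbase"

# print_window_commands is a hypothetical helper, not part of HBase.
# Arguments: TABLE START_MS END_MS STEP_MS
# Covers START_MS up to END_MS in consecutive windows of STEP_MS milliseconds.
print_window_commands() (
  table=$1; start=$2; end=$3; step=$4
  while [ "$start" -lt "$end" ]; do
    stop=$((start + step))
    # Clamp the last window so it never runs past END_MS.
    if [ "$stop" -gt "$end" ]; then stop=$end; fi
    echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=$start --endtime=$stop --peer.adr=$PEER_ADR $table"
    start=$stop
  done
)

# Split a one-hour range (3600000 ms) into 15-minute windows (900000 ms).
print_window_commands TestTable 1265875194289 1265878794289 900000
```

Smaller windows keep each MapReduce job short and make it easier to rerun a single failed batch.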

Summary

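DistCp: data synchronization at the file level; this is also one we use often.
CopyTable: scans the source table and Puts the data directly into the destination table, so it is relatively slow.
Export/Import: similar to CopyTable; scans the data out to files, transfers the files to the destination cluster, then runs Import there.
Snapshot: commonly used and flexible; it is based on snapshots and is relatively efficient.
In practice, choose among these based on the characteristics of your own tables: data volume, read/write patterns, real-time versus offline data, and so on.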
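For comparison with the CopyTable commands earlier, a snapshot-based migration can be sketched as three steps. This is a dry run that only prints the steps. The helper `print_snapshot_steps`, the snapshot name, and the destination HDFS URL are assumptions for illustration. The first and last lines are HBase shell commands, and the middle line is the ExportSnapshot tool run from the command line.

```shell
#!/bin/sh
# Dry run: print the three steps of a snapshot-based migration.
# print_snapshot_steps is a hypothetical helper, not part of HBase.
# Arguments: TABLE SNAPSHOT_NAME DEST_HBASE_ROOT
print_snapshot_steps() (
  table=$1; snap=$2; dest=$3
  # Step 1 (HBase shell, source cluster): take a snapshot of the table.
  echo "snapshot '$table', '$snap'"
  # Step 2 (command line, source cluster): copy the snapshot files across.
  echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $snap -copy-to $dest"
  # Step 3 (HBase shell, destination cluster): materialize a table from it.
  echo "clone_snapshot '$snap', '$table'"
)

# The snapshot name and destination URL below are placeholders.
print_snapshot_steps emp emp_snap hdfs://dst-namenode:8020/hbase
```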