1. HBase reads and writes fail with "is not online on cdh7,16020,1642669836559"
The log shows that the record with rowkey NX_GD_NSSF_FJ_P3_L9_089_AI0013.1626750860235 is the one causing the problem.
2. Check whether any HDFS block files are corrupt

    hadoop fsck /hbase/data

No blocks are missing.
3. Check the failing HBase table for consistency problems

    sudo -u hbase hbase hbck -details NSSFJ_BAK

5057 inconsistencies detected
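
The summary line above can also be pulled out of the hbck output by a script, which is handy for comparing the count before and after a repair. This is a minimal sketch; the function name `hbck_inconsistency_count` is my own, and it assumes the summary has the form "N inconsistencies detected" as shown above.

```shell
#!/bin/sh
# Print the number N from an hbck summary line of the form
# "N inconsistencies detected"; prints nothing if no such line is present.
hbck_inconsistency_count() {
    sed -n 's/^\([0-9][0-9]*\) inconsistencies detected.*$/\1/p' | tail -n 1
}

# Demo on the summary line quoted above. On a real cluster the input would be:
#   sudo -u hbase hbase hbck -details NSSFJ_BAK 2>&1 | hbck_inconsistency_count
printf '%s\n' '5057 inconsistencies detected' | hbck_inconsistency_count
```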

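4. Attempt a repair

    sudo -u hbase hbase hbck -fixHdfsOverlaps <table_name>
    sudo -u hbase hbase hbck -details <table_name> | grep ERROR | grep Region | wc -l
    sudo -u hbase hbase hbck -fixHdfsOverlaps -maxMerge 6 <table_name>
    sudo -u hbase hbase hbck -repairHoles <table_name>
    sudo -u hbase hbase hbck -details <table_name> | grep ERROR | grep Region | wc -l

Reference: https://blog.51cto.com/u_12902538/3727656

After the repair finished, running the -details check again reported no inconsistencies for the table.
5. The RegionServer log hbase-hbase-regionserver-rzx4.log on the server showed that the data itself was bad: duplicate entries had appeared. So I decided to try deleting the duplicates by hand.
6. Inspect the WAL directory, then remove the leftover -splitting directory

    hdfs dfs -ls /hbase/WALs
    hdfs dfs -rm -f -R /apps/hbase/data/WALs/rzx4,16020,1567520462849-splitting

Other write-ups online say HBase has to be restarted after the deletion. I did not restart. I queried the HBase table directly and, to my surprise, the data came back, and rerunning the application job also succeeded.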
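
Before deleting anything, it may help to list only the -splitting directories so the removal target can be reviewed first. The sketch below is an assumption-laden illustration: `splitting_dirs` is a name I made up, and the sample listing (permissions, owners, dates, and the /hbase/WALs prefix on the second entry) is invented for the demo; only the server names come from the text above. It relies on `hdfs dfs -ls` printing the path as the last field of each line.

```shell
#!/bin/sh
# Print the path of every "hdfs dfs -ls" entry whose path ends in
# "-splitting". The path is the last whitespace-separated field.
splitting_dirs() {
    awk '$NF ~ /-splitting$/ { print $NF }'
}

# Illustrative listing; everything except the server names is made up.
sample='Found 2 items
drwxr-xr-x   - hbase hbase          0 2022-01-20 17:10 /hbase/WALs/cdh7,16020,1642669836559
drwxr-xr-x   - hbase hbase          0 2019-09-03 22:21 /hbase/WALs/rzx4,16020,1567520462849-splitting'

# Real usage would be: hdfs dfs -ls /hbase/WALs | splitting_dirs
printf '%s\n' "$sample" | splitting_dirs
```

The function only prints candidates. Edits in a -splitting directory that were never replayed are lost once it is removed, so the deletion itself is best left as a manual step.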

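II. Further notes

1. HBaseFsck (hbck)

hbck is a command-line tool that checks for region consistency and table integrity problems and repairs corruption.
2. It works in two basic modes: a read-only inconsistency identification mode and a multi-phase read-write repair mode

  1. Read-only inconsistency identification: in this mode (the default), a report is generated but no repairs are attempted.
  2. Read-write repair: in this mode, if errors are found, hbck tries to repair them.

3. Always run HBase administrative commands such as hbck as the HBase user
The hbck tool that ships with hbase-1.x has been made read-only in hbase-2.x. Because HBase internals have changed, it cannot repair an hbase-2.x cluster. Because it does not understand how hbase-2.x operates, its assessments in read-only mode should not be trusted either.
CDH 5.16.2 ships HBase 1.2.0. On CDH 6.x the hbase-1.x commands shown here cannot be used.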
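
Since the right tool depends on the HBase major version, a small check like the one below can guard a repair script. It is only a sketch: the function names are mine, the 2.x version string in the demo is illustrative, and it assumes `hbase version` prints a line starting with "HBase <version>".

```shell
#!/bin/sh
# Extract the major version from a line such as "HBase 1.2.0-cdh5.16.2".
hbase_major_version() {
    sed -n 's/^HBase \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Map the major version read from stdin to the repair tool that applies.
pick_hbck_tool() {
    major=$(hbase_major_version)
    case "$major" in
        1) echo "hbck1" ;;
        2) echo "HBCK2" ;;
        *) echo "unknown" ;;
    esac
}

# Real usage would be: hbase version 2>&1 | pick_hbck_tool
echo "HBase 1.2.0-cdh5.16.2" | pick_hbck_tool
echo "HBase 2.1.0-cdh6.3.2" | pick_hbck_tool
```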
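4. Running hbck manually

  1. The hbck command is located in the bin directory of the HBase installation.
  2. With no arguments, hbck checks HBase for inconsistencies and prints OK if none are found; otherwise it prints the number of inconsistencies.
  3. With the -details argument, hbck checks HBase for inconsistencies and prints a detailed report.
  4. To limit hbck to specific tables, pass them as a space-separated list: hbck <table1> <table2>
  • The following hbck options modify HBase metadata and are dangerous. They are not coordinated by the HMaster and can conflict with commands the HMaster is currently running or coordinating, causing further corruption. Even if the HMaster is down, it may try to recover its latest operation when it restarts. Use these options only as a last resort. hbck can only repair actual HBase metadata corruption; it is not a general-purpose maintenance tool. In addition, running any of these commands requires an HMaster restart.
  1. If region-level inconsistencies are found, use the -fix argument to have hbck try to repair them. The following sequence is run:
    • The standard check for inconsistencies is run.
    • If needed, tables are repaired.
    • If needed, regions are repaired. Regions are closed during repair.
  2. Instead of repairing everything automatically with -fix, you can repair individual region-level inconsistencies separately:
    • -fixAssignments: repairs unassigned, incorrectly assigned, or multiply assigned regions.
    • -fixMeta: removes rows from hbase:meta when the corresponding region does not exist in HDFS, and adds new meta rows when a region exists in HDFS but not in hbase:meta.
    • -repairHoles: creates HFiles for new empty regions on the filesystem (HDFS, etc.) and ensures the new regions are consistent.
    • -fixHdfsOrphans: repairs region directories that are missing the region metadata file (.regioninfo).
    • -fixHdfsOverlaps: repairs overlapping regions. It can be tuned with the options below. (When more than 5 regions report "The same start key", you must add -maxMerge; otherwise the fix does nothing.)
    • -maxMerge <n>: sets the maximum number of regions to merge.
    • -sidelineBigOverlaps: when repairing region overlaps, allows big overlaps to be sidelined.
    • -maxOverlapsToSideline <n>: when repairing region overlaps, allows at most n regions per group to be sidelined (default n=2).
  3. To try to repair all inconsistencies and corruption at once, use the -repair option, which includes all the region and table consistency options.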

hbck commands

  1. To run hbck, use the hbase hbck command. Run it with the -h option for more usage information.
  2. One repair sequence

    sudo -u hbase hbase hbck -fixHdfsOverlaps <table_name>
    sudo -u hbase hbase hbck -details <table_name> | grep ERROR | grep Region | wc -l
    sudo -u hbase hbase hbck -fixHdfsOverlaps -maxMerge 6 <table_name>
    sudo -u hbase hbase hbck -repairHoles <table_name>
    sudo -u hbase hbase hbck -details <table_name> | grep ERROR | grep Region | wc -l
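
The `grep ERROR | grep Region | wc -l` pipeline in the sequence above is run twice, so it can be wrapped in a function and the two counts compared. A minimal sketch, with `count_region_errors` as a hypothetical name; the sample input reuses two ERROR lines quoted later in this note.

```shell
#!/bin/sh
# Count input lines that contain both "ERROR" and "Region".
# Equivalent to: grep ERROR | grep Region | wc -l
count_region_errors() {
    grep ERROR | grep -c Region
}

# Two ERROR lines taken from this note; only the second mentions "Region".
sample='ERROR: Found lingering reference file hdfs://nameservice1/hbase/data/default/NSSFJ_BAK/b8e8bc8c9c03c7faade2e383eab83272/cf/NSSFJ=e201b14d7ee8ae008ba14d60ed2deab2-d78a580abdb84b2a97cf3fbafc339233.b3bb80e20d70b78f72fcd25df20b856b
ERROR: Region { meta => NSSFJ_BAK,NX_GD_NSSF_FJ_P3_L9_089_AI0026.20201212170329,1626750860235.b8e8bc8c9c03c7faade2e383eab83272., hdfs => hdfs://nameservice1/hbase/data/default/NSSFJ_BAK/b8e8bc8c9c03c7faade2e383eab83272, deployed => , replicaId => 0 } not deployed on any region server.'

# Real usage would be:
#   sudo -u hbase hbase hbck -details <table_name> 2>&1 | count_region_errors
printf '%s\n' "$sample" | count_region_errors
```

If the count after -fixHdfsOverlaps is not lower than the count before, that is a hint the -maxMerge variant is needed.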

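2. hbck2

2.1 Introduction

HBCK2 is currently a simple tool that does one thing at a time.
In hbase-2.x the Master is the final arbiter of all state, so the general principle behind most HBCK2 commands is that they ask the Master to carry out every repair. This means the Master must be up before you can run HBCK2 commands.
HBCK2 is implemented on top of the HbckService hosted on the Master. The service publishes a number of methods for the HBCK2 tool to use. So for HBCK2 commands that depend on the Master's HbckService facade, the first thing HBCK2 does is poke the cluster to make sure the service is available. This fails if the remote server does not publish the service or if the HbckService lacks the requested method. In the latter case, update your cluster if you can, to get more repair facilities.
HBCK is the HBase 1.x command. On HBase 2.x it no longer applies and its write functionality (-fix) has been removed. It can still report on the state of an HBase 2.x cluster, but because it does not understand the internal workings of HBase 2.x, its assessment will be inaccurate. If you are running HBase 2.x, it is worth knowing something about HBCK2 even if you rarely need it.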

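2.2 Obtaining and using HBCK2

HBCK2 has been split out of HBase into a separate project. To use it, build it from source against your own HBase version.
GitHub repository: https://github.com/apache/hbase-operator-tools

  1. In the pom, change the HBase version to your actual hbase-2.x version, then run the build from the project root:

    mvn clean install -DskipTests
  2. The build produces several jars. Take the hbck2 one: hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar.

  3. Using HBCK2

    The simplest way to run HBCK2 with its dependencies is to launch it through the $HBASE_HOME/bin/hbase script. The bin/hbase script already knows about hbck: an hbck option is listed in its help output. By default, running bin/hbase hbck runs the built-in hbck1 tool. To run HBCK2, point at the built HBCK2 jar with the -j option:

      ${HBASE_HOME}/bin/hbase --config /etc/hbase-conf hbck -j ~/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
    /etc/hbase-conf above is where the deployed configuration lives. Run as above, with no options or arguments, the command dumps the HBCK2 help: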

```shell
usage: HBCK2 [OPTIONS] COMMAND <ARGS>
Options:
 -d,--debug                                 run with debug output
 -h,--help                                  output this help message
 -p,--hbase.zookeeper.property.clientPort   port of hbase ensemble
 -q,--hbase.zookeeper.quorum                hbase ensemble
 -s,--skip                                  skip hbase version check
                                            (PleaseHoldException)
 -v,--version                               this hbck2 version
 -z,--zookeeper.znode.parent                parent znode of hbase
                                            ensemble
Command:
 addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
   Options:
    -d,--force_disable  aborts fix for table if disable fails.
   To be used when regions missing from hbase:meta but directories
   are present still in HDFS. Can happen if user has run hbck1
   'OfflineMetaRepair' against an hbase-2.x cluster. Needs hbase:meta
   to be online. For each table name passed as parameter, performs
   diff between regions available in hbase:meta and region dirs on
   HDFS. Then for dirs with no hbase:meta matches, it reads the
   'regioninfo' metadata file and re-creates given region in
   hbase:meta. Regions are re-created in 'CLOSED' state in the
   hbase:meta table, but not in the Masters' cache, and they are not
   assigned either. To get these regions online, run the HBCK2
   'assigns' command printed when this command-run completes.
   NOTE: If using hbase releases older than 2.3.0, a rolling restart
   of HMasters is needed prior to executing the set of 'assigns'
   output.
   An example adding missing regions for tables 'tbl_1' in the default
   namespace, 'tbl_2' in namespace 'n1' and for all tables from
   namespace 'n2':
     $ HBCK2 addFsRegionsMissingInMeta default:tbl_1 n1:tbl_2 n2
   Returns HBCK2 an 'assigns' command with all re-inserted regions.
   SEE ALSO: reportMissingRegionsInMeta
   SEE ALSO: fixMeta

 assigns [OPTIONS] ...
   Options:
    -o,--override    override ownership by another procedure
    -i,--inputFiles  take one or more files of encoded region names
   A 'raw' assign that can be used even during Master initialization
   (if the -skip flag is specified). Skirts Coprocessors. Pass one or
   more encoded region names. 1588230740 is the hard-coded name for
   the hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an
   example of what a user-space encoded region name looks like.
   For example:
     $ HBCK2 assigns 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created AssignProcedure(s) or -1 if none.
   If -i or --inputFiles is specified, pass one or more input file
   names. Each file contains encoded region names, one per line.
   For example:
     $ HBCK2 assigns -i fileName1 fileName2

 bypass [OPTIONS] ...
   Options:
    -o,--override   override if procedure is running/stuck
    -r,--recursive  bypass parent and its children. SLOW! EXPENSIVE!
    -w,--lockWait   milliseconds to wait before giving up; default=1
   Pass one (or more) procedure 'pid's to skip to procedure finish.
   Parent of bypassed procedure will also be skipped to the finish.
   Entities will be left in an inconsistent state and will require
   manual fixup. May need Master restart to clear locks still held.
   Bypass fails if procedure has children. Add 'recursive' if all you
   have is a parent pid to finish parent and children. This is SLOW,
   and dangerous so use selectively. Does not always work.

 extraRegionsInMeta ...
   Options:
    -f, --fix  fix meta by removing all extra regions found.
   Reports regions present on hbase:meta, but with no related
   directories on the file system. Needs hbase:meta to be online.
   For each table name passed as parameter, performs diff between
   regions available in hbase:meta and region dirs on the given file
   system. Extra regions would get deleted from Meta if passed the
   --fix option.
   NOTE: Before deciding on use the "--fix" option, it's worth check
   if reported extra regions are overlapping with existing valid
   regions. If so, then "extraRegionsInMeta --fix" is indeed the
   optimal solution. Otherwise, "assigns" command is the simpler
   solution, as it recreates regions dirs in the filesystem, if not
   existing.
   An example triggering extra regions report for tables 'table_1'
   and 'table_2', under default namespace:
     $ HBCK2 extraRegionsInMeta default:table_1 default:table_2
   An example triggering missing regions report for table 'table_1'
   under default namespace, and for all tables from namespace 'ns1':
     $ HBCK2 extraRegionsInMeta default:table_1 ns1
   Returns list of extra regions for each table passed as parameter,
   or for each table on namespaces specified as parameter.

 filesystem [OPTIONS] [...]
   Options:
    -f, --fix  sideline corrupt hfiles, bad links, and references.
   Report on corrupt hfiles, references, broken links, and integrity.
   Pass '--fix' to sideline corrupt files and links. '--fix' does NOT
   fix integrity issues; i.e. 'holes' or 'orphan' regions. Pass one or
   more tablenames to narrow checkup. Default checks all tables and
   restores 'hbase.version' if missing. Interacts with the filesystem
   only! Modified regions need to be reopened to pick-up changes.

 fixMeta
   Do a server-side fix of bad or inconsistent state in hbase:meta.
   Available in hbase 2.2.1/2.1.6 or newer versions. Master UI has
   matching, new 'HBCK Report' tab that dumps reports generated by
   most recent run of catalog_janitor and a new 'HBCK Chore'. It is
   critical that hbase:meta first be made healthy before making any
   other repairs. Fixes 'holes', 'overlaps', etc., creating (empty)
   region directories in HDFS to match regions added to hbase:meta.
   Command is NOT the same as the old hbck1 command named similarly.
   Works against the reports generated by the last catalog_janitor and
   hbck chore runs. If nothing to fix, run is a noop. Otherwise, if
   'HBCK Report' UI reports problems, a run of fixMeta will clear up
   hbase:meta issues. See 'HBase HBCK' UI for how to generate new
   report.
   SEE ALSO: reportMissingRegionsInMeta

 replication [OPTIONS] [...]
   Options:
    -f, --fix  fix any replication issues found.
   Looks for undeleted replication queues and deletes them if passed
   the '--fix' option. Pass a table name to check for replication
   barrier and purge if '--fix'.

 reportMissingRegionsInMeta ...
   To be used when regions missing from hbase:meta but directories
   are present still in HDFS. Can happen if user has run hbck1
   'OfflineMetaRepair' against an hbase-2.x cluster. This is a CHECK
   only method, designed for reporting purposes and doesn't perform
   any fixes, providing a view of which regions (if any) would get
   re-added to hbase:meta, grouped by respective table/namespace. To
   effectively re-add regions in meta, run addFsRegionsMissingInMeta.
   This command needs hbase:meta to be online. For each
   namespace/table passed as parameter, it performs a diff between
   regions available in hbase:meta against existing regions dirs on
   HDFS. Region dirs with no matches are printed grouped under its
   related table name. Tables with no missing regions will show a
   'no missing regions' message. If no namespace or table is
   specified, it will verify all existing regions. It accepts a
   combination of multiple namespace and tables. Table names should
   include the namespace portion, even for tables in the default
   namespace, otherwise it will assume as a namespace value.
   An example triggering missing regions report for tables 'table_1'
   and 'table_2', under default namespace:
     $ HBCK2 reportMissingRegionsInMeta default:table_1 default:table_2
   An example triggering missing regions report for table 'table_1'
   under default namespace, and for all tables from namespace 'ns1':
     $ HBCK2 reportMissingRegionsInMeta default:table_1 ns1
   Returns list of missing regions for each table passed as parameter,
   or for each table on namespaces specified as parameter.

 setRegionState
   Possible region states:
    OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT,
    FAILED_OPEN, FAILED_CLOSE, MERGING, MERGED, SPLITTING_NEW,
    MERGING_NEW, ABNORMALLY_CLOSED
   WARNING: This is a very risky option intended for use as last
   resort. Example scenarios include unassigns/assigns that can't move
   forward because region is in an inconsistent state in 'hbase:meta'.
   For example, the 'unassigns' command can only proceed if passed a
   region in one of the following states:
    SPLITTING|SPLIT|MERGING|OPEN|CLOSING
   Before manually setting a region state with this command, please
   certify that this region is not being handled by a running
   procedure, such as 'assign' or 'split'. You can get a view of
   running procedures in the hbase shell using the 'list_procedures'
   command. An example setting region
   'de00010733901a05f5a2a3a382e27dd4' to CLOSING:
     $ HBCK2 setRegionState de00010733901a05f5a2a3a382e27dd4 CLOSING
   Returns "0" if region state changed and "1" otherwise.

 setTableState
   Possible table states: ENABLED, DISABLED, DISABLING, ENABLING
   To read current table state, in the hbase shell run:
     hbase> get 'hbase:meta', '<TABLENAME>', 'table:state'
   A value of \x08\x00 == ENABLED, \x08\x01 == DISABLED, etc.
   Can also run a 'describe "<TABLENAME>"' at the shell prompt.
   An example making table name 'user' ENABLED:
     $ HBCK2 setTableState users ENABLED
   Returns whatever the previous table state was.

 scheduleRecoveries ...
   Schedule ServerCrashProcedure(SCP) for list of RegionServers. Format
   server name as '<HOSTNAME>,<PORT>,<STARTCODE>' (See HBase UI/logs).
   Example using RegionServer 'a.example.org,29100,1540348649479':
     $ HBCK2 scheduleRecoveries a.example.org,29100,1540348649479
   Returns the pid(s) of the created ServerCrashProcedure(s) or -1 if
   no procedure created (see master logs for why not).
   Command support added in hbase versions 2.0.3, 2.1.2, 2.2.0 or newer.

 unassigns ...
   Options:
    -o,--override  override ownership by another procedure
   A 'raw' unassign that can be used even during Master initialization
   (if the -skip flag is specified). Skirts Coprocessors. Pass one or
   more encoded region names. 1588230740 is the hard-coded name for
   the hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an
   example of what a userspace encoded region name looks like.
   For example:
     $ HBCK2 unassigns 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created UnassignProcedure(s) or -1 if none.

 SEE ALSO, org.apache.hbase.hbck1.OfflineMetaRepair, the offline
 hbase:meta tool. See the HBCK2 README for how to use.
```

With that, the familiar commands appear: assigns, bypass, extraRegionsInMeta, fixMeta. All of this comes from the official documentation, which is clearly written and worth reading through when time allows. [https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2](https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2)

```shell
# Set the table state
sudo -u hbase hbase hbck -j hbase-hbck2-1.3.0-SNAPSHOT.jar setTableState <table_name> [ENABLED | DISABLED]
```
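
Because setTableState only accepts the states listed in the help (ENABLED, DISABLED, DISABLING, ENABLING), a small guard can catch a mistyped state such as ENABLE before anything reaches the cluster. This is a sketch under my own naming: `is_valid_table_state`, `set_table_state_cmd`, and the `HBCK2_JAR` variable are hypothetical helpers, and the command is printed rather than executed.

```shell
#!/bin/sh
# Path to the HBCK2 jar; the default is a placeholder, adjust to your build.
HBCK2_JAR="${HBCK2_JAR:-hbase-hbck2-1.3.0-SNAPSHOT.jar}"

# Succeed only for the table states listed in the HBCK2 help.
is_valid_table_state() {
    case "$1" in
        ENABLED|DISABLED|DISABLING|ENABLING) return 0 ;;
        *) return 1 ;;
    esac
}

# Print (not run) the setTableState command for table $1 and state $2.
set_table_state_cmd() {
    if ! is_valid_table_state "$2"; then
        echo "invalid table state: $2" >&2
        return 1
    fi
    echo "sudo -u hbase hbase hbck -j $HBCK2_JAR setTableState $1 $2"
}

set_table_state_cmd NSSFJ_BAK ENABLED
```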

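2.3 Repair

  1. Run HDFS fsck to make sure no files under the HBase root directory are corrupt or missing. If any are, remove the bad blocks first.
  2. Check the Locks page of the web UI for locked procedures. First try bypass -o -r <pid> to unlock the parent procedure, then bypass -o to unlock the child procedures.
  3. Check the web UI for regions whose RIT is stuck in the OPENING/CLOSING state. Use assigns -o to reassign them, which creates a new procedure.
  4. Region missing on HDFS but present in meta: run assigns on the region ID, or repair with extraRegionsInMeta --fix.
  5. Region present on HDFS but missing from meta: try rebuilding the metadata with addFsRegionsMissingInMeta.
  6. "not deployed on any region server": try assigns on the region.
  7. Meta places the region on dn1 but it is actually deployed on dn2: try unassigns followed by assigns.
  8. Data holes ("There is a hole in the region chain between. You need to create a new .regioninfo and region dir in hdfs to plug the hole."):
    1. Use the fixMeta command to rebuild the holes and repair region overlaps (this had no effect when I tried it).
    2. Delete recovered.edits under the affected region's HDFS directory, then run assigns and related operations. Note that this loses some data. (This is what I used: only one or two regions had holes, and losing a little data was acceptable.) Blogs online describe deleting recovered.edits for every table; avoid that if you can. I deleted only the recovered.edits under the regions that had holes.
    3. Add configuration and restart the cluster: raise the RegionServer thread count to increase its capacity for handling regions (untested).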
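
For the recovered.edits approach, the path to remove can be built from the region's location so that only the affected region is touched. The sketch below assumes the layout `<root>/data/<namespace>/<table>/<encoded-region>/recovered.edits`, which matches the HDFS paths in the error messages in this note; `recovered_edits_path` is a name of my own, and the commands are printed for review rather than run.

```shell
#!/bin/sh
# Build the recovered.edits path for one region.
# $1 = HBase root dir, $2 = namespace, $3 = table, $4 = encoded region name
recovered_edits_path() {
    printf '%s/data/%s/%s/%s/recovered.edits\n' "$1" "$2" "$3" "$4"
}

# Print the listing and removal commands instead of executing them.
path=$(recovered_edits_path /hbase default NSSFJ_BAK b8e8bc8c9c03c7faade2e383eab83272)
echo "hdfs dfs -ls $path"
echo "hdfs dfs -rm -r $path"
```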


  STUCK Region-In-Transition rit=OPENING, location=cdh5,16020,1629258162292, table=SLFJ_BAK, region=c40673e055613675a71435e86bc9fc33

    sudo -u hbase hbase hbck -j hbase-hbck2-1.3.0-SNAPSHOT.jar assigns b8e8bc8c9c03c7faade2e383eab83272
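
When many regions are stuck, the encoded region names can be pulled out of the log rather than copied by hand. A minimal sketch, assuming log lines of the form shown above; `stuck_rit_regions` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the encoded region name (32 hex characters after "region=") from
# each "STUCK Region-In-Transition" line on stdin.
stuck_rit_regions() {
    grep 'STUCK Region-In-Transition' |
        sed -n 's/.*region=\([0-9a-f]\{32\}\).*/\1/p'
}

line='STUCK Region-In-Transition rit=OPENING, location=cdh5,16020,1629258162292, table=SLFJ_BAK, region=c40673e055613675a71435e86bc9fc33'
printf '%s\n' "$line" | stuck_rit_regions
```

The resulting names can be passed to assigns, after de-duplicating them with `sort -u` if the same region is logged repeatedly.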

ERROR: Found lingering reference file hdfs://nameservice1/hbase/data/default/NSSFJ_BAK/b8e8bc8c9c03c7faade2e383eab83272/cf/NSSFJ=e201b14d7ee8ae008ba14d60ed2deab2-d78a580abdb84b2a97cf3fbafc339233.b3bb80e20d70b78f72fcd25df20b856b

ERROR: Region { meta => NSSFJ_BAK,NX_GD_NSSF_FJ_P3_L9_089_AI0026.20201212170329,1626750860235.b8e8bc8c9c03c7faade2e383eab83272., hdfs => hdfs://nameservice1/hbase/data/default/NSSFJ_BAK/b8e8bc8c9c03c7faade2e383eab83272, deployed => , replicaId => 0 } not deployed on any region server.
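
For "not deployed on any region server" errors like the one above, the encoded region name is the last component of the `hdfs =>` path, so a list of assigns commands can be generated from the hbck report. This is a sketch with hypothetical helper names (`not_deployed_regions`, `to_assigns_cmds`); it prints the commands and leaves running them as a manual step.

```shell
#!/bin/sh
# Print the encoded region name of every region reported as
# "not deployed on any region server", taken from the "hdfs => ..." field.
not_deployed_regions() {
    grep 'not deployed on any region server' |
        sed -n 's#.*hdfs => [^ ]*/\([0-9a-f]\{32\}\),.*#\1#p'
}

# Turn each region name on stdin into an HBCK2 assigns command (printed only).
to_assigns_cmds() {
    while read -r region; do
        echo "sudo -u hbase hbase hbck -j hbase-hbck2-1.3.0-SNAPSHOT.jar assigns $region"
    done
}

sample='ERROR: Region { meta => NSSFJ_BAK,NX_GD_NSSF_FJ_P3_L9_089_AI0026.20201212170329,1626750860235.b8e8bc8c9c03c7faade2e383eab83272., hdfs => hdfs://nameservice1/hbase/data/default/NSSFJ_BAK/b8e8bc8c9c03c7faade2e383eab83272, deployed => , replicaId => 0 } not deployed on any region server.'
printf '%s\n' "$sample" | not_deployed_regions | to_assigns_cmds
```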

    #!/bin/bash
    # Dump the current locks from the HBase shell (stdout and stderr) to a file.
    echo "list_locks" | hbase shell >locks.log 2>&1

    # HBCK2 invocation shared by every command below.
    HBCK2=(sudo -u hbase hbase hbck -j hbase-hbck2-1.0.0.1.0.0.0-406.jar -d)

    # Print a command, then execute it.
    run() { echo "$*"; "$@"; }

    # pids of procedures holding an EXCLUSIVE lock, and the regions that are locked.
    pidList=$(grep 'Lock type: EXCLUSIVE, procedure:' locks.log | awk -F'"' '{print $8}')
    regionsUUID=$(grep 'REGION(' locks.log | awk -F'(' '{print $2}' | awk -F')' '{print $1}')

    # The bypass commands are only printed, so they can be reviewed and run by hand.
    for i in $pidList; do
        echo "${HBCK2[*]} bypass ${i}"
    done
    echo ""

    # For each locked region: set the state to CLOSING, then unassign and reassign.
    for i in $regionsUUID; do
        run "${HBCK2[@]}" setRegionState -o "${i}" CLOSING
        run "${HBCK2[@]}" unassigns -o "${i}"
        run "${HBCK2[@]}" assigns -o "${i}"
        # Alternatives, left disabled:
        # run "${HBCK2[@]}" extraRegionsInMeta --fix -o "${i}"
        # run "${HBCK2[@]}" addFsRegionsMissingInMeta -o "${i}"
        echo ""
    done
    rm -f locks.log

Reference: https://cloud.tencent.com/developer/article/1940084
Reference: https://blog.csdn.net/weixin_43736084/article/details/121336326
Reference: https://blog.csdn.net/ddxygq/article/details/120500151
Reference: https://www.modb.pro/db/54575