2021-5-1

问题: Problem: Fatal error when a node tries to create too many dynamic partitions

1. Exception description:

  1. More than 20% of the MR tasks failed, so the job was terminated.


  1. A fatal error occurred when a node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode; the maximum was set to 100.


  1. total number of created files now is 100385, which exceeds 100000. Killing the job.

2. Exception screenshots:
image.png

  1. 2021-04-06 18:11:31,880 FATAL [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"tagname":"DS02_HFECF202","point_time":"2020-10-19 06:50:06.000","value_double":0.61538,"status":0}
  2. at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
  3. at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
  4. at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
  5. at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
  6. at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
  7. at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
  8. at java.security.AccessController.doPrivileged(Native Method)
  9. at javax.security.auth.Subject.doAs(Subject.java:422)
  10. at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
  11. at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
  12. Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
  13. at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:897)
  14. at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:677)
  15. at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
  16. at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
  17. at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
  18. at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:98)
  19. at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
  20. at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
  21. ... 9 more

image.png

  1. INFO : 2021-04-07 10:41:07,984 Stage-1 map = 100%, reduce = 26%, Cumulative CPU 697045.71 sec
  2. INFO : 2021-04-07 10:41:42,345 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 698726.2 sec
  3. INFO : 2021-04-07 10:42:03,198 Stage-1 map = 100%, reduce = 28%, Cumulative CPU 700485.6 sec
  4. INFO : 2021-04-07 10:42:24,093 Stage-1 map = 100%, reduce = 29%, Cumulative CPU 702572.46 sec
  5. INFO : 2021-04-07 10:42:49,162 Stage-1 map = 100%, reduce = 30%, Cumulative CPU 705569.34 sec
  6. ERROR : [Fatal Error] total number of created files now is 100087, which exceeds 100000. Killing the job.
  7. INFO : MapReduce Total cumulative CPU time: 8 days 3 hours 59 minutes 29 seconds 340 msec
  8. ERROR : Ended Job = job_1617708913766_0003 with errors
  9. ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
  10. INFO : MapReduce Jobs Launched:
  11. INFO : Stage-Stage-1: Map: 5366 Reduce: 1099 Cumulative CPU: 705569.34 sec HDFS Read: 1538438367542 HDFS Write: 117851554854 FAIL
  12. INFO : Total MapReduce CPU Time Spent: 8 days 3 hours 59 minutes 29 seconds 340 msec
  13. INFO : Completed executing command(queryId=hive_20210407095757_325032e7-7ce2-4355-9f3d-5ef53b593f51); Time taken: 2759.992 seconds

3. Solutions:
3.1 Method 1: fix the problem of Hive creating too many files

Today, while inserting the data from a temporary table into the production table partitioned by day, I hit the case where Hive created more than 100,000 files. My SQL was as follows:

  1. insert into table vestore_all partition(year,mounth,day) select trim(tagname),from_unixtime(unix_timestamp(point_time),'yyyy-MM-dd HH:mm:ss'),cast(value_double as double),cast(status as int),from_unixtime(unix_timestamp(point_time),'yyyy'),from_unixtime(unix_timestamp(point_time),'MM'),from_unixtime(unix_timestamp(point_time),'dd') from vestore_tmp;

The vestore_tmp table holds about 1435 GB of data, which splits into 10 × 31 = 310 partitions. When the SQL ran it launched 5366 mappers and 0 reducers. About halfway through, the following exception occurred and the SQL eventually failed:

  1. [Fatal Error] total number of created files now is 100385, which exceeds 100000. Killing the job.

The cause is that Hive limits the total number of files a job may create (hive.exec.max.created.files), 100,000 by default. When this SQL runs, every mapper creates 310 files, one per partition, so the job would create 5366 × 310 = 1,663,460 files in total and is guaranteed to hit the exception above. The simplest way to get the SQL through is to raise hive.exec.max.created.files.
20190315140951222.png
Add the following settings before running the partitioned insert:

  1. set hive.exec.dynamic.partition=true;
  2. set hive.exec.dynamic.partition.mode=nonstrict;
  3. -- set hive.exec.max.dynamic.partitions.pernode=10000;
  4. -- set hive.exec.max.dynamic.partitions=10000;
  5. set hive.exec.max.created.files=2000000;

But there is a catch: this produces a huge number of small files. Since vestore_tmp holds about 1435 GB, the average file size would be 1435 GB × 1024 / 1,663,460 ≈ 0.88 MB. Needless to say, more than 1.6 million files that small are very bad for Hadoop. Is there a better way? Yes!

We can send all rows with the same day value to the same reducer with DISTRIBUTE BY day, so each partition is written by a single reducer and the output shrinks to roughly one file per partition. The modified SQL is:

  1. insert into table vestore_all partition(year,mounth,day) select trim(tagname),from_unixtime(unix_timestamp(point_time),'yyyy-MM-dd HH:mm:ss'),cast(value_double as double),cast(status as int),from_unixtime(unix_timestamp(point_time),'yyyy'),from_unixtime(unix_timestamp(point_time),'MM'),from_unixtime(unix_timestamp(point_time),'dd') from vestore_tmp DISTRIBUTE BY day;

The modified SQL ran fine and no longer hit the exception, but a new problem appeared: the data is distributed very unevenly across the 31 day values, so some reducers process more than 100 GB while others get only a few GB, which makes the SQL very slow.
Can the 1435 GB be spread evenly across the reducers? Yes! We can use DISTRIBUTE BY rand() to assign rows to reducers at random, so every reducer processes roughly the same amount of data. I set each reducer to handle 3 GB, which for 1435 GB of data launches about 478 reducers. The modified SQL is:

  1. set hive.exec.dynamic.partition=true;
  2. set hive.exec.dynamic.partition.mode=nonstrict;
  3. set hive.exec.max.dynamic.partitions.pernode=10000;
  4. set hive.exec.max.dynamic.partitions=10000;
  5. set hive.exec.max.created.files=200000;
  6. set hive.exec.reducers.bytes.per.reducer=3221225472;
  7. insert into table vestore_all partition(year,mounth,day) select trim(tagname),from_unixtime(unix_timestamp(point_time),'yyyy-MM-dd HH:mm:ss'),cast(value_double as double),cast(status as int),from_unixtime(unix_timestamp(point_time),'yyyy'),from_unixtime(unix_timestamp(point_time),'MM'),from_unixtime(unix_timestamp(point_time),'dd') from vestore_tmp DISTRIBUTE BY rand();

This SQL ran in good time, and the number of files produced equals the number of reducers times the number of partitions (287 × 310), roughly 90,000 files.
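To sanity-check the result, the partitions and per-partition file counts can be inspected from inside Hive. A minimal sketch (INPUT__FILE__NAME is Hive's standard virtual column; the partition values below are only an example taken from the sample row above):

  1. SHOW PARTITIONS vestore_all;
  2. -- rough file count for one example partition, via the INPUT__FILE__NAME virtual column
  3. SELECT COUNT(DISTINCT INPUT__FILE__NAME) FROM vestore_all WHERE year='2020' AND mounth='10' AND day='19';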

2021-5-2

Exception description:

Problem: "Total allocation exceeds 50.00% (1,431,830,528 bytes) of heap memory. Scaling row group sizes to 8.14% for 131 writers"

Exception details:
image.png
Solution:

  1. Increase the ResourceManager maximum container allocation: yarn.scheduler.maximum-allocation-mb = 32 GB
  2. Increase the NodeManager container memory: yarn.nodemanager.resource.memory-mb = 80 GB (MB equivalents are given below)
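Both properties take their values in megabytes, so in yarn-site.xml (or in the YARN configuration of the cluster manager) the settings above correspond roughly to:

  1. yarn.scheduler.maximum-allocation-mb=32768
  2. yarn.nodemanager.resource.memory-mb=81920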

2021-5-3

Problem: Hive SQL execution fails with a "return code 2" error

  1. 18/07/29 00:53:04 ERROR operation.Operation: Error running hive query:
  2. org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
  3. at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:374)
  4. at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:180)
  5. at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
  6. at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
  7. at java.security.AccessController.doPrivileged(Native Method)
  8. at javax.security.auth.Subject.doAs(Subject.java:415)
  9. at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
  10. at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
  11. at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  12. at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  13. at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  14. at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  15. at java.lang.Thread.run(Thread.java:745)

The exact root cause is unknown for now; adding the following statement before the SQL works around the problem (a usage sketch follows the setting below):

  1. set hive.support.concurrency=false;
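For example, a minimal sketch of applying the workaround in the same Hive/beeline session (the query is only a placeholder; substitute the statement that actually failed):

  1. set hive.support.concurrency=false;
  2. -- placeholder query: replace with the statement that was failing
  3. select count(*) from realtimedcs.wtr31;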

2021-5-31

Problem: HiveServer2 accepts connections but cannot be used; show databases / show tables fail

Error message: Error: Error running query: java.lang.NoClassDefFoundError: Could not initialize class org.apache.atlas.hive.hook.HiveHook (state=,code=0)
image.png
Fix: copy the Atlas configuration file into the most recent HiveServer2 runtime configuration (process) directory.
image.png
If that still does not work, restart the HiveServer2 service on that node and copy the configuration file into the newly generated runtime configuration directory.

2021-07-12

Problem: cannot insert into a partitioned Hive table

报错:AnalysisException: Not enough partition columns mentioned in query. Missing columns are: splittime
Cause: the partition column was not specified.
image.png
Fix: specify the partition column (see also the note after the SQL below).
image.png

  1. INSERT INTO dcs_wtr31_bakup
  2. partition(splittime)
  3. SELECT * FROM realtimedcs.wtr31;
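If the same insert is run through Hive rather than Impala (the AnalysisException above is an Impala error), leaving the partition value out makes this a dynamic-partition insert, so the dynamic-partition settings may also need to be enabled first. A minimal sketch, reusing the settings from the 2021-5-1 entry:

  1. set hive.exec.dynamic.partition=true;
  2. set hive.exec.dynamic.partition.mode=nonstrict;
  3. INSERT INTO TABLE dcs_wtr31_bakup PARTITION (splittime) SELECT * FROM realtimedcs.wtr31;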

2021-11-15

Problem: Hive / Impala / Kudu cannot handle Chinese

Fix: change the character set of the relevant columns of the metastore tables in MySQL (a verification sketch follows the list below).

  1. Change the database comment column so that Chinese can be stored:

    1. ALTER TABLE DBS MODIFY COLUMN `DESC` VARCHAR (4000) CHARACTER SET utf8;
    2. FLUSH PRIVILEGES;
  2. Change table comments (table parameters) so that Chinese can be stored:

    1. alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
    2. FLUSH PRIVILEGES;
  3. Change the partition parameters so that partition keys and comments can use Chinese:

    1. alter table PARTITION_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
    2. alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
    3. FLUSH PRIVILEGES;
  4. Change the index parameter character set:

    1. alter table INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
    2. FLUSH PRIVILEGES;
  5. Change the column comment character set:

    1. alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
    2. FLUSH PRIVILEGES;
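After the metastore changes, a quick way to verify from Hive is to create a throwaway table with Chinese comments and check how they are stored (the table name below is hypothetical):

  1. CREATE TABLE comment_zh_test (id INT COMMENT '编号') COMMENT '中文注释测试';
  2. DESCRIBE FORMATTED comment_zh_test;  -- the table and column comments should now display as Chinese instead of '??'
  3. DROP TABLE comment_zh_test;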