1. Problem 1
19/06/17 09:50:52 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container marked as failed: container_1560518528256_0014_01_000003 on host: hadoop-master. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
19/06/17 09:50:52 ERROR cluster.YarnScheduler: Lost executor 2 on hadoop-master: Container marked as failed: container_1560518528256_0014_01_000003 on host: hadoop-master. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
19/06/17 09:50:52 WARN scheduler.TaskSetManager: Lost task 22.0 in stage 0.0 (TID 17, hadoop-master, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1560518528256_0014_01_000003 on host: hadoop-master. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
19/06/17 09:50:52 WARN scheduler.TaskSetManager: Lost task 21.0 in stage 0.0 (TID 16, hadoop-master, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1560518528256_0014_01_000003 on host: hadoop-master. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
When running Spark on YARN, the executors requested more resources than YARN could schedule, so YARN killed the containers, producing the lost-executor errors above.
In most cases this means executor-memory or executor-cores is set unreasonably and exceeds the maximum resources (memory or CPU cores) YARN can allocate on a node.
For example: 3 servers, each with 32 cores and 64GB of memory, and
yarn.scheduler.maximum-allocation-mb = 68G
Settings:
num-executors = 30 (10 executors per node)
executor-memory = 6G (6G of memory per executor)
executor-cores = 5 (5 cores per executor)
Memory used by executors on each node: 10 × 6.5G (6G plus 512M of off-heap overhead) = 65G, within the limit.
Cores used by executors on each node: 10 × 5 = 50, which exceeds the 32-core limit, hence the error.
Changing executor-cores = 3 (10 × 3 = 30 ≤ 32) solved the problem.
A memory misconfiguration is diagnosed the same way.
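The per-node arithmetic above can be sketched as a small check. This is a hypothetical helper (the function name and signature are my own, not Spark's); it just encodes the "usage = executors × (memory + overhead), cores = executors × cores" reasoning from the example:

```python
# Hypothetical sketch: check one NodeManager's executor footprint
# against its limits, using the numbers from the example above
# (32-core / 68G-max nodes, 10 executors per node).

def check_node_resources(executors_per_node, executor_memory_gb, executor_cores,
                         overhead_gb, node_cores, max_alloc_gb):
    """Return (memory_ok, cores_ok) for a single node."""
    mem_used = executors_per_node * (executor_memory_gb + overhead_gb)
    cores_used = executors_per_node * executor_cores
    return mem_used <= max_alloc_gb, cores_used <= node_cores

# Original settings: 10 executors x (6G + 0.5G overhead), 5 cores each
print(check_node_resources(10, 6, 5, 0.5, 32, 68))  # (True, False): 50 cores > 32
# Fixed settings: executor-cores = 3
print(check_node_resources(10, 6, 3, 0.5, 32, 68))  # (True, True)
```

The same check flags a memory overrun when the first element of the tuple comes back False.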
2. Problem 2
19/10/25 10:25:14 ERROR cluster.YarnScheduler: Lost executor 9 on cdh-master: Container killed by YARN for exceeding memory limits. 9.5 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
19/10/25 10:25:14 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 7.0 (TID 690, cdh-master, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 9.5 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The problem is obvious: the container ran out of physical memory.
sudo -uhdfs spark-submit \
--class com.sm.analysis.AnalysisRetained \
--master yarn \
--deploy-mode client \
--driver-memory 3G \
--driver-cores 3 \
--num-executors 3 \
--executor-memory 8g \
--executor-cores 5 \
--jars /usr/java/jdk1.8.0_211/lib/mysql-connector-java-5.1.47.jar \
--conf spark.default.parallelism=30 \
/data4/liujinhe/tmp/original-analysis-1.0-SNAPSHOT.jar
Each executor was allocated 8G. On top of that comes the off-heap overhead: 8 × 1024m × 0.07 ≈ 573m, which YARN rounds up to the next 512m allocation increment, i.e. 1024m, giving a 9G container. That 9G was not enough, so the memory was increased to 10G.
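The container-size derivation can be written out as a short sketch. This assumes the older Spark default overhead formula max(executorMemory × 0.07, 384m) and a 512m YARN allocation increment; the function name and parameters are my own, and newer Spark versions use a 0.10 factor and the spark.executor.memoryOverhead key instead:

```python
# Hypothetical sketch of how the "9 GB" container limit in the error arises.
# Assumes overhead = max(executorMemory * 0.07, 384m), rounded up to the
# 512m YARN allocation increment (both are assumptions about this cluster).
import math

def container_size_mb(executor_memory_mb, overhead_factor=0.07,
                      min_overhead_mb=384, increment_mb=512):
    overhead = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    # YARN rounds the overhead request up to the allocation increment
    overhead = math.ceil(overhead / increment_mb) * increment_mb
    return executor_memory_mb + overhead

print(container_size_mb(8 * 1024))  # 9216 -> the 9 GB limit in the log
```

Once the working set exceeds that computed limit, YARN kills the container, which is why either raising executor-memory (as done here) or boosting the overhead setting resolves the error.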
