Warnings

Parts of this section are taken from https://dongkelun.com/2019/01/09/sparkExceptions/

1. spark.executor.memoryOverhead

Overhead (off-heap) memory for each executor, which defaults to 10% of executor memory (with a 384 MB floor). When the data volume is large, keeping the default leads to the exception below and crashes the job:

    Container killed by YARN for exceeding memory limits. 1.8 GB of 1.8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Fix (set the exact value according to your workload; a programmatic sketch follows the flags below):

New versions:

    --conf spark.executor.memoryOverhead=2048

Old versions:

    --conf spark.yarn.executor.memoryOverhead=2048

If you use the old key on a newer version, you get the following warning:

    WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
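
The same key can also be set when the SparkSession is built. A minimal PySpark sketch, assuming nothing has created the session yet; the value 2048 (MiB) is only an example:

    # Sketch: set the overhead before the session (and its executors) exists.
    # Equivalent to passing --conf spark.executor.memoryOverhead=2048 to spark-submit.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("memory-overhead-example")
        .config("spark.executor.memoryOverhead", "2048")  # MiB; size to your workload
        .getOrCreate()
    )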

2. No more replicas available for rdd_

    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_3250_73 !
    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_12_38 !
    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_3250_38 !
    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_3250_148 !
    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_3250_6 !
    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_3250_112 !
    19/01/08 12:36:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_12_100 !

Fix: increase executor memory (a related persist-level sketch follows the flag below):

    --executor-memory 4G
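
This warning means the last cached copy of an RDD block was lost, typically because an executor was killed for running out of memory. Besides raising executor memory, a hedged mitigation (not from the original post) is to persist at MEMORY_AND_DISK so cached partitions can spill to disk instead of piling up on the heap; the path below is purely illustrative:

    # Sketch: RDD.cache() defaults to MEMORY_ONLY; an explicit MEMORY_AND_DISK
    # persist lets partitions spill to disk under pressure, reducing the memory
    # load that gets executors killed and their cached blocks lost.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.textFile("/tmp/illustrative_input.txt")  # hypothetical input
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.count())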


Errors

1. Couldn't find Spark (SPARK_HOME)

    ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).

Context: this error appears when calling pyspark from a jupyter notebook and Spark cannot be found.
Cause: jupyter notebook (call it A) was started before pyspark, so the jupyter notebook configured for pyspark (call it B) found its configured port already taken by A.
Fix: run ps -ef | grep jupyter to find the running jupyter process and kill it, then start the environment with pyspark.

Appendix: configuring a jupyter notebook for pyspark. Add the following two lines to ~/.bashrc:

    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS=notebook

Reload the environment with source ~/.bashrc, then start the Spark environment with pyspark.

2. File does not exist

    java.io.FileNotFoundException: File does not exist: hdfs://master:8020/user/hive/warehouse/trans_parquet/ymd=2019-08-07/h=14/m=00/part-00000-722ae06f-a562-4ee6-9d36-d0f6e50282e0-c000
    It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
[Cause] The file is being written while it is being read.
[Fix] Option 1: save the file to another location first, then read that copy.
Option 2: add the following in the Python code (see also the refresh sketch after this block):

    parquetFiles = spark.read.parquet("/user/hive/warehouse/trans_parquet/ymd={0}/".format(date.replace("'", "")))
    parquetFiles.createOrReplaceTempView("temp")
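
The error message itself also suggests invalidating Spark's cached file listing before re-reading. A hedged sketch of that route, assuming trans_parquet is the Hive table behind the path above (spark and date as in the snippet above):

    # Sketch following the hint in the error: refresh cached metadata, then re-read.
    spark.sql("REFRESH TABLE trans_parquet")                            # if reading via the metastore table
    spark.catalog.refreshByPath("/user/hive/warehouse/trans_parquet")   # if reading by path
    parquetFiles = spark.read.parquet("/user/hive/warehouse/trans_parquet/ymd={0}/".format(date.replace("'", "")))
    parquetFiles.createOrReplaceTempView("temp")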

3. Errors caused by missing data

Possible causes:

  1. There is no data at read time. In one case, launching from a shell script somehow introduced a stray '\r' character, corrupting the format of the arguments passed to the .py script; rewriting that shell script is recommended.
  2. The data disappears while the job runs, usually because a filter step removed every record. Handle this case by catching the exception with try...except (a sketch follows this list).
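
A hedged sketch covering both cases; the path reuses the trans_parquet example from section 2 purely for illustration, and spark is assumed to be an existing SparkSession:

    # Sketch: strip arguments so a stray '\r' from the shell script cannot corrupt
    # them, and catch the empty/missing-data case instead of crashing the job.
    import sys

    date = sys.argv[1].strip()      # drops a trailing '\r' or whitespace
    try:
        df = spark.read.parquet("/user/hive/warehouse/trans_parquet/ymd={0}/".format(date))
        if df.rdd.isEmpty():        # nothing left after reading / filtering
            print("no data for", date)
        else:
            df.createOrReplaceTempView("temp")
    except Exception as e:          # e.g. the path for this date does not exist
        print("failed to load data:", e)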

4. Apache Spark jobs hang indefinitely

TL;DR: a non-deterministic user-defined UDF is involved; adding .cache() fixes it.

The hang shows up in DEBUG-level logs as fragments like the following repeating endlessly (normally they repeat no more than about 10 times):

    19/12/19 14:16:59 DEBUG SaslRpcClient: unwrapping token of length:1999
    19/12/19 14:16:59 DEBUG Client: IPC Client (376159217) connection to master/192.168.203.42:8020 from cac/node4@HADOOP.COM got value #327
    19/12/19 14:16:59 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 16ms
    19/12/19 14:16:59 DEBUG SaslRpcClient: reading next wrapped RPC packet
    19/12/19 14:16:59 DEBUG Client: IPC Client (376159217) connection to master/192.168.203.42:8020 from cac/node4@HADOOP.COM sending #328 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations
    19/12/19 14:16:59 DEBUG SaslRpcClient: wrapping token of length:215
    19/12/19 14:16:59 DEBUG SaslRpcClient: Sending sasl message state: WRAP
    token: "`\202\001\003\006\t*\206H\206\367\022\001\002\002\002\001\021\000\020\000\377\377Y\3515\225/\027\251\211\340&\237\312{~\217\272[AA\360*t\034\275\231\341\r\350\002\320-;\242\366\3764k\372Rq\024\213\
    316\316\306O/A\347HY\264@WK\n\227\270\345\206h\306\335h|\262\030\027\330\033\242\026\216\341\370\376\337\276dI\n\352\263q\\=O0c\342\270\361\247\031B\376V\272YK\352,\357BQ\243u\266\371\275>\027\022\317\
    215b\361H\252<4s\\\376\031\230\006\247\212&\217\005\006\245(S{\242\244d.K\250]$\332\364\331\241\006\004&X\217\0347\203\203b\b\216y\355`\036\027c\343\223\327\004\363\"\355Nw \206\352\313r\b\256\325Xj\315r\v
    A\372\314\\\233\\\a4\371\024\273$[*\235\026\314\315\220\253\374\347\342z\305\252\274\250H\371`\017\204\206g\320\277\343\204\357\206\334\000\002}\021\027\204\331\337"

Problem
Apache Spark jobs can sometimes hang indefinitely because of the non-deterministic behavior of a Spark user-defined function (UDF). Here is an example of such a function:

    val convertorUDF = (commentCol: String) => {
      // UDF definition
    }
    val translateColumn = udf(convertorUDF)

If this UDF is invoked through the withColumn() API and some filter transformations are then applied to the resulting DataFrame, the UDF may be executed multiple times for each record, hurting application performance.

    val translatedDF = df.withColumn("translatedColumn", translateColumn(df("columnToTranslate")))
    val filteredDF = translatedDF.filter(!translatedDF("translatedColumn").contains("Invalid URL Provided")
      && !translatedDF("translatedColumn").contains("Unable to connect to Microsoft API"))

Cause
A seemingly deterministic UDF can behave non-deterministically, with duplicate invocations depending on how the UDF is defined. This typically happens when you use the UDF via the withColumn() API to add a column to a DataFrame and then apply a transformation (such as a filter) to the resulting DataFrame.
Solution
UDFs must be deterministic: because of optimizations, duplicate invocations may be eliminated, or the function may be invoked more times than it appears in the query.
A better option is to cache the DataFrame that uses the UDF; if the DataFrame holds a large amount of data, writing it out to a Parquet file is best.
You can cache the result with the following code:
    val translatedDF = df.withColumn("translatedColumn", translateColumn(df("columnToTranslate"))).cache()
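
The same fix in PySpark, as a hedged sketch; convertor, the column names, and df are placeholders rather than code from the original job:

    # Sketch: materialize the UDF output once with cache() so later filters reuse
    # it instead of re-running the UDF for every use of the column.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def convertor(comment):        # placeholder UDF body
        return comment.upper() if comment else comment

    translate_column = udf(convertor, StringType())

    translated_df = df.withColumn("translatedColumn", translate_column(df["columnToTranslate"])).cache()
    filtered_df = translated_df.filter(~translated_df["translatedColumn"].contains("Invalid URL Provided"))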

5. Failed to allocate a page

    19/01/09 09:12:39 WARN TaskMemoryManager: Failed to allocate a page (1048576 bytes), try again.
    19/01/09 09:12:41 WARN TaskMemoryManager: Failed to allocate a page (1048576 bytes), try again.
    19/01/09 09:12:41 WARN NioEventLoop: Unexpected exception in the selector loop.
    java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.Integer.valueOf(Integer.java:832)
        at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:120)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
        at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:748)

Fix: increase driver memory (a quick verification sketch follows):

    --driver-memory 6G
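
A hedged note: in client mode spark.driver.memory only takes effect when it is passed at launch (spark-submit flag or spark-defaults), not from application code, so it is worth printing the effective value once the job starts:

    # Sketch: confirm the driver actually got the memory you asked for.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.sparkContext.getConf().get("spark.driver.memory", "not set"))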

6. Uncaught exception in thread task-result-getter-3

    19/01/10 09:31:50 ERROR Utils: Uncaught exception in thread task-result-getter-3
    java.lang.OutOfMemoryError: Java heap space

Fix: increase driver memory (same as above).

7. AsyncEventQueue: Dropping event from queue eventLog

    19/05/20 11:49:54 ERROR AsyncEventQueue: Dropping event from queue eventLog. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
    19/05/20 11:49:54 WARN AsyncEventQueue: Dropped 1 events from eventLog since Thu Jan 01 08:00:00 CST 1970.

Fix: increase spark.scheduler.listenerbus.eventqueue.capacity (default 10000).

New versions:

    --conf spark.scheduler.listenerbus.eventqueue.capacity=100000

Old versions use spark.scheduler.listenerbus.eventqueue.size instead; using the old key on a newer version produces:

    19/05/21 14:38:15 WARN SparkConf: The configuration key 'spark.scheduler.listenerbus.eventqueue.size' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.scheduler.listenerbus.eventqueue.capacity' instead.

See the reference for details.

8. Kryo serialization failed: data serialization issue

    org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1114384. To avoid this, increase spark.kryoserializer.buffer.max value.

The error occurs when serializing data that is too large. Add the following configuration, adjusting the values to the data size (a programmatic sketch follows the flags):

    --conf spark.kryoserializer.buffer.max=256m
    --conf spark.kryoserializer.buffer=64m
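
The same buffer settings can be given when the SparkSession is built; a minimal sketch, assuming Kryo is enabled explicitly and reusing the example values above:

    # Sketch: Kryo buffer sizes set before the session exists; size them to the
    # largest objects being serialized.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  # assumes Kryo is wanted explicitly
        .config("spark.kryoserializer.buffer.max", "256m")
        .config("spark.kryoserializer.buffer", "64m")
        .getOrCreate()
    )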

9. ImportError when calling a custom function

    File "/usr/local/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
      func, profiler, deserializer, serializer = read_command(pickleSer, infile)
    File "/usr/local/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
      command = serializer._read_with_length(file)
    File "/usr/local/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
      return self.loads(obj)
    File "/usr/local/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
      return pickle.loads(obj, encoding=encoding)
    File "/tmp/spark-426b96eb-d2d0-4fde-9e51-0865d244084a/userFiles-f4b69f58-7ab5-4f74-9cf4-ad373b715678/modelProfile.py", line 22, in <module>
      from util import splitToken, sentenceClear
    ImportError: No module named 'util'

Cause:
The Spark documentation describes addPyFile as follows:

  • Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future
  • Literally: the added .py file is made available as a dependency for every task subsequently executed on this SparkContext.
  • My understanding: Spark keeps a dependency store, and addPyFile places a temporary copy of the file there for the program to use. That is also why skipping this step and simply running import xxxx makes Spark fail with "ImportError: No module named xxxx".

Solution 1:

    from pyspark.context import SparkContext
    sc = SparkContext.getOrCreate()
    sc.addPyFile('util.py')

Solution 2:
Use a closure instead, so that no file has to be imported at all, avoiding path-related problems (a sketch follows).
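
A hedged sketch of the closure approach; splitToken here is a stand-in for the helper from util.py, and df and the column name are illustrative:

    # Sketch: the helper is defined inside the function that is shipped to the
    # executors, so no separate util.py needs to be importable on the workers.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def make_clean_udf():
        def splitToken(text):              # stand-in for the helper from util.py
            return text.split()

        def clean(text):
            return " ".join(splitToken(text)) if text else text

        return clean

    clean_udf = udf(make_clean_udf(), StringType())
    df2 = df.withColumn("cleaned", clean_udf(df["sentence"]))
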
Solution 3:
See the "Passing Functions to Spark" section of the official documentation, https://spark.apache.org/docs/2.4.0/rdd-programming-guide.html.