scala

  1. Error: WARN deploy.SparkSubmit$$anon$2: Failed to load

```
20/11/02 13:37:06 WARN deploy.SparkSubmit$$anon$2: Failed to load json2es.
java.lang.ClassNotFoundException: json2es
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:810)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
Fix: declare a package at the top of the source file and pass the fully qualified class name to spark-submit with --class. For example:

```scala
package cn.test

object json2es {
  def main(args: Array[String]): Unit = {
    println("hello world")
  }
}
```

Then submit with:

```bash
spark-submit --class cn.test.json2es xx.jar
```

python

The two setups below are the launch methods I use most often. Their options can be cross-referenced against each other, and most option names make their meaning clear; if they do not, first brush up on the basics of Spark and the JVM.

This is the launch setup I use regularly for starting Jupyter:

```bash
kinit -kt /etc/var/keytab/cac.keytab cac/node4@HADOOP.COM
myconf="/home/pqchen/RuZhi/kmeansofmissjudgment/server/conf/"
# Note: spark.memory.offHeap.size is only honored when
# spark.memory.offHeap.enabled=true is also set.
pyspark --executor-memory=20G \
  --executor-cores=5 \
  --driver-memory=2G \
  --conf spark.dynamicAllocation.maxExecutors=5 \
  --conf spark.default.parallelism=200 \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.memory.offHeap.size=1G \
  --conf spark.executor.memoryOverhead=1G \
  --conf spark.debug.maxToStringFields=1000 \
  --conf spark.kryoserializer.buffer.max=1500m \
  --conf spark.driver.maxResultSize=1500m \
  --conf spark.kryoserializer.buffer=1500m \
  --jars $myconf/jar/mysql-connector-java-8.0.16.jar,/home/pqchen/.local/lib/python3.6/site-packages/graphframes-0.8.0-spark2.4-s_2.11.jar \
  --driver-class-path $myconf/jar/mysql-connector-java-8.0.16.jar \
  --py-files /home/pqchen/.local/lib/python3.6/site-packages/geoip2.zip,/home/pqchen/.local/lib/python3.6/site-packages/maxminddb.zip \
  --files /home/pqchen/RuZhi/GeoLite2-City.mmdb
```
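
Once the shell is up, everything shipped above is usable from the session. Below is a minimal usage sketch, assuming the `spark` session object that the pyspark shell creates; the JDBC URL, credentials, table and column names are hypothetical placeholders:

```python
# Sketch only: exercises the MySQL connector (--jars/--driver-class-path),
# the geoip2/maxminddb packages (--py-files) and GeoLite2-City.mmdb (--files)
# from the command above. Connection details, table and column names are
# made up for illustration.
from pyspark import SparkFiles
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Read a table through the JDBC driver added with --jars/--driver-class-path.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/testdb")   # hypothetical
      .option("dbtable", "access_log")                    # hypothetical
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load())

def ip_to_city(ip):
    # SparkFiles.get resolves the local copy of a --files artifact on
    # whichever executor runs this UDF.
    import geoip2.database
    reader = geoip2.database.Reader(SparkFiles.get("GeoLite2-City.mmdb"))
    try:
        return reader.city(ip).city.name
    except Exception:
        return None
    finally:
        reader.close()

df.withColumn("city", F.udf(ip_to_city, StringType())("client_ip")).show()
```

Opening one `Reader` per row is fine for a smoke test; for real workloads, open one per partition with `mapPartitions` instead.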

This is how I usually submit jobs, with a full set of options:

```bash
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3
export PATH=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/:${PATH}
export PYSPARK_DRIVER_PYTHON=python
RUNNINGPATH=$PWD
DownloadDataDir=$RUNNINGPATH/dbresult/clusterResult/statistics`date +%Y-%m`
DownloadDataRemoteHome=/home/cacboy/work/afterdelete/kmeansResult
kinit -kt /etc/var/keytab/cac.keytab cac/node4@HADOOP.COM
# Rebuild the egg that ships the project code to the executors.
if [ -d "$RUNNINGPATH/dist" ]; then
  rm -f $RUNNINGPATH/dist/*
fi
python3 setup.py bdist_egg > /dev/null
egg=$RUNNINGPATH/dist/`ls $RUNNINGPATH/dist/`
myconf="$RUNNINGPATH/server/conf/"
spark-submit --executor-memory=5G \
  --executor-cores=10 \
  --driver-memory=4G \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.default.parallelism=150 \
  --conf spark.memory.fraction=0.85 \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.memory.offHeap.size=2G \
  --conf spark.executor.memoryOverhead=2048 \
  --conf spark.core.connection.ack.wait.timeout=300 \
  --conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:./server/conf/log4j.properties' \
  --conf spark.local.dir=/home/dfs/tmp \
  --py-files $egg \
  --jars $myconf/jar/mysql-connector-java-8.0.16.jar \
  --driver-class-path $myconf/jar/mysql-connector-java-8.0.16.jar \
  ./server/cluster/main.py --dbConfigField='cac'
```
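
For reference, a hypothetical sketch of what the entry point `./server/cluster/main.py` might look like; only the `--dbConfigField` option is taken from the submit line above, while the application name and config handling are placeholders:

```python
# Hypothetical entry-point sketch; only --dbConfigField comes from the
# submit command above, everything else is illustrative.
import argparse
from pyspark.sql import SparkSession

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dbConfigField", required=True,
                        help="DB config section to use, e.g. 'cac'")
    args = parser.parse_args()

    # Memory, cores and parallelism all come from spark-submit, so the
    # builder only has to name the application.
    spark = SparkSession.builder.appName("kmeans-cluster").getOrCreate()
    print("using DB config section:", args.dbConfigField)
    # ... load data, run the clustering job, write results back ...
    spark.stop()

if __name__ == "__main__":
    main()
```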