## 1. Notes

1. When developing remotely with VS Code, `python file.py` resolves the pyspark import through the `$PYTHONPATH` environment variable, so this variable must be set first, otherwise the pyspark package cannot be imported:

   ```bash
   vim ~/.bashrc

   export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH

   source ~/.bashrc
   ```

2. Test: run `python wordcount.py`; the results can be checked under the `/root/output` directory. A quick import check is sketched right after this list.
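A quick way to confirm that `$PYTHONPATH` is picked up is to import pyspark from a plain Python interpreter (a minimal check, assuming the exports above are in effect):

```bash
# Should print the PySpark version shipped with $SPARK_HOME;
# an ImportError means the PYTHONPATH export has not taken effect yet
python -c "import pyspark; print(pyspark.__version__)"
```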
## 2. Integrated Development Environment Testing
### 2.1 Spark local mode test
Run it directly with `python wordcount.py`:
```python
from pyspark import SparkContext
from pyspark import SparkConf


def CreateSparkContext():
    # Build the SparkContext and turn off the console progress bar
    sparkConf = SparkConf() \
        .setAppName("WordCounts") \
        .set("spark.ui.showConsoleProgress", "false")
    sc = SparkContext(conf=sparkConf)
    print("master=" + sc.master)
    SetLogger(sc)
    SetPath(sc)
    return sc


def SetLogger(sc):
    # Quiet the JVM-side log4j loggers so only errors are shown
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)


def SetPath(sc):
    # Read from the local file system in local mode, otherwise from HDFS
    global Path
    if sc.master[0:5] == "local":
        Path = "/root/pythonwork/test/"
    else:
        Path = "hdfs://master:9000/user/"


if __name__ == "__main__":
    print("Starting WordCount")
    sc = CreateSparkContext()
    print("Reading text file...")
    textFile = sc.textFile(Path + "test.txt")
    print("The file has " + str(textFile.count()) + " lines")
    countsRDD = textFile.flatMap(lambda line: line.split(" ")) \
                        .map(lambda word: (word, 1)) \
                        .reduceByKey(lambda x, y: x + y)
    print("There are " + str(countsRDD.count()) + " word entries")
    print("Saving to text file...")
    countsRDD.saveAsTextFile("/root/pythonwork/test/output")
```
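After the run, the counts are written as `part-*` files under the output directory. Note that `saveAsTextFile` refuses to write to a directory that already exists, so clear it before re-running (paths follow the local layout used above):

```bash
# Inspect the word counts produced by the job
cat /root/pythonwork/test/output/part-*

# Remove the output directory before running the script again,
# otherwise saveAsTextFile fails because the path already exists
rm -rf /root/pythonwork/test/output
```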

### 2.2 Create the test directories

1. Create the test file to upload to HDFS:
   1. `mkdir -p ~/pythonwork/project/data`
   2. `cp /usr/local/spark/README.md ~/pythonwork/project/data`
2. Start the Hadoop Multi Node Cluster:
   1. `start-all.sh`
3. Create the HDFS test directory and upload the file (a quick content check follows after this list):
   1. `hadoop fs -mkdir -p /user/root/data`
   2. `hadoop fs -copyFromLocal ~/pythonwork/project/data/README.md /user/root/data`
   3. `hadoop fs -ls /user/root/data/`
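To double-check the upload, the copy stored on HDFS can be read back (an optional check; paths follow the directories created above):

```bash
# Print the first lines of README.md as stored on HDFS
hadoop fs -cat /user/root/data/README.md | head -n 5
```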

### 2.3 Spark local mode

1. Common `spark-submit` options (a sample local-mode submission follows after the code below):
   - `--master MASTER_URL`
     - `local`
     - `local[N]`
     - `local[*]`
     - `spark://HOST:PORT`
     - `mesos://HOST:PORT`
     - `yarn`
   - `--driver-memory MEM`
   - `--executor-memory MEM`
   - `--name NAME`
2. Test: create `~/pythonwork/project/wordcount.py` (`vim ~/pythonwork/project/wordcount.py`) with the following content:

   ```python
   from pyspark import SparkContext
   from pyspark import SparkConf

   def CreateSparkContext():
       sparkConf = SparkConf() \
           .setAppName("WordCounts") \
           .set("spark.ui.showConsoleProgress", "false")
       sc = SparkContext(conf=sparkConf)
       print("master=" + sc.master)
       SetLogger(sc)
       SetPath(sc)
       return sc

   def SetLogger(sc):
       logger = sc._jvm.org.apache.log4j
       logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
       logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
       logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

   def SetPath(sc):
       # Local path in local mode, HDFS path when running on a cluster
       global Path
       if sc.master[0:5] == "local":
           Path = "/root/pythonwork/project/data/"
       else:
           Path = "hdfs://master:9000/user/root/data/"

   if __name__ == "__main__":
       print("Starting WordCount")
       sc = CreateSparkContext()
       print("Reading text file...")
       textFile = sc.textFile(Path + "README.md")
       print("The file has " + str(textFile.count()) + " lines")
       countsRDD = textFile.flatMap(lambda line: line.split(" ")) \
                           .map(lambda word: (word, 1)) \
                           .reduceByKey(lambda x, y: x + y)
       print("There are " + str(countsRDD.count()) + " word entries")
       print("Saving to text file...")
       countsRDD.saveAsTextFile("/root/pythonwork/project/data/output")
   ```
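With the script in place, a local-mode run only needs `spark-submit` with a `local` master (a sketch using the options listed above; `local[4]` and the memory setting are just example values):

```bash
cd ~/pythonwork/project/
# Run on the local machine with 4 worker threads
spark-submit --master local[4] --name WordCounts --driver-memory 512m wordcount.py
```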
### 2.4 Spark YARN-client mode

```bash
cd ~/pythonwork/project/
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop spark-submit --driver-memory 512m --executor-cores 1 --master yarn --deploy-mode client wordcount.py
```
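If the submission reaches the cluster, the job should appear in YARN's application list (an optional check, not part of the original steps):

```bash
# List applications known to the YARN ResourceManager
yarn application -list
```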

### 2.5 Spark Standalone Cluster mode

1. `cd ~/pythonwork/project/`
2. `/usr/local/spark/sbin/start-all.sh`
3. `spark-submit --master spark://master:7077 --deploy-mode client --executor-memory 512m --total-executor-cores 2 wordcount.py`
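When the test is finished, the standalone master and workers started above can be shut down again (optional cleanup):

```bash
# Stop the Spark standalone daemons started by start-all.sh
/usr/local/spark/sbin/stop-all.sh
```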