Method 1 (recommended)

  1. Jupyter configuration

Run `vi .jupyter/jupyter_notebook_config.py` and change the settings below. The username must be `pqchen`, because it is tied to the account used for cluster authentication. Set the IP address to the IP of the machine on which you run the script. If the config file does not exist yet, it can be generated with `jupyter notebook --generate-config`.

```python
## The IP address the notebook server will listen on.
c.NotebookApp.ip = '192.168.203.43'

## Username for the Session. Default is your system username.
c.Session.username = u'pqchen'
```
  2. Configure environment variables

vim ~/.bashrc

```shell
#export SPARK_HOME=/home/temp/spark-2.2.0-bin-hadoop2.7
#export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3.6  # use Python 3.6
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/jupyter
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --allow-root"
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
```
  3. Apply the changes: `source ~/.bashrc`
  4. Start `pyspark`, for example:

```shell
pyspark --executor-memory=10G \
  --executor-cores=5 \
  --driver-memory=1G \
  --conf spark.dynamicAllocation.maxExecutors=5
```
  5. Note: with the settings above, `spark-submit` can no longer be used directly. Instead, launch jobs through a wrapper script that overrides the variables:

```shell
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3
export PATH=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/:${PATH}
export PYSPARK_DRIVER_PYTHON=python

spark-submit xxx.py
```
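
The notes refer to the submitted script only as `xxx.py`, so its contents are not part of the original guide. As a minimal sketch, assuming a Monte Carlo Pi job like the verification example in step 6, such a script has to create its own SparkSession, because there is no notebook-provided `sc` under `spark-submit`:

```python
# Hypothetical contents for the placeholder script xxx.py (an assumption, not
# part of the original guide): a spark-submit job builds its own session.
import random

from pyspark.sql import SparkSession

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PiSubmit").getOrCreate()
    sc = spark.sparkContext

    num_samples = 100000
    count = sc.parallelize(range(0, num_samples)).filter(inside).count()
    print(4.0 * count / num_samples)

    spark.stop()
```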

  6. Run a short PySpark program to verify that the environment works:
```python
import random

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# `sc` is the SparkContext created automatically for the pyspark session
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
```

Method 2 (install a library)

  1. Install findspark: `pip install findspark --user`
  2. Open a Jupyter notebook.
  3. Add the following at the top of the notebook:

```python
import findspark
findspark.init()
```
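
If `SPARK_HOME` is not exported in the environment, `findspark.init()` can also be given the Spark installation path directly. A minimal sketch, reusing (purely as an illustration) the Spark path that is commented out in the `.bashrc` above:

```python
import findspark

# Assumed path, taken from the commented-out SPARK_HOME in ~/.bashrc above;
# replace it with the actual Spark installation directory on your machine.
findspark.init("/home/temp/spark-2.2.0-bin-hadoop2.7")

import pyspark  # pyspark only becomes importable after findspark.init() runs
```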

Test:

```python
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```