Method 1 (recommended)
- Jupyter configuration
Edit the Jupyter configuration file with vi .jupyter/jupyter_notebook_config.py and change the settings below. The username must be pqchen, because it is tied to the account used for cluster authentication. Set the IP address to that of the machine on which you run the script. (If the file does not exist yet, jupyter notebook --generate-config will create it.)
```python
## The IP address the notebook server will listen on.
c.NotebookApp.ip = '192.168.203.43'

## Username for the Session. Default is your system username.
c.Session.username = u'pqchen'
```
- Configure the environment variables
vim ~/.bashrc
```shell
#export SPARK_HOME=/home/temp/spark-2.2.0-bin-hadoop2.7
#export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3.6   # use Python 3.6
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/jupyter
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --allow-root"
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
```
- Apply the changes: source ~/.bashrc
- Then simply start pyspark:
```shell
pyspark --executor-memory=10G \
        --executor-cores=5 \
        --driver-memory=1G \
        --conf spark.dynamicAllocation.maxExecutors=5
```
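Once the notebook session is up, you can sanity-check that the driver and executors actually picked up the Python configured above. This is a minimal sketch, assuming the sc SparkContext that pyspark creates automatically for the session:

```python
import sys

# Python running the driver (the notebook kernel started via PYSPARK_DRIVER_PYTHON)
print(sys.executable, sys.version)

# Python running the executors (should match PYSPARK_PYTHON)
print(sc.parallelize([0]).map(lambda _: sys.version).first())
```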
Note: with the settings above, spark-submit can no longer be used directly (the driver would try to launch Jupyter instead of running the script). You can work around this by starting jobs through a wrapper script that overrides the variables:
```shell
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3
export PATH=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/:${PATH}
export PYSPARK_DRIVER_PYTHON=python

spark-submit xxx.py
```
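For reference, here is a minimal sketch of what such a submitted script might look like; xxx.py is only a placeholder name, and the job below is a hypothetical example rather than part of the original setup:

```python
# xxx.py - hypothetical minimal PySpark job for spark-submit
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("SubmitCheck").getOrCreate()
    sc = spark.sparkContext

    # Trivial distributed computation just to confirm the job runs on the cluster
    total = sc.parallelize(range(100)).sum()
    print("sum of 0..99 =", total)

    spark.stop()
```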
- You can run a short pyspark program to verify that the environment works:
```python
import random

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
```
Method 2 (install a library)
- Install findspark: pip install findspark --user
- Open Jupyter Notebook
- Add the following:
```python
import findspark
findspark.init()
```
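If SPARK_HOME is not set in your environment, findspark.init() can also be given the Spark installation directory explicitly. A small sketch, reusing the SPARK_HOME path that appears (commented out) in the .bashrc snippet above; substitute your own installation path:

```python
import findspark

# Path is an example taken from the commented-out SPARK_HOME above; adjust as needed
findspark.init("/home/temp/spark-2.2.0-bin-hadoop2.7")
```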
- Test:
```python
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```
