Method 1 (recommended)
- Jupyter configuration
Open .jupyter/jupyter_notebook_config.py with vi and change the settings below. The username must be pqchen, because it is tied to the account authentication used by the cluster. Set the IP address to the IP of the machine on which you run the script.
## The IP address the notebook server will listen on.
c.NotebookApp.ip = '192.168.203.43'
## Username for the Session. Default is your system username.
c.Session.username = u'pqchen'
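If the server runs on a headless machine, it is usually also worth disabling the local browser and pinning the port. These two extra settings are an assumption added here for completeness, not part of the original configuration:
## Do not try to open a browser on the (headless) server. Assumed setting, not in the original notes.
c.NotebookApp.open_browser = False
## The port the notebook server will listen on. Assumed setting; 8888 is the Jupyter default.
c.NotebookApp.port = 8888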
- Configure environment variables
vim ~/.bashrc
#export SPARK_HOME=/home/temp/spark-2.2.0-bin-hadoop2.7
#export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3.6 # use Python 3.6
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/jupyter
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --allow-root"
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
- Apply the changes: source ~/.bashrc
- Start pyspark:
pyspark --executor-memory=10G \
--executor-cores=5 \
--driver-memory=1G \
--conf spark.dynamicAllocation.maxExecutors=5
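This launches the Jupyter notebook server instead of the interactive shell, and the pyspark startup script pre-creates the SparkContext, which is why the verification code further down uses sc directly. As a quick sanity check in the first cell (a minimal sketch; sc, and on Spark 2.x also spark, are created by the pyspark startup script, not defined here):
```python
# sc is created by the pyspark startup script when the notebook kernel starts.
print(sc.version)  # Spark version of the driver
print(sc.master)   # cluster manager the context is connected to, e.g. yarn
```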
Note: the settings above break spark-submit. To keep using spark-submit, launch it from a script that overrides the variables:
```shell
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/python3
export PATH=/opt/cloudera/parcels/Anaconda-5.2.0/envs/py3.6/bin/:${PATH}
export PYSPARK_DRIVER_PYTHON=python

spark-submit xxx.py
```
- Run a short PySpark program to verify that the environment works:
```python
import random

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# sc is the SparkContext created by the pyspark startup script.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
```
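With 100 million samples the work is distributed across the executors rather than run in a local loop, and the printed Monte Carlo estimate should come out close to 3.14, confirming the cluster side is healthy.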
Method 2 (install a library)
- Install: pip install findspark --user
- Open jupyter notebook
- Add the following at the top of the notebook (if SPARK_HOME is not set in your environment, see the note after the test code below):
```python
import findspark

findspark.init()
```
- Test with the same Pi estimation, this time creating the SparkContext explicitly:
```python
import random

import pyspark

# findspark.init() has already put pyspark on sys.path, so the import works.
sc = pyspark.SparkContext(appName="Pi")

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

sc.stop()
```
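If SPARK_HOME is not exported, findspark.init() cannot locate the Spark installation and raises an error. A minimal sketch of the workaround, assuming Spark lives at the path from the commented-out SPARK_HOME line in .bashrc above (adjust to your installation):
```python
import findspark

# Assumption: Spark is installed at the path from the commented-out SPARK_HOME above.
findspark.init("/home/temp/spark-2.2.0-bin-hadoop2.7")
```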