1. Install Anaconda
- Download Anaconda
```bash
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
```
- Install
```bash
bash Anaconda3-2021.11-Linux-x86_64.sh -b
```
Follow the installer prompts; by default it installs under /root.
- Configure environment variables
```bash
vim ~/.bashrc
```
```bash
# Anaconda3
export ANACONDA_PATH=/root/anaconda3
export PATH=$PATH:$ANACONDA_PATH/bin

# Use Anaconda3's interpreters for Spark
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python
```
Reload the shell configuration:
```bash
source ~/.bashrc
```
Activate or deactivate an environment:
```bash
source activate
conda deactivate
```
- Test
```bash
python --version
conda -V
conda create --name jupyter python=3.7.2
conda activate jupyter
```
2. Using Spark from Jupyter Notebook
- Create a working directory
```bash
mkdir -p ~/pythonwork/jupyternotebook
cd ~/pythonwork/jupyternotebook
```
- Running pyspark from the Jupyter Notebook interface
- Run pyspark locally, then open localhost:8888/tree to reach Jupyter Notebook:
```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```
Error message:
```
Running as root is not recommended. Use --allow-root to bypass
```
- Fix: generate the Jupyter config file, edit it to allow running as root, then rerun pyspark and open localhost:8888/tree:
```bash
# Generate the config file (its path is printed)
jupyter notebook --generate-config --allow-root
# Edit the config file
vim /root/.jupyter/jupyter_notebook_config.py
```
```python
c.NotebookApp.open_browser = False
c.NotebookApp.allow_root = True
c.NotebookApp.allow_origin = "*"
c.NotebookApp.ip = "*"
```
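Since `c.NotebookApp.ip = "*"` exposes the notebook on all network interfaces, a password is a common companion setting. This is an optional addition, not part of the original setup; the hash below is a placeholder you must replace with your own:

```python
# Optional hardening (assumption, not in the original guide):
# generate a hash in a Python shell with
#   from notebook.auth import passwd; passwd()
# and paste the result below.
c.NotebookApp.password = 'sha1:...'  # placeholder: your generated hash
```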
- Test
After opening Jupyter Notebook, create a new notebook and run `sc.master`; the result `'local[*]'` means Spark is running locally, using all available cores:
```python
sc.master
# Out[1]: 'local[*]'
```
Shift+Enter runs the current cell and jumps to the next one; Ctrl+Enter runs it and stays in the current cell.
Test a local file:
```python
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
# 109
```
Test an HDFS file:
```python
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
# 270
```
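The HDFS path above (`/user/wordcount/input`) hints at the classic word-count exercise, which in PySpark is typically written as `textFile.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`. The pure-Python sketch below mirrors what that pipeline computes, so the logic can be checked without a cluster (the sample lines are made up for illustration):

```python
from collections import Counter

# Stand-in for textFile: a couple of sample lines (hypothetical data)
lines = ["to be or not to be", "be here now"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: count occurrences of each word
counts = Counter(words)

print(len(words))     # 9  -- analogous to counting all words
print(counts["be"])   # 3
```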
3. Running Jupyter Notebook in Hadoop YARN-client mode
- Run
```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
```
- Test
```python
sc.master
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
```
4. Running Jupyter Notebook in Spark standalone mode
- Run
```bash
/usr/local/spark/sbin/start-all.sh
cd ~/pythonwork/jupyternotebook
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 2 --executor-memory 512m
```
- Test
```python
sc.master
# 'spark://master:7077'
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
# 109
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
# 270
```
- Visit http://master:8080 (the Spark master web UI) to view the current Spark processes.
