## 1. Install Anaconda
- Download Anaconda
```bash
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
```
- Install
```bash
bash Anaconda3-2021.11-Linux-x86_64.sh -b
```
Follow the installer prompts; by default it installs under /root.
- Configure environment variables
```bash
vim ~/.bashrc
```
```bash
# Anaconda3
export ANACONDA_PATH=/root/anaconda3
export PATH=$PATH:$ANACONDA_PATH/bin
# Anaconda3 for Spark
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python
```
```bash
source ~/.bashrc
```
Activate / deactivate an environment:
```bash
source activate
conda deactivate
```
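The two `PYSPARK_*` exports above choose which interpreters pyspark uses: per Spark's documentation, the driver falls back to `PYSPARK_PYTHON` when `PYSPARK_DRIVER_PYTHON` is unset. A minimal sketch of that selection order (`pick_driver_python` is a hypothetical helper, not part of Spark):

```python
import os

# Sketch of how the pyspark launcher picks the driver interpreter:
# PYSPARK_DRIVER_PYTHON wins, then PYSPARK_PYTHON, then plain "python".
def pick_driver_python(env=os.environ):
    return (env.get("PYSPARK_DRIVER_PYTHON")
            or env.get("PYSPARK_PYTHON")
            or "python")

print(pick_driver_python({"PYSPARK_PYTHON": "/root/anaconda3/bin/python"}))
# → /root/anaconda3/bin/python
```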
- Test
```bash
python --version
conda -V
conda create --name jupyter python=3.7.2
conda activate jupyter
```
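To confirm the new env really picked up the pinned interpreter, a quick in-Python check (the 3.7.2 expectation comes from the `python=3.7.2` spec above):

```python
import sys

# Inside the "jupyter" env created above this should print 3.7.2;
# elsewhere it prints whatever interpreter is currently active.
print("{}.{}.{}".format(*sys.version_info[:3]))
```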
## 2. Using Spark in Jupyter Notebook
- Create a working directory
```bash
mkdir -p ~/pythonwork/jupyternotebook
cd ~/pythonwork/jupyternotebook
```
- Run pyspark from the Jupyter Notebook UI
- Run pyspark locally, then open localhost:8888/tree to access Jupyter Notebook
```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```
Error message:
```
Running as root is not recommended. Use --allow-root to bypass
```
- Fix: generate a Jupyter config and allow running as root, then rerun pyspark and open localhost:8888/tree
```bash
# print the config file path (generating the file if missing)
jupyter notebook --generate-config --allow-root
# edit the config file
vim /root/.jupyter/jupyter_notebook_config.py
```
```python
c.NotebookApp.open_browser = False
c.NotebookApp.allow_root = True
c.NotebookApp.allow_origin = "*"
c.NotebookApp.ip = "*"
```
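With the server now listening on all interfaces, a small reachability probe can confirm it is up before opening the browser. The URL and the `is_up` helper below are assumptions for illustration, not part of Jupyter:

```python
import urllib.request

# Hypothetical helper: True if an HTTP server answers at `url`
# (any HTTP response counts, including the Jupyter login redirect).
def is_up(url="http://localhost:8888/tree", timeout=2):
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except Exception:
        return False
```

Once `jupyter notebook` is running, `is_up()` should return True.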
- Test
After opening Jupyter Notebook, create a new notebook and enter `sc.master` in `In [1]`; Run returns `Out [1]: 'local[*]'`.
  - Shift+Enter runs the cell and jumps to the next one
  - Ctrl+Enter runs the cell and stays in it
Test a local file:
```python
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
# 109
```
Test an HDFS file:
```python
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
# 270
```
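`sc.textFile` produces one record per line, so `count()` is a distributed line count. A plain-Python sketch of the same semantics (no Spark needed; `line_count` and the sample text are mine):

```python
# Sketch: what sc.textFile(...).count() computes, in plain Python.
def line_count(text):
    # one record per line, like sc.textFile; blank lines count too
    return len(text.splitlines())

sample = "Apache Spark\n\nSpark is a unified analytics engine\n"
print(line_count(sample))  # → 3
```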
## 3. Running Jupyter Notebook in Hadoop YARN-client Mode
- Run
```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
```
- Test
```python
sc.master
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
```
## 4. Running Jupyter Notebook in Spark Standalone Mode
- Run
```bash
/usr/local/spark/sbin/start-all.sh
cd ~/pythonwork/jupyternotebook
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 2 --executor-memory 512m
```
- Test
```python
sc.master
# 'spark://master:7077'
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
# 109
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
# 270
```
- Visit http://master:8080 to view the current Spark processes.
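Across the three modes, `sc.master` is what tells them apart: `'local[*]'`, `'yarn'`, and `'spark://master:7077'`. A small sketch classifying the master URL (`deploy_mode` is a hypothetical helper, not a Spark API):

```python
# Hypothetical classifier for the sc.master values seen above.
def deploy_mode(master):
    if master.startswith("local"):
        return "local"        # e.g. 'local[*]'
    if master == "yarn":
        return "yarn"         # YARN client/cluster mode
    if master.startswith("spark://"):
        return "standalone"   # e.g. 'spark://master:7077'
    return "other"

print(deploy_mode("spark://master:7077"))  # → standalone
```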