1. Install Anaconda

  1. Download Anaconda
    wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
  2. Install
    bash Anaconda3-2021.11-Linux-x86_64.sh -b
    The -b flag runs a silent (batch-mode) install; by default it installs to /root/anaconda3
  3. Configure environment variables

```bash
vim ~/.bashrc

# Anaconda3
export ANACONDA_PATH=/root/anaconda3
export PATH=$PATH:$ANACONDA_PATH/bin

# Anaconda3 for Spark
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python

source ~/.bashrc

# Activate / deactivate an environment
source activate
conda deactivate
```

  4. Test

```bash
python --version
conda -V
conda create --name jupyter python=3.7.2
conda activate jupyter
```
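With the `jupyter` environment active, it can help to confirm which interpreter PySpark and the notebook will actually use. A minimal check (the `/root/anaconda3` prefix is the default from the install step above and is an assumption here):

```python
# Print the interpreter version and full path; with the "jupyter" env
# active, sys.executable should point under the Anaconda install, e.g.
# /root/anaconda3/envs/jupyter/bin/python (path assumed from the steps above).
import sys

print(sys.version.split()[0])  # e.g. "3.7.2"
print(sys.executable)
```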

2. Using Spark from Jupyter Notebook

  1. Create a working directory
    mkdir -p ~/pythonwork/jupyternotebook
    cd ~/pythonwork/jupyternotebook
  2. Run pyspark from the Jupyter Notebook UI
    • Run pyspark locally, then visit localhost:8888/tree to open Jupyter Notebook
      PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

      Error message: Running as root is not recommended. Use --allow-root to bypass

```bash
# Generate the config file (the command prints its path)
jupyter notebook --generate-config --allow-root
# Edit the config file and add the settings below
vim /root/.jupyter/jupyter_notebook_config.py
```

```python
c.NotebookApp.open_browser = False
c.NotebookApp.allow_root = True
c.NotebookApp.allow_origin = "*"
c.NotebookApp.ip = "*"
```
  3. Test
    Open Jupyter Notebook, create a new notebook, and enter In[1]: sc.master; running it returns Out[1]: 'local[*]'

    Shift+Enter runs the current cell and moves to the next one

    Ctrl+Enter runs the current cell and stays in it

  • Test a local file

```python
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
# 109
```

  • Test an HDFS file

```python
textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
# 270
```
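The `count()` calls above return the number of lines, since `sc.textFile` produces one RDD element per line. The HDFS path also hints at the classic word-count job; as a plain-Python sketch of what the usual `flatMap` → `map` → `reduceByKey` pipeline computes (no cluster needed, sample lines invented for illustration):

```python
from collections import defaultdict

# Plain-Python equivalent of the RDD chain:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
lines = ["to be or not to be", "to see or not to see"]

counts = defaultdict(int)
for line in lines:                # textFile: one element per line
    for word in line.split():     # flatMap: split each line into words
        counts[word] += 1         # map + reduceByKey: sum the 1s per word

print(counts["to"])  # 4
print(counts["be"])  # 2
```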

3. Running Jupyter Notebook in Hadoop YARN-client Mode

  1. Run
    PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
  2. Test

```python
sc.master

textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()

textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
```
4. Running Jupyter Notebook in Spark Standalone Mode

  1. Run

```bash
/usr/local/spark/sbin/start-all.sh
cd ~/pythonwork/jupyternotebook
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 2 --executor-memory 512m
```

  2. Test

```python
sc.master
# 'spark://master:7077'

textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
# 109

textFile = sc.textFile("hdfs://master:9000/user/wordcount/input/LICENSE.txt")
textFile.count()
# 270
```

  • Visit http://master:8080 to view the running Spark processes
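Alongside the HTML page, the standalone master also exposes its status as JSON, typically at http://master:8080/json; the exact field names used below are an assumption and worth checking against your Spark version. A small sketch that summarizes worker capacity from such a payload, demonstrated on a sample response:

```python
import json

def summarize(master_json):
    """Summarize worker count and total cores from the master's /json payload."""
    status = json.loads(master_json)
    workers = status.get("workers", [])      # field name assumed
    cores = sum(w.get("cores", 0) for w in workers)
    return f"{len(workers)} worker(s), {cores} core(s) total"

# A sample payload in the shape the master is assumed to return:
sample = '{"workers": [{"cores": 2}, {"cores": 2}]}'
print(summarize(sample))  # 2 worker(s), 4 core(s) total

# Against a live cluster you would fetch the payload instead, e.g.:
#   import urllib.request
#   body = urllib.request.urlopen("http://master:8080/json").read()
#   print(summarize(body))
```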