Reference:

  1. Refer to and run the docker-compose.yml shown below

Environment inside the image (the image ships with PySpark):

  1. Image: singularities/spark, version 2.2
  2. Hadoop version: 2.8.2
  3. Spark version: 2.2.1
  4. Scala version: 2.11.8
  5. Java version: 1.8.0_151

```yaml
version: "2"
services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "8888:8888"
      - "50070:50070"
    stdin_open: true
    tty: true
    volumes:
      - /home/pqchen/code/docker/pyspark:/home/myspark
    working_dir: /home/myspark
    security_opt:
      - seccomp=unconfined
    cap_add:
      - SYS_PTRACE
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
```
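
A minimal sketch of bringing the cluster up and getting a shell in the master container, assuming the file above is saved as docker-compose.yml in the current directory:

```bash
docker-compose up -d              # start the master and worker services in the background
docker-compose ps                 # verify both containers are running
docker-compose exec master bash   # open a shell in the master container for the next steps
```
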
  2. Enter the container and install pip and jupyter:

```bash
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py  # download the pip install script
whereis python                                           # check which Python versions are installed
```

pip gets bound to whichever Python version runs the install script; for Python 3.5, run:

```bash
python3.5 get-pip.py  # run the install script
```
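
The step above also mentions Jupyter; the original commands stop at pip, but once pip is associated with Python 3.5 the notebook install is presumably just:

```bash
pip install jupyter   # install Jupyter Notebook for the pip-associated Python
```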

  3. Handle Jupyter's own configuration (omitted)
  4. Edit the .bashrc file and add the following environment variables:

```bash
export PYSPARK_PYTHON=python3.5  # use Python 3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --allow-root"
```
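
For the new variables to take effect in the current shell, reload the file (a standard step, assuming a bash shell):

```bash
source ~/.bashrc   # apply the new PYSPARK_* variables to the current shell
```
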
  5. Run pyspark; with the variables above it launches a Jupyter Notebook server (reachable from the host through the mapped port 8888) instead of the plain shell
  6. Result:

```python
df = spark.createDataFrame(
    [('1', 'Joe', '70000', '1'), ('2', 'Henry', '80000', None)],
    ['Id', 'Name', 'Sallary', 'DepartmentId'])
df.printSchema()

df.show()
```