References:
- https://blog.csdn.net/wuchenlhy/article/details/103310954
- https://blog.csdn.net/weixin_41867777/article/details/80401640?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
- Refer to, and run, the docker-compose.yml below.
Environment inside the image (the image ships with PySpark), for singularities/spark:2.2:
- Hadoop: 2.8.2
- Spark: 2.2.1
- Scala: 2.11.8
- Java: 1.8.0_151
```yaml
version: "2"
services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "8888:8888"
      - "50070:50070"
    stdin_open: true
    tty: true
    volumes:
      - /home/pqchen/code/docker/pyspark:/home/myspark
    working_dir: /home/myspark
    security_opt:
      - seccomp=unconfined
    cap_add:
      - SYS_PTRACE
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
```
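A minimal sketch of bringing the cluster up and getting a shell in the master container. The container name below is an assumption (Compose derives it from the project directory); check `docker ps` for the real one.

```bash
docker-compose up -d                   # start master and worker in the background
docker-compose scale worker=2          # optional: run two workers (compose-v2-era syntax)
docker exec -it pyspark_master_1 bash  # assumed container name; verify with `docker ps`
```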
- Enter the container and install pip and Jupyter:
```bash
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py  # download the pip install script
whereis python        # check which Python versions are installed
# pip gets bound to whichever Python runs the install script; for Python 3.5:
python3.5 get-pip.py  # run the install script
pip install jupyter   # then install Jupyter
```
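To confirm that pip and Jupyter ended up under the intended interpreter, a quick check (a sketch; the exact version strings will differ):

```bash
python3.5 -m pip --version  # pip should report it is bound to Python 3.5
jupyter --version           # prints the installed Jupyter version
```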
- Configure Jupyter itself (omitted here).
- Edit the `.bashrc` file and add:
```bash
export PYSPARK_PYTHON=python3.5  # use Python 3 for PySpark
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --allow-root"
```
- Then just run `pyspark`.
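With these variables in place, `pyspark` starts the driver as a Jupyter Notebook server instead of a plain shell. A sketch of the launch (the URL assumes you run this inside the master container, whose port 8888 is mapped to the host in the compose file above):

```bash
source ~/.bashrc  # reload the new environment variables
pyspark           # starts Jupyter Notebook; the access token is printed to the console
# then open http://localhost:8888 on the host
```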
- Result:
```python
df = spark.createDataFrame(
    [('1', 'Joe', '70000', '1'), ('2', 'Henry', '80000', None)],
    ['Id', 'Name', 'Sallary', 'DepartmentId'])
df.printSchema()
df.show()
```
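Beyond the original snippet, a small follow-up query on the same DataFrame can confirm the cluster actually executes jobs. This is a sketch: inside the pyspark-launched notebook the `spark` session already exists, so the explicit builder is only needed in a standalone script.

```python
from pyspark.sql import SparkSession

# Inside the pyspark notebook `spark` is predefined; in a standalone
# script you create the session yourself.
spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [('1', 'Joe', '70000', '1'), ('2', 'Henry', '80000', None)],
    ['Id', 'Name', 'Sallary', 'DepartmentId'])

# Rows with a null DepartmentId are dropped by the equality filter.
df.filter(df.DepartmentId == '1').select('Name', 'Sallary').show()
```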