Env(3) Spark Deployment and Installation
Reference: 黑马09-spark部署文档.doc, adapted to newer software versions.
1 Spark Local Mode Setup
Local mode uses multiple threads on a single machine to simulate the roles of a Spark cluster.
1.1 Download the installation package
Official site: Apache Spark™ - Unified Engine for large-scale data analytics. As of 2025-02-02, the latest release is Spark 3.5.4. Download the tgz package on the host machine.
1.2 Transfer and extract
Transfer it to node1 via scp:
scp .\spark-3.5.4-bin-hadoop3.tgz user@192.168.233.129:~/Work/2-2025
On node1, extract the archive to a working directory and give it a shorter name (via a symlink):
# assume the extraction path is ~/Work/2025
tar -zxf spark-3.5.4-bin-hadoop3.tgz -C ~/Work/2025
# create a symlink (point at the extracted directory, not the .tgz)
ln -s ~/Work/2025/spark-3.5.4-bin-hadoop3 ~/Work/2025/spark
1.3 Test
Spark's local mode works out of the box:
$ cd ~/Work/2025/spark
$ ./bin/spark-shell
25/02/02 20:52:13 WARN Utils: Your hostname, ubuntu-node1 resolves to a loopback address: 127.0.1.1; using 192.168.233.129 instead (on interface ens33)
25/02/02 20:52:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/02 20:52:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = local[*], app id = local-1738500747924).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.4
/_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.13)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
The Web UI is available at (http://192.168.233.129:4040/).
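Beyond the interactive shell, local mode can also be smoke-tested non-interactively (a minimal sketch, reusing the pi example script shipped in the tarball's examples/ directory):
cd ~/Work/2025/spark
# run the bundled pi estimator on all local cores
./bin/spark-submit --master "local[*]" examples/src/main/python/pi.py 10
The output should end with a line like "Pi is roughly 3.14...", the same check used for the YARN submission in Section 5.6.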
2 PySpark Environment Installation
2.1 Download the Anaconda installer
Install it on every node! As of 2025-02-02, a recent release bundles Python 3.12: Anaconda. Run the downloaded .sh file; the license agreement can be paged through quickly with Ctrl+F, then type yes and press Enter. Add the following configuration:
sudo vim /etc/profile
# add these two lines to /etc/profile:
export ANACONDA_HOME=/home/user/anaconda3
export PATH=$PATH:$ANACONDA_HOME/bin
# reload the environment variables
source /etc/profile
Then edit ~/.bashrc:
vim ~/.bashrc
and add the following:
export PATH=~/anaconda3/bin:$PATH
Close all terminal windows and reopen one. Check the Python version:
user@ubuntu-node1:~$ python -V
Python 3.12.7
Activate or deactivate the Anaconda environment:
conda activate     # activate
conda deactivate   # deactivate
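A quick check that the shell now resolves python from Anaconda (the path assumes the default install location used above):
conda activate
which python     # expected: /home/user/anaconda3/bin/python
conda --version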
(Optional) conda init makes the Anaconda base environment auto-activate at shell startup; to turn the auto-activation off:
vim ~/.bashrc
# append conda deactivate at the end
Pitfall: do NOT uninstall Python with sudo apt remove python3! It removes a large set of dependent packages and the Ubuntu desktop environment disappears. Remedy: if a MobaXterm session is still connected, run:
sudo apt install ubuntu-minimal ubuntu-desktop
sudo apt install open-vm-tools open-vm-tools-desktop   # restore window auto-resizing
3 Spark Standalone Cluster Environment
3.1 Modify the configuration files
3.1.1 workers
cd spark/conf
cp workers.template workers
sudo vim workers
Change localhost to:
node1
node2
node3
3.1.2 spark-env.sh
cd spark/conf
cp spark-env.sh.template spark-env.sh
sudo vim spark-env.sh
Append at the end:
JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
HADOOP_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop/
YARN_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop/
export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"
3.1.3 Configure Spark application logging
Edit spark/conf/spark-defaults.conf (copy it from spark-defaults.conf.template first) and add:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:8020/sparklog/
spark.eventLog.compress true
The original document also changes the log4j log level, but newer Spark versions use log4j2 and the equivalent setting was not found, so it is left unset.
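Both spark.eventLog.dir and the SPARK_HISTORY_OPTS setting in 3.1.2 point to hdfs://node1:8020/sparklog/. That directory is not created automatically; assuming HDFS is already running, create it once before starting the HistoryServer:
hadoop fs -mkdir -p /sparklog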
3.2 Distribute to the other machines
# run on node1
scp -r spark-3.5.4-bin-hadoop3/ node2:$PWD
scp -r spark-3.5.4-bin-hadoop3/ node3:$PWD
# run on node2 and node3
# create the symlink
ln -s ~/Work/2025/spark-3.5.4-bin-hadoop3/ ~/Work/2025/spark
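An optional sanity check that the distribution and the symlinks are in place on the other nodes:
ssh node2 "ls ~/Work/2025/spark/bin/spark-submit"
ssh node3 "ls ~/Work/2025/spark/bin/spark-submit"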
3.3 Start Spark Standalone
Startup method 1
Start Spark on the master node:
./spark/sbin/start-all.sh
# ./spark/sbin/stop-all.sh   # stop
Start the HistoryServer:
./spark/sbin/start-history-server.sh
Startup method 2
Start the master and then the workers on the master node:
./spark/sbin/start-master.sh
./spark/sbin/start-workers.sh
Web UI
HistoryServer
In practice, the Spark part above starts fine without Hadoop, but the HistoryServer requires Hadoop (HDFS) to be running first.
./spark/sbin/start-history-server.sh
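To see which daemons are actually running on each node, jps (shipped with the JDK) is enough:
jps
# node1 should list Master, Worker and HistoryServer; node2/node3 should list Worker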
3.4 Connect to the cluster
3.4.2 Connect with PySpark
Run on node1:
./bin/pyspark --master spark://node1:7077
For the test itself, see 黑马 - Spark, Section 3.3.
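As an alternative to the interactive shell, a batch job can be submitted against the standalone master (a sketch, reusing the pi example bundled with Spark):
cd ~/Work/2025/spark
./bin/spark-submit --master spark://node1:7077 examples/src/main/python/pi.py 10
If the driver and worker Python versions differ, this fails with the error described below.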
Pitfall: mismatched Python versions between driver and worker
25/02/03 16:27:31 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (192.168.233.131 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/user/Work/2025/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1100, in main
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [PYTHON_VERSION_MISMATCH] Python in worker has different version (3, 10) than that in driver 3.12, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Solution:
vim ~/.bashrc
# append at the end:
export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
# source it in the terminal before running pyspark again
source ~/.bashrc
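A quick check that the variable is visible in new shells:
echo $PYSPARK_PYTHON    # expected: /home/user/anaconda3/bin/python
If workers still pick up the wrong interpreter, the same export may also need to be added to ~/.bashrc on node2 and node3 (an assumption, not verified here).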
4 Spark Standalone HA Mode Installation
Edit the Spark environment file:
sudo vim ./spark/conf/spark-env.sh
# comment out or delete the MASTER_HOST line:
# SPARK_MASTER_HOST=node1
# add the following configuration:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1:2181,node2:2181,node3:2181 -Dspark.deploy.zookeeper.dir=/spark-ha"
Distribute the change:
cd ./spark/conf/
scp -r spark-env.sh node2:$PWD
scp -r spark-env.sh node3:$PWD
Make sure ZooKeeper is already running on all three nodes:
./zookeeper/bin/zkServer.sh start
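Each node can report its role in the ensemble; one node should show leader and the other two follower:
./zookeeper/bin/zkServer.sh status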
Start the Spark cluster on node1:
~/Work/2025/spark/sbin/start-all.sh
Start an additional master on node2:
~/Work/2025/spark/sbin/start-master.sh
Check the result in the Web UI:
- node1 Master: (http://node1:8081), Status: ALIVE
- node2 Master: (http://node2:8082), Status: STANDBY
Note: in practice ZooKeeper occupies port 8080, so when Spark starts on node1 the Master UI falls back to port 8081. Meanwhile, the Worker Web UI occupies 8081 on node2, so when the master is started on node2 it falls back to 8082. The ports actually chosen can be seen in the Spark Master's startup log. Taking node1 as an example:
25/02/04 11:16:22 INFO JettyUtils: Start Jetty 0.0.0.0:8080 for MasterUI
25/02/04 11:16:23 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
25/02/04 11:16:23 INFO Utils: Successfully started service 'MasterUI' on port 8081.
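To exercise failover (a sketch): stop the active master on node1 and watch node2's UI; once the ZooKeeper session expires, node2 should switch from STANDBY to ALIVE and running applications reconnect to it.
# on node1, stop the ALIVE master
~/Work/2025/spark/sbin/stop-master.sh
# refresh http://node2:8082 and wait a short while for it to become ALIVE
# restart node1's master afterwards; it rejoins as STANDBY
~/Work/2025/spark/sbin/start-master.sh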
5 Spark on YARN Environment Setup
5.1 Modify spark-env.sh
vim ./spark/conf/spark-env.sh
and add the following:
HADOOP_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop
YARN_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop
If the Standalone environment was already configured (see Section 3.1.2), these variables are already set and you can continue directly.
5.2 Modify Hadoop's yarn-site.xml
vim ./hadoop/etc/hadoop/yarn-site.xml
and add inside <configuration>: (Note: compared with Section 3.5 of [[Env(2) Hadoop部署安装]], only the "set the YARN cluster memory allocation" part is new.)
<!-- location of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- set the YARN cluster memory allocation -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20480</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<!-- enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- retention time (seconds) of aggregated logs on HDFS -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
<!-- YARN history server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
</property>
<!-- disable YARN memory checks -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Sync to the other nodes:
cd ./hadoop/etc/hadoop
scp -r yarn-site.xml node2:$PWD
scp -r yarn-site.xml node3:$PWD
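Changes to yarn-site.xml take effect only after YARN restarts, so if YARN is already running, restart it with the standard Hadoop scripts:
~/Work/2025/hadoop/sbin/stop-yarn.sh
~/Work/2025/hadoop/sbin/start-yarn.sh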
5.3 Set the Spark history server address
vim ./spark/conf/spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:8020/sparklog/
# Note: the original 黑马 document uses "node1:9820" here, which is wrong; it should be 8020
spark.eventLog.compress true
spark.yarn.historyServer.address node1:18080
Sync to the other nodes (or wait until Section 5.4 is done and sync everything at once):
scp -r spark-defaults.conf node2:$PWD
scp -r spark-defaults.conf node3:$PWD
5.4 Configure the Spark dependency jars
When a Spark application is submitted to YARN, by default every submission uploads the Spark dependency jars to the YARN cluster. To save submission time and storage space, upload the Spark jars to an HDFS directory once and set a property so that Spark applications know where to find them.
hadoop fs -mkdir -p /spark/jars/
hadoop fs -put ./spark/jars/* /spark/jars/
vim spark-defaults.conf
and add:
spark.yarn.jars hdfs://node1:8020/spark/jars/*
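A quick check that the jars are actually on HDFS and match the path given in spark.yarn.jars:
hadoop fs -ls /spark/jars/ | head
hadoop fs -ls /spark/jars/ | wc -l    # roughly the number of files in ./spark/jars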
Then cd ./spark/conf and sync to the other nodes (see Section 5.3).
5.5 Start the services
On node1, start HDFS, YARN, the MapReduce HistoryServer and the Spark HistoryServer in turn:
cd ./hadoop/sbin
## start the HDFS and YARN services
./start-dfs.sh
./start-yarn.sh
# Note: in YARN mode there is no need to start Spark's Master and Workers
## start the MapReduce HistoryServer
mapred --daemon start historyserver
## start the Spark HistoryServer (the script lives under spark/sbin, not hadoop/sbin)
~/Work/2025/spark/sbin/start-history-server.sh
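After the four start commands, jps gives a quick overview of the running daemons:
jps
# node1 should at least show NameNode, ResourceManager, JobHistoryServer (MapReduce) and HistoryServer (Spark);
# DataNode/NodeManager placement follows the Hadoop setup in [[Env(2) Hadoop部署安装]]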
5.6 Submit a test job
./spark/bin/spark-submit \
--master yarn \
--conf "spark.pyspark.driver.python=/home/user/anaconda3/bin/python" \
--conf "spark.pyspark.python=/home/user/anaconda3/bin/python" \
~/Work/2025/spark/examples/src/main/python/pi.py \
10
Result:
25/02/04 23:04:25 INFO YarnScheduler: Killing all running tasks in stage 0: Stage finished
25/02/04 23:04:25 INFO DAGScheduler: Job 0 finished: reduce at /home/user/Work/2025/spark-3.5.4-bin-hadoop3/examples/src/main/python/pi.py:42, took 26.052944 s
Pi is roughly 3.140240
25/02/04 23:04:25 INFO SparkContext: SparkContext is stopping with exitCode 0.
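The finished job can also be checked from YARN and from the Spark HistoryServer configured in Section 5.3:
# list finished YARN applications; the pi job above should appear
yarn application -list -appStates FINISHED
# detailed event logs are served by the Spark HistoryServer at http://node1:18080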