Env(3) Spark Deployment and Installation

Reference: 黑马09-spark部署文档.doc. Newer software versions are used here.

1 Spark Local Mode Setup

Local mode uses multiple threads on a single machine to simulate the various roles of a Spark cluster.

1.1 Download the installation package

Official site: Apache Spark™ - Unified Engine for large-scale data analytics. As of 2025-02-02, the latest version is Spark 3.5.4. Download the .tgz package on the host machine.
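
Alternatively, the package can be fetched directly on node1 instead of on the host; a minimal sketch, assuming the Apache archive URL layout for this release:

# Download the prebuilt Spark package straight onto node1
# (the exact mirror URL is an assumption; adjust if it has moved)
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz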

1.2 Transfer and extract

Transfer it to node1 with scp:

scp .\spark-3.5.4-bin-hadoop3.tgz user@192.168.233.129:~/Work/2-2025

On node1, extract the archive somewhere and give it a shorter name:

# Assume the extraction path is ~/Work/2025
tar -zxf spark-3.5.4-bin-hadoop3.tgz -C ~/Work/2025
# Create a symlink to the extracted directory
cd ~/Work/2025
ln -s spark-3.5.4-bin-hadoop3 spark

1.3 Test

Spark's local mode works out of the box:

$ cd ~/Work/2025/spark
$ ./bin/spark-shell
25/02/02 20:52:13 WARN Utils: Your hostname, ubuntu-node1 resolves to a loopback address: 127.0.1.1; using 192.168.233.129 instead (on interface ens33)
25/02/02 20:52:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/02 20:52:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = local[*], app id = local-1738500747924).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.4
      /_/

Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

The Web UI can be accessed at (http://192.168.233.129:4040/).
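
Local mode can also be smoke-tested non-interactively with the bundled SparkPi example; a minimal sketch, assuming the spark symlink created above:

cd ~/Work/2025/spark
# Runs the SparkPi example on the default local[*] master and prints an estimate of Pi
./bin/run-example SparkPi 10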

2 PySpark Environment Installation

2.1 Download the Anaconda installer
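
A sketch of fetching the installer onto a node; the exact filename and version below are assumptions, so check the Anaconda archive for the current release:

# Download the Anaconda installer (the version in the filename is an assumption)
wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh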

Install this on all nodes! As of 2025-02-02, a recent Anaconda release ships Python 3.12: Anaconda. Run the downloaded .sh file; the license agreement it shows can be paged through quickly with Ctrl+F, then type yes and press Enter. Add the following configuration:

sudo vim /etc/profile
# append at the end:
export ANACONDA_HOME=/home/user/anaconda3
export PATH=$PATH:$ANACONDA_HOME/bin
# reload the environment variables
source /etc/profile

vim ~/.bashrc, and add the following:

export PATH=~/anaconda3/bin:$PATH

Close all terminal windows and open a new one. Check the Python version:

user@ubuntu-node1:~$ python -V
Python 3.12.7
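
To confirm that the interpreter resolving on PATH is the Anaconda one rather than the system Python, a quick check:

# Should print the Anaconda interpreter path configured above
which python
# expected: /home/user/anaconda3/bin/python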

Activate or deactivate the Anaconda environment:

conda activate    # activate
conda deactivate  # deactivate

(Optional) You can use conda init so that the Anaconda environment activates automatically at startup; to turn the auto-activation off, do the following:

vim ~/.bashrc
# add at the end:
conda deactivate

Pitfall: do not uninstall Python with sudo apt remove python3! It removes a large pile of dependent packages, after which the Ubuntu GUI is gone. Remedy: if a MobaXterm session is still connected, run:

sudo apt install ubuntu-minimal ubuntu-desktop
sudo apt install open-vm-tools open-vm-tools-desktop # restore window auto-resize

3 Spark Standalone Cluster Environment

3.1 Modify the configuration files

3.1.1 workers

cd spark/conf
cp workers.template  workers
sudo vim workers

Change localhost to:

node1
node2
node3

3.1.2 spark-env.sh

cd spark/conf
cp spark-env.sh.template spark-env.sh
sudo vim spark-env.sh

Append at the end:

JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
HADOOP_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop/
YARN_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop/

export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077

SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081

SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"

3.1.3 Configure Spark application event logs

These settings go in spark/conf/spark-defaults.conf (cp spark-defaults.conf.template spark-defaults.conf), then add:

spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://node1:8020/sparklog/
spark.eventLog.compress true
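
The event-log directory referenced above must exist on HDFS before applications or the HistoryServer try to use it; a minimal sketch, assuming HDFS is already running:

# Create the Spark event-log directory on HDFS (run once)
hadoop fs -mkdir -p /sparklog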

The original document also changes the log4j log level, but the newer Spark version uses log4j2 and the corresponding setting was not found, so it was left unchanged.

3.2 Distribute to the other machines

# Run on node1
scp -r spark-3.5.4-bin-hadoop3/ node2:$PWD
scp -r spark-3.5.4-bin-hadoop3/ node3:$PWD

# Run on node2 and node3
# create the symlink
ln -s ~/Work/2025/spark-3.5.4-bin-hadoop3/ ~/Work/2025/spark

3.3 Start Spark Standalone

Startup method 1

Start Spark on the master node:

./spark/sbin/start-all.sh
# ./spark/sbin/stop-all.sh  # stop
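
After start-all.sh, the daemons can be verified with jps from the JDK; a quick sketch (node1 should show both Master and Worker, node2/node3 only Worker):

# List running JVM processes on the current node
jps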

Start the HistoryServer:

./spark/sbin/start-history-server.sh

Startup method 2

On the master node, start the master and the workers separately:

./spark/sbin/start-master.sh
./spark/sbin/start-workers.sh

Web UI

(http://node1:8080/)

HistoryServer

In practice, the Spark components above can be started without Hadoop, but the HistoryServer requires Hadoop (HDFS) to be started first, since the event logs live on HDFS.

./spark/sbin/start-history-server.sh

(http://node1:18080)

3.4 Connect to the cluster

3.4.2 Connect with PySpark

Run on node1 (from the spark directory):

./bin/pyspark --master spark://node1:7077

For tests, see section 3.3 of the 黑马 Spark document.
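
A non-interactive way to exercise the same connection is to submit the bundled Pi example to the standalone master; a minimal sketch, subject to the Python-version caveat described below:

cd ~/Work/2025/spark
# Run the Python Pi example on the standalone cluster
./bin/spark-submit --master spark://node1:7077 examples/src/main/python/pi.py 10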

Pitfall: mismatched Python versions between driver and worker

25/02/03 16:27:31 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (192.168.233.131 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/user/Work/2025/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1100, in main raise PySparkRuntimeError( pyspark.errors.exceptions.base.PySparkRuntimeError: [PYTHON_VERSION_MISMATCH] Python in worker has different version (3, 10) than that in driver 3.12, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Solution:

vim ~/.bashrc
# append at the end
export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
# source it in the terminal, then run pyspark again
source ~/.bashrc
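
An alternative is to pin the interpreter cluster-wide in spark-env.sh, assuming Anaconda is installed at the same path on every node; a sketch:

# Pin PYSPARK_PYTHON for all Spark processes (path is this setup's Anaconda)
echo 'export PYSPARK_PYTHON=/home/user/anaconda3/bin/python' >> ~/Work/2025/spark/conf/spark-env.sh
# Distribute and restart the workers for the change to take effect
scp ~/Work/2025/spark/conf/spark-env.sh node2:~/Work/2025/spark/conf/
scp ~/Work/2025/spark/conf/spark-env.sh node3:~/Work/2025/spark/conf/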

4 Spark Standalone HA Mode Installation

sudo vim ./spark/conf/spark-env.sh

# Comment out or delete the MASTER_HOST entry:
# SPARK_MASTER_HOST=node1

# Add the following configuration:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1:2181,node2:2181,node3:2181 -Dspark.deploy.zookeeper.dir=/spark-ha"

Distribute:

cd ./spark/conf/
scp -r spark-env.sh node2:$PWD
scp -r spark-env.sh node3:$PWD

Make sure ZooKeeper has been started on all three nodes:

./zookeeper/bin/zkServer.sh start

Start the Spark cluster on node1:

~/Work/2025/spark/sbin/start-all.sh

Start a Master separately on node2:

~/Work/2025/spark/sbin/start-master.sh

Check the result in the Web UI:

  • node1 Master: (http://node1:8081), Status: ALIVE
  • node2 Master: (http://node2:8082), Status: STANDBY

Note: in practice ZooKeeper occupies port 8080, so when Spark starts on node1 the Master tries that port and falls back to 8081. Likewise, the Worker occupies port 8081 on node2, so when a Master is started on node2 it falls back to 8082. The port actually used can be found in the Spark Master startup log. Taking node1 as an example:

25/02/04 11:16:22 INFO JettyUtils: Start Jetty 0.0.0.0:8080 for MasterUI
25/02/04 11:16:23 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
25/02/04 11:16:23 INFO Utils: Successfully started service 'MasterUI' on port 8081.
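
To verify failover, a minimal sketch: stop the ALIVE Master on node1 and, after the ZooKeeper session times out (this can take a minute or two), node2's Master should switch from STANDBY to ALIVE in its Web UI:

# Run on node1: stop the currently ALIVE Master
~/Work/2025/spark/sbin/stop-master.sh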

5 Spark on YARN Environment Setup

5.1 Modify spark-env.sh

vim ./spark/conf/spark-env.sh and add the following:

HADOOP_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop
YARN_CONF_DIR=/home/user/Work/2025/hadoop/etc/hadoop

If the Standalone environment was already set up (see section 3.1.2), these are already configured and you can continue directly.

5.2 Modify Hadoop's yarn-site.xml

vim ./hadoop/etc/hadoop/yarn-site.xml and add the following inside <configuration>. (Note: compared with section 3.5 of [[Env(2) Hadoop部署安装]], only the "set the YARN cluster memory allocation scheme" part is added.)

    <!-- Location of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Set the YARN cluster memory allocation scheme -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>20480</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>

    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Retention time (seconds) of aggregated logs on HDFS -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <!-- Address of the YARN history server -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://node1:19888/jobhistory/logs</value>
    </property>
    <!-- Disable YARN memory checks -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>

Sync:

cd ./hadoop/etc/hadoop
scp -r yarn-site.xml node2:$PWD
scp -r yarn-site.xml node3:$PWD
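
If YARN is already running, it has to be restarted for the changed yarn-site.xml to take effect; a sketch, assuming the Hadoop layout used in this setup:

# Run on node1: restart YARN so the NodeManagers pick up the new configuration
~/Work/2025/hadoop/sbin/stop-yarn.sh
~/Work/2025/hadoop/sbin/start-yarn.sh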

5.3 Set the Spark history server address

vim ./spark/conf/spark-defaults.conf

spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://node1:8020/sparklog/
# Note: the original 黑马 document uses "node1:9820" here, which is wrong; it should be 8020
spark.eventLog.compress                 true
spark.yarn.historyServer.address        node1:18080

Sync (you can wait until section 5.4 is done and sync everything together):

scp -r spark-defaults.conf node2:$PWD
scp -r spark-defaults.conf node3:$PWD

5.4 Configure the dependency Spark jars

When a Spark application is submitted to run on YARN, by default the Spark jar dependencies have to be uploaded to the YARN cluster on every submission. To save submission time and storage space, upload the Spark jars to an HDFS directory once and set a property to tell the Spark application where they are.

hadoop fs -mkdir -p /spark/jars/
hadoop fs -put ./spark/jars/* /spark/jars/

vim spark-defaults.conf and add:

spark.yarn.jars                         hdfs://node1:8020/spark/jars/*

cd ./spark/conf and sync to the other nodes (see section 5.3).

5.5 Start the services

On node1, start HDFS, YARN, the MR HistoryServer, and the Spark HistoryServer in order:

cd ./hadoop/sbin
## Start the HDFS and YARN services
./start-dfs.sh
./start-yarn.sh
# Note: in YARN mode there is no need to start Spark's Master and Worker

## Start the MR HistoryServer service
mapred --daemon start historyserver
## Start the Spark HistoryServer service (the script lives in spark/sbin, not hadoop/sbin)
~/Work/2025/spark/sbin/start-history-server.sh

5.6 Submit a test job

./spark/bin/spark-submit \
--master yarn \
--conf "spark.pyspark.driver.python=/home/user/anaconda3/bin/python" \
--conf "spark.pyspark.python=/home/user/anaconda3/bin/python" \
~/Work/2025/spark/examples/src/main/python/pi.py \
10

Result:

25/02/04 23:04:25 INFO YarnScheduler: Killing all running tasks in stage 0: Stage finished
25/02/04 23:04:25 INFO DAGScheduler: Job 0 finished: reduce at /home/user/Work/2025/spark-3.5.4-bin-hadoop3/examples/src/main/python/pi.py:42, took 26.052944 s
Pi is roughly 3.140240
25/02/04 23:04:25 INFO SparkContext: SparkContext is stopping with exitCode 0.
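
The finished application also shows up in YARN's application list; a quick check, assuming the yarn CLI is on PATH:

# List applications tracked by the ResourceManager
yarn application -list -appStates ALL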

While the job is running it can be monitored in the YARN Web UI (http://node1:8088); after it finishes, view it in the Spark HistoryServer (http://node1:18080).
