The previous article showed how to run a Scala program with Spark; this one covers how to run a Java program with Spark.
1. Environment Setup
I installed Spark, Scala, and SBT for this walkthrough with SDKMAN, which makes setup very easy; a quick look at the official site is enough to get everything installed (a sketch of the install commands follows the version list below). The versions used here are:
spark == 3.1.2
scala == 2.12.14
sbt == 1.5.5
jdk == 1.8.0_292
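For reference, a sketch of the SDKMAN commands that would install these versions. The exact Java candidate identifier depends on the vendor, so it is left as a placeholder here; run `sdk list java` and pick a 1.8.0_292 build:
$ sdk install spark 3.1.2
$ sdk install scala 2.12.14
$ sdk install sbt 1.5.5
$ sdk install java <candidate-version>   # choose a 1.8.0_292 candidate from `sdk list java`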
2. Creating the Project
First, create the project directory and the two files we need:
$ mkdir -p ~/projects/sparkApps/MLAccuracy/src/main/java/
$ touch ~/projects/sparkApps/MLAccuracy/src/main/java/CalcAccuracy.java
$ touch ~/projects/sparkApps/MLAccuracy/pom.xml
The resulting directory structure looks like this:
MLAccuracy
├── pom.xml
└── src
└── main
└── java
└── CalcAccuracy.java
3. Main Java Source
The Java source that computes accuracy (CalcAccuracy.java):
import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.SparkConf;

public final class CalcAccuracy {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("CalculateAccuracy").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // (prediction, label) pairs; 3 of the 5 predictions match their labels.
        List<Tuple2<Double, Double>> data = Arrays.asList(
                new Tuple2<>(1.0, 0.0),
                new Tuple2<>(0.0, 1.0),
                new Tuple2<>(0.0, 0.0),
                new Tuple2<>(0.0, 0.0),
                new Tuple2<>(0.0, 0.0)
        );
        JavaRDD<Tuple2<Double, Double>> predictionsAndLabels = sc.parallelize(data);

        // MulticlassMetrics works on the underlying RDD of (prediction, label) pairs.
        MulticlassMetrics metrics = new MulticlassMetrics(predictionsAndLabels.rdd());
        System.out.println(metrics.accuracy());  // 3/5 = 0.6

        sc.stop();
    }
}
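MulticlassMetrics exposes more than accuracy. As a sketch (not part of the original program), the following lines could be added inside main() before sc.stop() to print a few of the other metrics; the label 1.0 passed to fMeasure() is simply the positive class in the sample data above:

// Additional metrics available on the same MulticlassMetrics instance.
System.out.println("Confusion matrix:\n" + metrics.confusionMatrix());
System.out.println("Weighted precision: " + metrics.weightedPrecision());
System.out.println("Weighted recall: " + metrics.weightedRecall());
System.out.println("F-measure for label 1.0: " + metrics.fMeasure(1.0));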
4. Maven Build File
The Maven build file (pom.xml):
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.y.x</groupId>
    <artifactId>calulate-accuracy</artifactId>
    <name>Calculate Accuracy</name>
    <packaging>jar</packaging>
    <version>1.0</version>

    <repositories>
        <repository>
            <id>jboss</id>
            <name>JBoss Repository</name>
            <url>http://repository.jboss.com/maven2/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.14</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>
    </dependencies>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.encoding>UTF-8</maven.compiler.encoding>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
</project>
Package the project from its root directory:
$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ---------------------< com.y.x:calulate-accuracy >----------------------
[INFO] Building Calculate Accuracy 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ calulate-accuracy ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/yumingmin/projects/sparkApps/MLAccuracy/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ calulate-accuracy ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 1 source file to /home/yumingmin/projects/sparkApps/MLAccuracy/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ calulate-accuracy ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/yumingmin/projects/sparkApps/MLAccuracy/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ calulate-accuracy ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ calulate-accuracy ---
[INFO] No tests to run.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ calulate-accuracy ---
[INFO] Building jar: /home/yumingmin/projects/sparkApps/MLAccuracy/target/calulate-accuracy-1.0.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.308 s
[INFO] Finished at: 2021-09-22T19:15:31+08:00
[INFO] ------------------------------------------------------------------------
5. Running the Program
As with Scala, a Java program is submitted to Spark with spark-submit. The log output is verbose, so only the last few lines are shown here:
$ spark-submit --class "CalcAccuracy" ~/projects/sparkApps/MLAccuracy/target/calulate-accuracy-1.0.jar
21/09/22 19:15:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
21/09/22 19:15:41 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1582 bytes result sent to driver
21/09/22 19:15:41 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 104 ms on master104088 (executor driver) (1/1)
21/09/22 19:15:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
21/09/22 19:15:41 INFO DAGScheduler: ResultStage 1 (collectAsMap at MulticlassMetrics.scala:61) finished in 0.118 s
21/09/22 19:15:41 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
21/09/22 19:15:41 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
21/09/22 19:15:41 INFO DAGScheduler: Job 0 finished: collectAsMap at MulticlassMetrics.scala:61, took 0.761263 s
0.6
Note: the value passed to --class is the class defined in CalcAccuracy.java, not the file name.
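In this example the (prediction, label) pairs are hard-coded; in a real job they would usually come from a trained model's predictions on a test set. Below is a minimal sketch of that workflow, assuming a logistic regression model and a LIBSVM-format input file; the class name CalcAccuracyFromModel and the path data/sample_libsvm_data.txt are placeholders, not part of the original project:

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public final class CalcAccuracyFromModel {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CalcAccuracyFromModel").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a labeled dataset (placeholder path) and split it into training and test sets.
        JavaRDD<LabeledPoint> data =
                MLUtils.loadLibSVMFile(sc.sc(), "data/sample_libsvm_data.txt").toJavaRDD();
        JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}, 11L);
        JavaRDD<LabeledPoint> training = splits[0].cache();
        JavaRDD<LabeledPoint> test = splits[1];

        // Train a simple binary classifier.
        LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
                .setNumClasses(2)
                .run(training.rdd());

        // Build (prediction, label) pairs from the model's output on the test set.
        JavaPairRDD<Object, Object> predictionsAndLabels = test.mapToPair(p ->
                new Tuple2<>(model.predict(p.features()), p.label()));

        MulticlassMetrics metrics = new MulticlassMetrics(predictionsAndLabels.rdd());
        System.out.println("Accuracy: " + metrics.accuracy());

        sc.stop();
    }
}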