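The previous article showed how to run a Scala program with Spark. This article shows how to run a Java program with Spark.
1. Environment Setup
For the environment used in this article, I installed Spark, Scala, and SBT through SDKMAN, which is very convenient. The official SDKMAN site explains the steps, so you can get the environment ready quickly. The build in section 4 also requires Maven (the mvn command).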
    spark == 3.1.2
    scala == 2.12.14
    sbt == 1.5.5
    jdk == 1.8.0_292
2. Creating the Project
First, create the project directory and the two empty files we will fill in:
    $ mkdir -p ~/projects/sparkApps/MLAccuracy/src/main/java/
    $ touch ~/projects/sparkApps/MLAccuracy/src/main/java/CalcAccuracy.java
    $ touch ~/projects/sparkApps/MLAccuracy/pom.xml
The resulting directory structure is:
    MLAccuracy
    ├── pom.xml
    └── src
        └── main
            └── java
                └── CalcAccuracy.java
3. Main Java Source
The Java source that computes accuracy (CalcAccuracy.java):
    import java.util.Arrays;
    import java.util.List;

    import scala.Tuple2;

    import org.apache.spark.api.java.*;
    import org.apache.spark.mllib.evaluation.MulticlassMetrics;
    import org.apache.spark.SparkConf;

    public final class CalcAccuracy {
        public static void main(String[] args) throws Exception {
            // Run locally; the application name appears in the Spark logs and UI.
            SparkConf sparkConf = new SparkConf()
                .setAppName("CalculateAccuracy")
                .setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);

            // Each tuple is a (prediction, label) pair.
            List<Tuple2<Double, Double>> data = Arrays.asList(
                new Tuple2<>(1.0, 0.0),
                new Tuple2<>(0.0, 1.0),
                new Tuple2<>(0.0, 0.0),
                new Tuple2<>(0.0, 0.0),
                new Tuple2<>(0.0, 0.0)
            );
            JavaRDD<Tuple2<Double, Double>> predictionsAndLabels = sc.parallelize(data);

            // MulticlassMetrics works on the underlying Scala RDD, hence .rdd().
            MulticlassMetrics metrics = new MulticlassMetrics(predictionsAndLabels.rdd());
            System.out.println(metrics.accuracy());
        }
    }
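As a sanity check on what the program should print, the same five (prediction, label) pairs can be scored by hand without Spark. The sketch below is only an illustration: AccuracyByHand and its accuracy method are hypothetical names that are not part of the project, and it assumes accuracy means the fraction of pairs whose prediction equals the label.

```java
public final class AccuracyByHand {
    // Fraction of (prediction, label) pairs whose two values are equal.
    public static double accuracy(double[][] predictionsAndLabels) {
        int correct = 0;
        for (double[] pair : predictionsAndLabels) {
            if (pair[0] == pair[1]) {
                correct++;
            }
        }
        return (double) correct / predictionsAndLabels.length;
    }

    public static void main(String[] args) {
        // Same pairs as in CalcAccuracy.java: {prediction, label}.
        double[][] data = {
            {1.0, 0.0}, {0.0, 1.0}, {0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}
        };
        System.out.println(accuracy(data)); // prints 0.6
    }
}
```

Three of the five pairs match, so the result is 3/5 = 0.6.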
4. Maven Build File
The Maven build file (pom.xml):
    <project>
      <groupId>com.y.x</groupId>
      <artifactId>calulate-accuracy</artifactId>
      <modelVersion>4.0.0</modelVersion>
      <name>Calculate Accuracy</name>
      <packaging>jar</packaging>
      <version>1.0</version>
      <repositories>
        <repository>
          <id>jboss</id>
          <name>JBoss Repository</name>
          <url>http://repository.jboss.com/maven2/</url>
        </repository>
      </repositories>
      <dependencies>
        <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.12.14</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.12</artifactId>
          <version>3.1.2</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-mllib_2.12</artifactId>
          <version>3.1.2</version>
        </dependency>
      </dependencies>
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.encoding>UTF-8</maven.compiler.encoding>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
      </properties>
    </project>
Run the following in the project root directory:
    $ mvn package
    [INFO] Scanning for projects...
    [INFO]
    [INFO] ---------------------< com.y.x:calulate-accuracy >----------------------
    [INFO] Building Calculate Accuracy 1.0
    [INFO] --------------------------------[ jar ]---------------------------------
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ calulate-accuracy ---
    [INFO] Using 'UTF-8' encoding to copy filtered resources.
    [INFO] skip non existing resourceDirectory /home/yumingmin/projects/sparkApps/MLAccuracy/src/main/resources
    [INFO]
    [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ calulate-accuracy ---
    [INFO] Changes detected - recompiling the module!
    [INFO] Compiling 1 source file to /home/yumingmin/projects/sparkApps/MLAccuracy/target/classes
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ calulate-accuracy ---
    [INFO] Using 'UTF-8' encoding to copy filtered resources.
    [INFO] skip non existing resourceDirectory /home/yumingmin/projects/sparkApps/MLAccuracy/src/test/resources
    [INFO]
    [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ calulate-accuracy ---
    [INFO] No sources to compile
    [INFO]
    [INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ calulate-accuracy ---
    [INFO] No tests to run.
    [INFO]
    [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ calulate-accuracy ---
    [INFO] Building jar: /home/yumingmin/projects/sparkApps/MLAccuracy/target/calulate-accuracy-1.0.jar
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 2.308 s
    [INFO] Finished at: 2021-09-22T19:15:31+08:00
    [INFO] ------------------------------------------------------------------------
5. Running the Program
A Java program is run on Spark with spark-submit, just like a Scala one. The log output is long, so only the last few lines are shown here:
    $ spark-submit --class "CalcAccuracy" ~/projects/sparkApps/MLAccuracy/target/calulate-accuracy-1.0.jar
    21/09/22 19:15:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
    21/09/22 19:15:41 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1582 bytes result sent to driver
    21/09/22 19:15:41 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 104 ms on master104088 (executor driver) (1/1)
    21/09/22 19:15:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
    21/09/22 19:15:41 INFO DAGScheduler: ResultStage 1 (collectAsMap at MulticlassMetrics.scala:61) finished in 0.118 s
    21/09/22 19:15:41 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
    21/09/22 19:15:41 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
    21/09/22 19:15:41 INFO DAGScheduler: Job 0 finished: collectAsMap at MulticlassMetrics.scala:61, took 0.761263 s
    0.6
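Note: the --class argument takes the name of the class defined in CalcAccuracy.java. That class is in the default package (the file has no package declaration), so the bare class name is enough. The jar file name comes from the artifactId and version in pom.xml.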
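To make the role of --class more concrete, the sketch below shows one way a launcher could start a program from nothing but a class name, using Java reflection. It illustrates the general mechanism only and is not Spark's actual code. LaunchByName, HelloApp, and launch are hypothetical names introduced for this example.

```java
import java.lang.reflect.Method;

public final class LaunchByName {
    // Hypothetical stand-in for an application class such as CalcAccuracy.
    public static final class HelloApp {
        public static String lastMessage = null;

        public static void main(String[] args) {
            lastMessage = "hello " + String.join(",", args);
            System.out.println(lastMessage);
        }
    }

    // Load a class by its name and invoke its static main(String[]) method.
    public static void launch(String className, String[] args) throws Exception {
        Class<?> cls = Class.forName(className);
        Method main = cls.getMethod("main", String[].class);
        // Cast to Object so the array is passed as one argument, not as varargs.
        main.invoke(null, (Object) args);
    }

    public static void main(String[] args) throws Exception {
        launch(HelloApp.class.getName(), new String[] {"spark"});
    }
}
```

If the name passed to launch does not match a class on the classpath, Class.forName throws ClassNotFoundException. This is why the name has to match the class exactly, including the package prefix when the class declares one.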
6. References
- http://spark.apache.org/docs/3.1.2/mllib-evaluation-metrics.html
- https://blog.csdn.net/youbitch1/article/details/88421965
- https://blog.csdn.net/qq_38250124/article/details/84666833
- https://blog.csdn.net/qq_38980688/article/details/101467543
- https://www.cnblogs.com/beststrive/p/10935742.html
- https://blog.csdn.net/qq_32653877/article/details/81949173
- https://blog.csdn.net/wjm_520/article/details/106089219
