When I implemented an accuracy calculation with Spark, the code ran fine in spark-shell. But it occurred to me that a real model will never serve predictions through spark-shell, which led to this article: how to run a Scala program on Spark.
1. Environment Setup
For the environment used in this article, I installed Spark, Scala, and SBT via SDKMAN, which is very convenient; the official sites will get you set up quickly. The versions are:
spark == 3.1.2
scala == 2.12.14
sbt == 1.5.5
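With SDKMAN, each install is a one-liner. For example (the version numbers below mirror the list above and are assumptions on my part; run sdk list spark, sdk list scala, or sdk list sbt to see what is currently available):

$ sdk install spark 3.1.2
$ sdk install scala 2.12.14
$ sdk install sbt 1.5.5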
2. Creating the Project
First, create the project directory and files:
$ mkdir -p ~/projects/sparkApps/MLAccuracy/src/main/scala/
$ touch ~/projects/sparkApps/MLAccuracy/src/main/scala/calculateAccuracy.scala
$ touch ~/projects/sparkApps/MLAccuracy/calculateAccuracy.sbt
The resulting directory structure looks like this:
MLAccuracy
├── calculateAccuracy.sbt
└── src
└── main
└── scala
└── calculateAccuracy.scala
3. Main Code
The Scala source that computes the accuracy (calculateAccuracy.scala):
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.{SparkConf, SparkContext}

object CalculateAccuracy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("calculateAccuracy").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    /**
     * Confusion matrix for the toy data:
     *                         actually positive (P)   actually negative (N)
     * predicted positive (T)            0                       1
     * predicted negative (F)            1                       3
     */
    val FP = Array((1.0, 0.0))              // predicted 1.0, label 0.0: 1 false positive
    val FN = Array((0.0, 1.0))              // predicted 0.0, label 1.0: 1 false negative
    val TN = new Array[(Double, Double)](3) // 3 true negatives
    for (i <- TN.indices) {
      TN(i) = (0.0, 0.0)
    }
    val all = FP ++ FN ++ TN

    // MulticlassMetrics expects an RDD of (prediction, label) pairs
    val predictionsAndLabels = sc.parallelize(all)
    val metrics = new MulticlassMetrics(predictionsAndLabels)
    println(metrics.accuracy)
  }
}
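As a sanity check on the result: of the five (prediction, label) pairs, only the three true negatives are correct, so the expected accuracy is 3/5 = 0.6, which matches the output in section 5. If you want more detail than a single number, MulticlassMetrics also exposes the confusion matrix and per-class metrics; a minimal sketch of extra lines you could add inside main after constructing metrics:

// Optional extras, reusing the `metrics` value from above:
println(metrics.confusionMatrix)  // the full confusion matrix
println(metrics.precision(1.0))   // precision for class 1.0
println(metrics.recall(1.0))      // recall for class 1.0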
The SBT build file (calculateAccuracy.sbt):
name := "Calculate Accuracy"
version := "1.0"
scalaVersion := "2.12.14"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-mllib" % "3.1.2"
)
Note: this is the final, working sbt file. Getting there was not entirely smooth; the next section walks through the problems I ran into.
4. Packaging the JAR with SBT
Since SDKMAN had installed Scala 2.13.6 for me, alongside sbt 1.5.5 and Spark 3.1.2, the sbt file initially looked like this:
name := "Calculate Accuracy"
version := "1.0"
scalaVersion := "2.13.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
4.1 The package command
Running sbt package in the MLAccuracy directory builds the project:
$ sbt package
[info] welcome to sbt 1.5.5 (AdoptOpenJDK Java 1.8.0_292)
[info] loading project definition from /home/yumingmin/projects/sparkApps/MLAccuracy/project
[info] loading settings for project mlaccuracy from calculateAccuracy.sbt ...
[info] set current project to Calculate Accuracy (in build file:/home/yumingmin/projects/sparkApps/MLAccuracy/)
[info] Updating
https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.13.6/scala-library-2.13.6.pom
100.0% [##########] 1.6 KiB (1.6 KiB / s)
[info] Resolved dependencies
[warn]
[warn] Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading org.apache.spark:spark-core_2.13:3.1.2
[error] Not found
[error] Not found
[error] not found: /home/yumingmin/.ivy2/local/org.apache.spark/spark-core_2.13/3.1.2/ivys/ivy.xml
[error] not found: https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.13/3.1.2/spark-core_2.13-3.1.2.pom
This happens because Spark 3.1.2 is only published for Scala 2.12: with scalaVersion set to 2.13.6, sbt looks for a spark-core_2.13 artifact at version 3.1.2, which does not exist. We can check the versions in play with sbt about:
$ sbt about
[info] This is sbt 1.5.5
[info] The current project is ProjectRef(uri("file:/home/yumingmin/projects/sparkApps/MLAccuracy/"), "mlaccuracy") 1.0
[info] The current project is built against Scala 2.13.6
[info] Available Plugins
[info] - sbt.ScriptedPlugin
[info] - sbt.plugins.CorePlugin
[info] - sbt.plugins.Giter8TemplatePlugin
[info] - sbt.plugins.IvyPlugin
[info] - sbt.plugins.JUnitXmlReportPlugin
[info] - sbt.plugins.JvmPlugin
[info] - sbt.plugins.MiniDependencyTreePlugin
[info] - sbt.plugins.SbtPlugin
[info] - sbt.plugins.SemanticdbPlugin
[info] sbt, sbt plugins, and build definitions are using Scala 2.12.14
The build still fails at this point, but note the last line: everything is on Scala 2.12.14, the same Scala line that Spark 3.1.2 is built against. Let's change scalaVersion in the sbt file to match:
name := "Calculate Accuracy"
version := "1.0"
scalaVersion := "2.12.14"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2",
4.2 Missing Spark dependency
Packaging still fails, but the error has changed: now the mllib module cannot be found, because it was never declared as a dependency:
$ sbt package
[info] welcome to sbt 1.5.5 (AdoptOpenJDK Java 1.8.0_292)
[info] loading project definition from /home/yumingmin/projects/sparkApps/MLAccuracy/project
[info] loading settings for project mlaccuracy from calculateAccuracy.sbt ...
[info] set current project to mlaccuracy (in build file:/home/yumingmin/projects/sparkApps/MLAccuracy/)
[info] compiling 1 Scala source to /home/yumingmin/projects/sparkApps/MLAccuracy/target/scala-2.12/classes ...
[error] /home/yumingmin/projects/sparkApps/MLAccuracy/src/main/scala/calculateAccuracy.scala:1:12: object apache is not a member of package org
[error] import org.apache.spark.mllib.evaluation.MulticlassMetrics
The fix is to declare every Spark module the code needs:
name := "Calculate Accuracy"
version := "1.0"
scalaVersion := "2.12.14"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-mllib" % "3.1.2"
)
The libraryDependencies setting can also be written as separate += lines:
name := "Calculate Accuracy"
version := "1.0"
scalaVersion := "2.12.14"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.1.2"
Packaging now succeeds: the project compiles cleanly into a JAR, and a calculate-accuracy_2.12-1.0.jar file appears under MLAccuracy/target/scala-2.12/.
$ sbt package
[info] welcome to sbt 1.5.5 (AdoptOpenJDK Java 1.8.0_292)
[info] loading project definition from /home/yumingmin/projects/sparkApps/MLAccuracy/project
[info] loading settings for project mlaccuracy from calculateAccuracy.sbt ...
[info] set current project to Calculate Accuracy (in build file:/home/yumingmin/projects/sparkApps/MLAccuracy/)
[info] compiling 1 Scala source to /home/yumingmin/projects/sparkApps/MLAccuracy/target/scala-2.12/classes ...
[success] Total time: 10 s, completed Sep 22, 2021 3:03:16 PM
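Before submitting, you can confirm the compiled class actually landed in the archive by listing its contents with the JDK's jar tool (assuming it is on your PATH); CalculateAccuracy.class should appear among the entries:

$ jar tf ~/projects/sparkApps/MLAccuracy/target/scala-2.12/calculate-accuracy_2.12-1.0.jar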
5. Running the Program
Scala programs are run on Spark with spark-submit. The log output is verbose, so only the last few lines are shown here:
$ spark-submit --class "CalculateAccuracy" ~/projects/sparkApps/MLAccuracy/target/scala-2.12/calculate-accuracy_2.12-1.0.jar
21/09/22 15:11:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master104088, 38507, None)
21/09/22 15:11:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master104088, 38507, None)
0.6
Note: the value passed to --class is the object name defined in calculateAccuracy.scala (CalculateAccuracy), not the file name.
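A related caveat: the code above hardcodes setMaster("local"), and settings made programmatically on SparkConf take precedence over spark-submit's --master flag. If you later want to run the same JAR on a cluster, a common pattern is to drop setMaster from the code and choose the master at submit time; a sketch under that assumption:

// In the code, set only the app name and let spark-submit decide the master:
val conf = new SparkConf().setAppName("calculateAccuracy")

$ spark-submit --class "CalculateAccuracy" --master local[*] \
    ~/projects/sparkApps/MLAccuracy/target/scala-2.12/calculate-accuracy_2.12-1.0.jar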
If the program runs without errors but you cannot find the result in the terminal, the println output is probably buried in Spark's logging. Check this line in $SPARK_HOME/conf/log4j.properties and lower the level (e.g. change INFO to ERROR):
log4j.rootCategory=INFO, console