Workflow
- Create the project
- Go to the pom.xml file and check which Spark version and which Scala version you need (the _2.13 suffix on each artifactId must match your Scala version):
<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.13</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-yarn -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-yarn_2.13</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.13</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.28</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.13</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.13</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.13</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>2.13.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.alibaba/druid -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>druid</artifactId>
        <version>1.2.8</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <!-- This plugin compiles the Scala code into class files -->
            <!-- https://mvnrepository.com/artifact/net.alchim31.maven/scala-maven-plugin -->
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.5.6</version>
            <executions>
                <execution>
                    <!-- Bind Scala compilation to Maven's compile phases -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.3.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
If this is a submodule, you need to add Scala support to the module.
- Create a Scala class
import org.apache.spark.SparkContext // the origin of every job; the starting point of all computation
import org.apache.spark.SparkConf // Spark's configuration, analogous to the Conf in MapReduce; it overrides the defaults in the config files (for whatever you set); its main jobs are naming the app and choosing local vs. cluster mode at runtime
- Write the code (refer to the official documentation); a minimal sketch follows this list
If running on Windows, set setMaster("local[n]")
If running on the cluster, remove setMaster("local[n]"), or use setMaster("spark://master:7077") (not recommended, since it hardcodes the master)
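A minimal WordCount sketch under these conventions. It assumes the input and output paths arrive as args(0) and args(1), matching the in/out arguments passed to spark-submit at the end of this section:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local[2]") // remove (or override via spark-submit --master) for cluster runs
    val sc = new SparkContext(conf)

    sc.textFile(args(0))            // lazy: just records the lineage
      .flatMap(_.split("\\s+"))     // transformation: lines -> words
      .map((_, 1))                  // transformation: word -> (word, 1)
      .reduceByKey(_ + _)           // transformation: sum counts per word
      .saveAsTextFile(args(1))      // action: triggers the whole job, writes part-* files

    sc.stop()
  }
}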
Note two key terms: transformation and action (in the sketch above, everything before saveAsTextFile is a transformation; saveAsTextFile itself is the action).
How does saveAsTextFile in Spark end up producing a single file?
With the default textFile, reading from HDFS gives a default partition count of 2 in the source code; to change it, pass textFile's second parameter, minPartitions.
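For example (a sketch; the HDFS host and port are placeholders):

val rdd = sc.textFile("hdfs://master:9000/in", 8) // ask for at least 8 partitions instead of the default 2
println(rdd.getNumPartitions)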
In general, saveAsTextFile produces one file per task that ran: part-00000 through part-0000n, where n reflects the number of tasks, i.e. the partition count of the final stage.
Calling coalesce(1, true).saveAsTextFile() on an RDD means gathering the data into a single partition once the computation is done, and only then saving. With a single partition, Spark naturally starts only one task for the save, so only one file is produced. Alternatively, call repartition(1), which is just a wrapper around coalesce with the second parameter (shuffle) defaulted to true.
You can do this, but the cost is huge: Spark faces large volumes of data and executes in parallel, so forcing a single final partition inevitably generates a lot of disk and network I/O, and the memory of the node running the final reduce is also put under serious strain. Spark becomes very slow, and may even die.
saveAsTextFile requires that the target directory not already exist, otherwise it errors out, so check for it first. A sketch of both points follows.
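Continuing from the rdd above, this sketch combines the single-partition save with an existence check via the Hadoop FileSystem API (the output path is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

val out = new Path("hdfs://master:9000/out")     // placeholder output path
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true)         // saveAsTextFile errors out if the directory exists

rdd.coalesce(1, shuffle = true)                  // one partition -> one task -> one part file
   .saveAsTextFile(out.toString)                 // rdd.repartition(1) would do the same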
- Shell
Package the jar (with setMaster removed), then submit:
spark-submit \
--master spark://master:7077 \
--executor-memory 512M \
--total-executor-cores 2 \
--class org.apache.spark.WordCount \
xxx.jar \
in \
out