Spark SQL: the Spark component for processing structured data; it is not just SQL, so the name is a bit misleading.
SchemaRDD -> DataFrame (SchemaRDD was the old name; it was renamed to DataFrame in Spark 1.3)
What is the difference between Dataset and RDD? (Roughly: a Dataset carries a schema and an Encoder, so Spark can optimize execution and operate on serialized data; an RDD is an opaque collection of JVM objects.)
Python has no Dataset support, but it does support DataFrame.
Spark SQL official documentation
I. Getting Started
1. SparkSession in Spark 2.0 provides built-in support for Hive features, including the ability to write queries using HiveQL,
access to Hive UDFs, and the ability to read data from Hive tables.
2. Since Spark 2.0, SQLContext and HiveContext are no longer recommended; use SparkSession instead.
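A minimal sketch of building such a SparkSession (the app name and warehouse path are placeholders, not from the original notes):

import org.apache.spark.sql.SparkSession

// Since Spark 2.0 a single SparkSession replaces SQLContext/HiveContext.
// enableHiveSupport() turns on HiveQL, Hive UDFs and Hive table access.
val spark = SparkSession.builder()
  .appName("sparksql-notes")                            // placeholder name
  .config("spark.sql.warehouse.dir", "/tmp/warehouse")  // assumed path
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._  // implicit Encoders and the $"col" syntax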
Dataset[T]: a strongly typed collection of objects (Spark must know exactly what type T is).
A Dataset is associated with an Encoder. The Encoder serializes the Dataset's objects into Spark's binary format and tells Spark how the original type T maps onto Spark's internal types, so that many subsequent operations can run on the serialized bytes without deserializing the objects first.
## Encoders are generated automatically via implicits (in Scala, import spark.implicits._)
For example, given a class `Person` with two fields, `name` (string) and `age` (int), an encoder is used to tell Spark to generate code at runtime to serialize the `Person` object into a binary structure. (From the Spark API docs.)
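A runnable sketch of that Person example in Scala (the sample names and ages are made up); toDS() picks up the implicitly generated Encoder:

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("encoder-demo").getOrCreate()
import spark.implicits._  // provides the implicit Encoder[Person]

// The derived Encoder serializes Person into Spark's binary format, so
// filter/sort/shuffle can work on the bytes without deserializing first.
val people: Dataset[Person] = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
people.filter(_.age > 20).show()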
DataFrame = Dataset[Row] (Spark only needs to know the elements are of the generic Row type, hence "untyped").
A distributed collection of data organized into named columns; i.e. an untyped Dataset.
DataFrames introduced a schema and off-heap storage (the Tungsten binary format).
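A small sketch contrasting the untyped DataFrame API with the typed Dataset API (the data and app name are illustrative only):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("df-vs-ds").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Andy", 32), ("Justin", 19)).toDF("name", "age")
df.select($"name", $"age" + 1).show()    // untyped: columns resolved by name at runtime

val ds: Dataset[Person] = df.as[Person]  // typed view over the same data
ds.map(p => p.name.toUpperCase).show()   // field access checked at compile time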
II. Data Sources
1. Generic Load/Save Functions
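For reference, a sketch of the generic read/write API (assuming the SparkSession named spark from above; file paths are placeholders):

// format() defaults to parquet when not specified.
val users = spark.read.format("json").load("examples/users.json")
users.select("name", "favorite_color")
  .write.format("parquet")
  .save("namesAndFavColors.parquet")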
* Saving to Persistent Tables
df.write.option("path", "/some/path").saveAsTable("t")
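Per the Spark docs, passing an explicit path option to saveAsTable creates what is effectively an external table: dropping the table later removes only the metadata, and the data under /some/path stays in place.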
* Bucketing, Sorting and Partitioning
usersDF
  .write
  .partitionBy("favorite_color")  // one output directory per distinct color
  .bucketBy(42, "name")           // 42 buckets, hashed by name
  .saveAsTable("users_partitioned_bucketed")
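Note that bucketBy only works together with saveAsTable, i.e. for persistent, metastore-backed tables; partitionBy can also be used with a plain path-based save().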
2. Hive Tables
spark.sql.sources.bucketing.enabled (default: true): when false, Spark treats bucketed tables as normal tables.
Last time the bucketed table did not take effect; that may be related to this setting.
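A quick sketch (assuming an existing SparkSession named spark) to verify the setting before re-testing the bucketed table:

// If this was flipped to false, bucketed tables are read as plain tables.
println(spark.conf.get("spark.sql.sources.bucketing.enabled"))  // expect "true"
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")   // re-enable for this session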