SQL syntax differences between Presto, Hive, and Spark

Official documentation

Presto: https://prestodb.github.io/docs/current/
Spark: http://spark.apache.org/docs/latest/api/sql/
Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Hive to Presto: https://prestodb.github.io/docs/current/migration/from-hive.html
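Presto uses ANSI SQL syntax and semantics.
Hive uses a SQL-like language called HiveQL, which is loosely modeled after MySQL (which itself differs from ANSI SQL in many ways).
Spark SQL supports HiveQL syntax.
The specific differences are listed below; additions are welcome.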
1. Column types

The STRING type in Hive and Spark corresponds to the VARCHAR type in Presto.
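As a minimal sketch of where this difference shows up, the same cast is spelled differently in each engine. The table orders and the column order_id are hypothetical names used only for illustration.

```sql
-- Hive / Spark SQL: the character type is STRING
SELECT CAST(order_id AS STRING) AS order_id_str FROM orders;

-- Presto: the character type is VARCHAR
SELECT CAST(order_id AS VARCHAR) AS order_id_str FROM orders;
```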
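2. Column names

Spark, Hive: keywords such as date cannot be used directly as column names or aliases, and Chinese aliases are not supported.
Presto: Chinese aliases are supported.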
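A common workaround for keyword collisions is to quote the identifier: backticks in Hive and Spark SQL, double quotes in Presto. The sketch below is an illustration rather than something the notes above were tested against. The table events is hypothetical, and whether quoting is accepted can depend on engine version and configuration.

```sql
-- Hive / Spark SQL: backticks quote an identifier that collides with a keyword
SELECT `date` AS dt FROM events;

-- Presto: double quotes delimit identifiers, including a Chinese alias
SELECT "date" AS "日期" FROM events;
```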
3. Date formats

Hive, Spark: yyyy-MM-dd HH:mm:ss (Java-style patterns)
Presto: %Y-%m-%d %H:%i:%S (MySQL-style patterns, as used by date_format and date_parse)
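The two pattern styles can be compared side by side by formatting the same instant in each engine. This is a sketch under two assumptions: the literal timestamp is made up for illustration, and the Hive version in use provides date_format.

```sql
-- Hive / Spark SQL: Java-style pattern letters
SELECT date_format('2021-03-04 05:06:07', 'yyyy-MM-dd HH:mm:ss');

-- Presto: MySQL-style % specifiers; the first argument must be a timestamp
SELECT date_format(TIMESTAMP '2021-03-04 05:06:07', '%Y-%m-%d %H:%i:%S');

-- Presto: parsing a string back into a timestamp uses the same specifiers
SELECT date_parse('2021-03-04 05:06:07', '%Y-%m-%d %H:%i:%S');
```

Note that minutes are mm in the Java-style pattern but %i in the MySQL-style one; %m in Presto means the month.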
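4. Date functions

4.1 from_unixtime

Hive, Spark: from_unixtime(unix_time[, format]). The optional second argument is a format pattern, and the result is a string.
Presto: supports from_unixtime(unixtime), from_unixtime(unixtime, string), and from_unixtime(unixtime, hours, minutes). Here the string argument is a time zone, not a format pattern, and hours and minutes give a time zone offset; the result is a timestamp.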
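The signatures above can be sketched as follows. The epoch value 1609459200 and the time zone name are arbitrary examples chosen for illustration; the displayed results depend on the session time zone, so no outputs are shown.

```sql
-- Hive / Spark SQL: returns a string; the second argument is a format pattern
SELECT from_unixtime(1609459200);
SELECT from_unixtime(1609459200, 'yyyy-MM-dd');

-- Presto: returns a timestamp; a string second argument names a time zone
SELECT from_unixtime(1609459200);
SELECT from_unixtime(1609459200, 'Asia/Shanghai');
SELECT from_unixtime(1609459200, 8, 0);   -- offset of +08:00

-- Presto: to get a formatted string, wrap the result in date_format
SELECT date_format(from_unixtime(1609459200), '%Y-%m-%d');
```

A Hive query that passes a format pattern as the second argument therefore cannot be ported to Presto unchanged, because Presto would try to interpret the pattern as a time zone.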
4.2 date_add

Hive, Spark: date_add(start_date, num_days)
Presto: date_add(unit, value, timestamp)
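For instance, adding seven days to a date looks like this in each engine. The date literal is an arbitrary example.

```sql
-- Hive / Spark SQL: the unit is always days
SELECT date_add('2021-01-01', 7);

-- Presto: the unit comes first, then the amount, then the date or timestamp
SELECT date_add('day', 7, DATE '2021-01-01');
```

Both queries yield 2021-01-08. Because Presto takes the unit as an argument, the same function also covers other intervals such as 'month' or 'hour'.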
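5. JSON handling

Hive, Spark: select get_json_object(json, '$.book');
Presto: select json_extract_scalar(json, '$.book');
Note that json_extract_scalar in Presto returns a varchar, and the path must point to a scalar (boolean, number, or string). The related function json_extract returns a value of type json instead. You therefore need to know what kind of value the path points to before choosing between them.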
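The difference between the two Presto functions is easier to see on a concrete value. The JSON literal below is made up for illustration.

```sql
-- Hive / Spark SQL: get_json_object returns a string
SELECT get_json_object('{"book": {"title": "SQL", "price": 10}}', '$.book.title');

-- Presto: a scalar at the path comes back as varchar
SELECT json_extract_scalar('{"book": {"title": "SQL", "price": 10}}', '$.book.title');

-- Presto: an object or array at the path needs json_extract, which returns json
SELECT json_extract('{"book": {"title": "SQL", "price": 10}}', '$.book');

-- Presto: cast the varchar result when a number is needed
SELECT CAST(json_extract_scalar('{"book": {"title": "SQL", "price": 10}}', '$.book.price') AS INTEGER);
```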
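6. Map

6.1 Looking up a value by key in a map

Presto: element_at(map, key)
Spark: get_json_object(to_json(extra), '$.is_baomai')
Spark SQL provides element_at(map, key) starting with version 2.4.0, but our company currently runs version 2.3.2. On that version the approach above can be used: convert the map to JSON first, then read the value with get_json_object(json, '$.book').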
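Written out as full queries, the lookup looks like this. The map column extra and the key is_baomai come from the example above; the table name events is hypothetical.

```sql
-- Presto: element_at returns NULL when the key is absent
SELECT element_at(extra, 'is_baomai') FROM events;

-- Spark SQL before 2.4.0: serialize the map to JSON, then extract by path
SELECT get_json_object(to_json(extra), '$.is_baomai') FROM events;

-- Spark SQL 2.4.0 and later: element_at is available directly
SELECT element_at(extra, 'is_baomai') FROM events;
```

The JSON workaround always returns a string, so a cast may be needed afterwards. The subscript form extra['is_baomai'] may also be worth testing on your Spark version as a simpler alternative.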
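7. Expanding a delimited column into rows

Hive: select student, score from tests lateral view explode(split(scores, ',')) t as score;
Presto: select student, score from tests cross join unnest(split(scores, ',')) as t (score);
Put simply, the scores column holds comma-separated scores such as 80,90,99,80. These queries expand that single value into multiple rows, giving a one-to-many mapping from each student to their scores.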
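A self-contained Presto sketch makes the one-to-many expansion visible without needing a real table. The student name alice is made up; the score string is the one from the explanation above.

```sql
-- Presto: inline one-row table standing in for tests
SELECT student, score
FROM (VALUES ('alice', '80,90,99,80')) AS tests (student, scores)
CROSS JOIN UNNEST(split(scores, ',')) AS t (score);

-- Result: four rows, one per score
--   alice | 80
--   alice | 90
--   alice | 99
--   alice | 80
```

The score values are still strings after split, so add CAST(score AS INTEGER) if numeric scores are needed.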
8. Complex grouping

Hive: select origin_state, origin_zip, sum(package_weight) from shipping group by origin_state,origin_zip with rollup;
Presto: select origin_state, origin_zip, sum(package_weight) from shipping group by rollup (origin_state, origin_zip);
ROLLUP produces multi-level aggregates by dropping grouping columns one at a time from right to left. It is equivalent to the following (Presto syntax):
select origin_state, origin_zip, sum(package_weight) from shipping group by grouping sets ((origin_state, origin_zip), (origin_state), ());
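To see the three aggregation levels, here is a small Presto sketch that runs against inline data instead of a real shipping table. The four sample rows are made up for illustration.

```sql
-- Presto: inline sample rows standing in for the shipping table
SELECT origin_state, origin_zip, sum(package_weight) AS total_weight
FROM (
    VALUES
        ('CA', 94131, 10),
        ('CA', 94131, 5),
        ('CA', 90210, 7),
        ('NY', 10002, 3)
) AS shipping (origin_state, origin_zip, package_weight)
GROUP BY ROLLUP (origin_state, origin_zip);

-- Result (row order is not guaranteed):
--   CA   | 94131 | 15    -- per (state, zip)
--   CA   | 90210 | 7
--   NY   | 10002 | 3
--   CA   | NULL  | 22    -- per state
--   NY   | NULL  | 3
--   NULL | NULL  | 25    -- grand total
```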
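9. group by and order by

Hive, Spark: group by and order by do not accept the ordinal position of an output column; you must write the column name or expression. This can depend on version and configuration (Hive, for example, has a hive.groupby.orderby.position.alias setting), so check your own deployment.
Example: GROUP BY substr(dt,1,7) order by substr(dt,1,7)
Presto: supports column names, expressions, and output-column ordinal positions.
Example (the two queries are equivalent):
SELECT count(*), nationkey FROM customer GROUP BY 2;
SELECT count(*), nationkey FROM customer GROUP BY nationkey;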