解析json
使用函数:get_json_object(json, path)
$ | json |
---|---|
. | json.key |
[index] | json[index] |
导入表:CTAS
create table rate
as select
get_json_object(movie_str, '$[0][0].movie'),
get_json_object(movie_str, '$[0][0].rate'),
get_json_object(movie_str, '$[0][0].timeStamp'),
get_json_object(movie_str, '$[0][0].uid')
from movie_all;
多分隔符
数据:
3952::Contender, The (2000)::Drama|Thriller
默认创建表不支持多分隔符
Storage Information
序列化反序列化:SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
输入:InputFormat: org.apache.hadoop.mapred.TextInputFormat
输出:OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
解决方案:
- mapreduce数据清洗
- 自定义
建表时手动指定:
create table movies(mid int,mname string,type string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' //序列化器
with serdeproperties(
'input.regex' = '(.*)::(.*)::(.*)', //如何解析输入的数据
'output.format,string' = '%1$s %2$s %3$s' //解析后的数据如何输出
)
stored by textfile;
结果:
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.RegexSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat