解析json
使用函数:get_json_object(json, path)
| $ | json |
|---|---|
| . | json.key |
| [index] | json[index] |
导入表:CTAS
create table rateas selectget_json_object(movie_str, '$[0][0].movie'),get_json_object(movie_str, '$[0][0].rate'),get_json_object(movie_str, '$[0][0].timeStamp'),get_json_object(movie_str, '$[0][0].uid')from movie_all;
多分隔符
数据:
3952::Contender, The (2000)::Drama|Thriller
默认创建表不支持多分隔符
Storage Information
序列化反序列化:SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
输入:InputFormat: org.apache.hadoop.mapred.TextInputFormat
输出:OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
解决方案:
- mapreduce数据清洗
- 自定义
建表时手动指定:
create table movies(mid int,mname string,type string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' //序列化器
with serdeproperties(
'input.regex' = '(.*)::(.*)::(.*)', //如何解析输入的数据
'output.format,string' = '%1$s %2$s %3$s' //解析后的数据如何输出
)
stored by textfile;
结果:
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.RegexSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
