解析json

使用函数:get_json_object(json, path)

$ json
. json.key
[index] json[index]

导入表:CTAS

  1. create table rate
  2. as select
  3. get_json_object(movie_str, '$[0][0].movie'),
  4. get_json_object(movie_str, '$[0][0].rate'),
  5. get_json_object(movie_str, '$[0][0].timeStamp'),
  6. get_json_object(movie_str, '$[0][0].uid')
  7. from movie_all;

多分隔符

数据:

3952::Contender, The (2000)::Drama|Thriller

默认创建表不支持多分隔符

Storage Information
序列化反序列化:SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
输入:InputFormat: org.apache.hadoop.mapred.TextInputFormat
输出:OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

解决方案:

  1. mapreduce数据清洗
  2. 自定义

建表时手动指定:

create table movies(mid int,mname string,type string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'     //序列化器
with serdeproperties(
'input.regex' = '(.*)::(.*)::(.*)',                             //如何解析输入的数据
'output.format,string' = '%1$s %2$s %3$s'                       //解析后的数据如何输出
)
stored by textfile;

结果:
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.RegexSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat