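Data Import

You can upload a data file straight into a Hive table's storage location with hadoop fs -put and then query it with select. However, this approach does not update the table's metadata, so operations such as select count(*) may return results that do not match the actual data.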

Importing with load

Load data into a table (load):

load data [local] inpath 'filepath' [overwrite] into table table_name [partition (partcol1=val1, ...)];

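load data: load data into the table
local: load from the local file system; without it, the file is loaded from HDFS
inpath: path of the data to load
overwrite: overwrite the existing data in the table; without it, the data is appended
into table: the table to load into
partition: for a partitioned table, the partition to load into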
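
As a sketch of how the optional clauses combine, the statements below load a local file into a single partition and replace whatever that partition already holds. The table student_part and its month partition column are hypothetical names introduced only for this illustration.

```sql
-- Hypothetical partitioned table, used only for this illustration
create table if not exists student_part(id string, name string)
partitioned by (month string)
row format delimited fields terminated by ',';

-- local: read the file from the local file system
-- overwrite: replace the existing contents of the target partition
-- partition: name the partition that receives the data
load data local inpath '/home/tengyer/student.txt'
overwrite into table student_part partition (month='201709');
```

Dropping overwrite would append the file to the partition instead of replacing its contents.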

Example (run in the Hive CLI):

-- Create the table
create table student(id string, name string) row format delimited fields terminated by ',';
-- Load the data
load data local inpath '/home/tengyer/student.txt' into table student;

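A count(*) at this point runs a MapReduce job:

-- A file uploaded directly with hadoop fs -put is invisible to Hive: the metadata is unchanged, so count(*) does not run MR and simply returns numRows from the metadata, which may differ from the actual row count
-- After a load, numFiles in the table metadata changes but numRows is not updated, so count(*) launches a MapReduce job and counts the actual rows
select count(*) from student;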
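
To see the statistics this behavior depends on, the table parameters can be inspected and refreshed. The sketch below assumes the student table from the example above. Whether count(*) is answered from statistics also depends on the Hive version and on settings such as hive.compute.query.using.stats, so treat it as illustrative only.

```sql
-- Show table details; the "Table Parameters" section lists numFiles, numRows and totalSize
describe formatted student;

-- Recompute table-level statistics so that numRows reflects the files actually present
analyze table student compute statistics;

-- Check the parameters again after the statistics have been refreshed
describe formatted student;
```

Running analyze table after a direct hadoop fs -put is one way to bring the stored statistics back in line with the data.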

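With local, the file is uploaded (copied) to HDFS.

Without local, the original file on HDFS is moved into the table's directory. (HDFS does not actually move any data; only the file information on the NameNode is modified, so the operation is very fast.)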
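
The move can be observed from the Hive CLI with its built-in dfs command. This is a sketch under assumptions: /tmp/student.txt is a hypothetical HDFS path, and /user/hive/warehouse/student is assumed to be the table directory under the default warehouse location.

```sql
-- Assumption: the file was placed on HDFS beforehand at this hypothetical path
dfs -ls /tmp/student.txt;

-- No local keyword: the file is taken from HDFS
load data inpath '/tmp/student.txt' into table student;

-- The file should now be listed under the table directory
dfs -ls /user/hive/warehouse/student;

-- The source path should no longer exist, so this listing is expected to report that
dfs -ls /tmp/student.txt;
```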

Inserting data with insert

Insert data into a table from a query:

-- Plain insert
insert into table student values('1001', 'zhangsan'), ('1002', 'lisi');
-- Insert the result of a query on a single table
insert overwrite table student
select id, name from test where month='201709';

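insert into: appends data to the table or partition; existing data is not deleted. insert into table can be shortened to insert into.
insert overwrite: overwrites the data already in the table. The table keyword in insert overwrite table cannot be omitted.

Notes:

Older Hive versions cannot insert into only a subset of columns; a column list after the table name is accepted from Hive 1.2.0 onward.
insert can read from and write to the same table, that is, insert into table A select * from A is allowed. Using insert overwrite table A select ... from A with some fields replaced in the select list is a way to avoid update.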
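
The self-insert trick can be sketched as follows, assuming the student table from the earlier example. The replacement value 'zhangsan_new' is made up for illustration.

```sql
-- Rewrite one row's name without update: read student and overwrite it with the result
insert overwrite table student
select id,
       -- Replace the name only for the chosen id; every other row keeps its value
       case when id = '1001' then 'zhangsan_new' else name end as name
from student;
```

Because the whole table is rewritten, this suits occasional corrections better than frequent row-level changes.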

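Multi-insert mode (multiple tables or partitions), where one scan of the source feeds several inserts. The target here is assumed to be partitioned by month:

-- Source table to query
from student
-- Target table (partition)
insert overwrite table student partition(month='201707')
-- Rows to insert
select id, name where month='201709'
-- Another target table (partition)
insert overwrite table student partition(month='201706')
-- Rows to insert
select id, name where month='201706';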

Creating a table with as select

Create a table from a query result:

create table if not exists student3
as
select * from student;

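Specifying the data path with location at table creation

Upload the data file to HDFS first, then point the table at that HDFS location when creating it.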

hadoop fs -mkdir /student
hadoop fs -put student.txt /student

create external table if not exists student4 (
    id string,
    name string
)
row format delimited
fields terminated by ','
-- Point to a directory that already contains the data files
location '/student';

Importing data into a table with import

Export the data with export first, then load it with import:

import table student2 from '/user/hive/warehouse/export/student';

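Conditions for import:

The from path must be one produced by the export command, containing _metadata and the data directory
If table student2 does not exist, it is created automatically and the data is imported
If table student2 exists, the data can be imported only when its structure (including delimiters and so on) matches the exported table and the table is empty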
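
A round trip through export and import might look like the sketch below. It assumes the student table exists, that student2 does not exist yet, and that the export directory (the same path used in the import statement above) is empty or absent beforehand.

```sql
-- Write the table's metadata and data files to an HDFS directory
export table default.student to '/user/hive/warehouse/export/student';

-- student2 does not exist yet, so import creates it from the exported metadata
import table student2 from '/user/hive/warehouse/export/student';

-- The imported table should contain the same rows as the source table
select * from student2;
```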

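Data Export

Exporting with insert

Export a query result to the local file system:

-- The directory does not have to exist; it is created automatically
insert overwrite local directory '/home/tengyer/student'
select * from student;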

Export a query result to the local file system with explicit formatting:

insert overwrite local directory '/home/tengyer/student'
row format delimited
fields terminated by ','
select * from student;

Export a query result to HDFS:

-- Rarely used; hdfs commands such as cp and mv are usually used directly to avoid running MR
insert overwrite directory '/student'
row format delimited
fields terminated by ','
select * from student;

Exporting with hadoop commands

Export to the local file system:

hadoop fs -get /user/hive/warehouse/student/student.txt /home/tengyer/student/student.txt

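Exporting with the Hive shell

Syntax:

# This simply runs SQL with hive -e or hive -f and uses the Linux > (overwrite) or >> (append) redirection to write the output to a file
hive -e|-f <statement or script file> > file

For example: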

hive -e 'select * from default.student;' > /home/tengyer/student/student.txt

Exporting to HDFS with export

Run in the Hive CLI:

export table default.student to '/student';

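Files produced by export:

_metadata file: stores the table's metadata
data directory: stores the table's data files

export and import are mainly used to migrate Hive tables between two Hadoop clusters.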

Clearing a table with truncate

truncate can only delete the data of managed tables, not the data of external tables:

truncate table student;
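
The difference between the two table types can be checked with a sketch like the following, assuming the managed table student and the external table student4 created earlier. The exact error text, and whether table properties can relax the restriction, vary by Hive version, so the comments describe expected behavior only.

```sql
-- "Table Type" in the output shows MANAGED_TABLE or EXTERNAL_TABLE
describe formatted student;
describe formatted student4;

-- Managed table: the data files are removed, the table definition stays
truncate table student;

-- External table: Hive is expected to reject this statement and leave the files untouched
truncate table student4;
```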
