:::tips
Note ⚠️
This walkthrough uses a Hadoop Catalog as the example.
The hostname is set to datalake.
:::
Install Hadoop
Install and Start Flink
Download Flink 1.14.4 (the Scala 2.12 build) and extract it into the flink-1.14.4 directory:
```bash
wget https://dlcdn.apache.org/flink/flink-1.14.4/flink-1.14.4-bin-scala_2.12.tgz
tar -xvf flink-1.14.4-bin-scala_2.12.tgz
```
Open the configuration file:
```bash
vi /etc/profile
```
Add the HADOOP_CLASSPATH environment variable:
```bash
# HADOOP_HOME is your Hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
```
Start the Flink cluster:
```bash
cd flink-1.14.4
bin/start-cluster.sh
```
Start the Flink SQL Client
Using a hadoop catalog (hadoop catalogs are supported by default)
Download the iceberg-flink-runtime dependency (into the flink-1.14.4 directory):
```bash
wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.13.1/iceberg-flink-runtime-1.14-0.13.1.jar
```
Load iceberg-flink-runtime when starting the Flink SQL Client:
```bash
./bin/sql-client.sh -j iceberg-flink-runtime-1.14-0.13.1.jar
```
Using a hive catalog
To use a hive catalog, we also need to download the flink-sql-connector-hive dependency (into the flink-1.14.4 directory):
```bash
wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.6_2.12/1.14.4/flink-sql-connector-hive-2.3.6_2.12-1.14.4.jar
```
- Load both iceberg-flink-runtime and flink-sql-connector-hive when starting the Flink SQL Client:
```bash
./bin/sql-client.sh -j flink-sql-connector-hive-2.3.6_2.12-1.14.4.jar -j iceberg-flink-runtime-1.14-0.13.1.jar
```
:::info
Note:
You can also download the Iceberg dependency jars into Flink's lib directory and restart Flink; the Flink SQL Client can then be started without -j flags for the Iceberg dependencies.
:::
Create and Use a Catalog
Create a Hive catalog
Use the Hive metastore to store metadata and HDFS to store the table data:
```sql
CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://datalake:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://datalake:9000/warehouse/hive'
);
```
:::info
- type: must be iceberg (required).
- catalog-type: hive or hadoop (the default), or leave it unset and use catalog-impl for a custom catalog implementation.
- uri: the Hive metastore address.
- clients: the size of the connection pool hive_catalog uses for Hive metastore clients; the default is 2.
- property-version: a version number describing the property format, usable for backward compatibility if the property format changes. The current property version is 1.
- warehouse: an HDFS path reachable from the Flink cluster; the location where tables under hive_catalog store their data. It takes precedence over hive-conf-dir.
- cache-enabled: whether catalog caching is enabled; the default is true.
- hive-conf-dir: the configuration directory of the Hive cluster (a local path only). The HDFS path parsed from its hive-site.xml is used.
:::
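Since most of these properties are shared between catalog types, it can help to keep them in one place and generate the WITH clause from them. A small Python sketch of such a generator (purely illustrative; the property names come from the list above, but the helper itself is made up for this article):

```python
def render_create_catalog(name, props):
    """Render a Flink SQL CREATE CATALOG statement from a property dict."""
    body = ",\n  ".join(f"'{k}'='{v}'" for k, v in props.items())
    return f"CREATE CATALOG {name} WITH (\n  {body}\n);"

sql = render_create_catalog("hive_catalog", {
    "type": "iceberg",
    "catalog-type": "hive",
    "uri": "thrift://datalake:9083",
    "warehouse": "hdfs://datalake:9000/warehouse/hive",
})
print(sql)
```

Keeping the properties in a dict makes it easy to stamp out per-environment variants (dev/prod warehouses) of the same catalog definition.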
Create a Hadoop catalog
Use HDFS to store both the metadata and the table data:
```sql
CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://datalake:9000/warehouse/hadoop',
  'property-version'='1'
);
```
:::info
- type: must be iceberg (required).
- catalog-type: hive or hadoop (the default), or leave it unset and use catalog-impl for a custom catalog implementation.
- warehouse: an HDFS path reachable from the Flink cluster; the location where tables under hadoop_catalog store their metadata and data.
- property-version: a version number describing the property format, usable for backward compatibility if the property format changes. The current property version is 1.
- cache-enabled: whether catalog caching is enabled; the default is true.
:::
Creating the catalog creates a hadoop subdirectory under the HDFS warehouse directory, along with a hadoop/default subdirectory (every catalog has a default database):
```bash
hdfs dfs -ls -R /warehouse
drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop
drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop/default
```
:::info
In principle, the catalog can be created when the Flink SQL Client starts by configuring conf/sql-cli-defaults.yaml; however, this did not take effect in our tests.
```yaml
catalogs:
  - name: hadoop_catalog
    type: iceberg
    catalog-type: hadoop
    property-version: 1
    cache-enabled: true
    warehouse: hdfs://datalake:9000/warehouse/hadoop
```
:::
Use the catalog
```sql
USE CATALOG hadoop_catalog;
```
DDL Commands
Databases
List databases
Every catalog has a default database:
```sql
SHOW DATABASES;
+---------------+
| database name |
+---------------+
|       default |
+---------------+
1 row in set
```
Create a database
- If you don't want to create tables under the default database, you can create a separate one:
```sql
CREATE DATABASE iceberg_db;

SHOW DATABASES;
+---------------+
| database name |
+---------------+
|       default |
|    iceberg_db |
+---------------+
2 rows in set
```
- Creating the database creates an iceberg_db subdirectory on HDFS:
```bash
hdfs dfs -ls -R /warehouse/hadoop
drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop/default
drwxr-xr-x - root supergroup 0 2022-04-22 16:58 /warehouse/hadoop/iceberg_db
```
Drop a database
- Drop the database:
```sql
DROP DATABASE iceberg_db;

SHOW DATABASES;
+---------------+
| database name |
+---------------+
|       default |
+---------------+
1 row in set
```
- Dropping the database removes the iceberg_db subdirectory from HDFS:
```bash
hdfs dfs -ls -R /warehouse/hadoop
drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop/default
```
Use a database
```sql
USE CATALOG hadoop_catalog;
USE iceberg_db;
```
or
```sql
USE hadoop_catalog.iceberg_db;
```
Tables
Create a table
- Create a table:
```sql
USE hadoop_catalog.iceberg_db;

CREATE TABLE iceberg_table (
  id BIGINT COMMENT 'unique id',
  data STRING
);
```
or
```sql
CREATE TABLE hadoop_catalog.iceberg_db.iceberg_table (
  id BIGINT COMMENT 'unique id',
  data STRING
);
```
:::info
Features not yet supported in table DDL:
- Hidden partitioning
- Computed columns
- Watermarks
- Adding, dropping, renaming, and changing columns (tracked by FLINK-19062)
:::
:::info
The format can be specified via format-version:
- Version 1: append tables. All tested column types can be inserted, but querying TINYINT and SMALLINT columns fails, and a primary key cannot be set, so upsert is unavailable.
- Version 2: upsert tables. With upsert enabled, TINYINT and SMALLINT cannot be inserted, and the primary key must include the partition columns: the delete writer has to know which partition it is writing to, so the key must contain the fields used for partitioning. Rows are then modified by key.
:::
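The v2 upsert behavior described above can be pictured as keeping, per key, only the latest row, with the key required to contain the partition column. A minimal Python sketch of that semantics (illustrative only, not Iceberg code; column names follow the examples in this article):

```python
def upsert(table, rows, key_cols):
    """Apply rows to `table`, replacing any existing row with the same key.

    The key must include the partition column, so the delete writer
    knows which partition holds the row being replaced.
    """
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        table[key] = row  # acts as an equality delete followed by an insert
    return table

table = {}
upsert(table, [{"id": 1, "data": "a", "category": "C1"}], ["category", "id"])
upsert(table, [{"id": 1, "data": "a2", "category": "C1"}], ["category", "id"])
# The second write replaces the first: one row remains, with data "a2".
```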
- Creating the table creates an iceberg_table subdirectory on HDFS:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db
drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 1176 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
Create a partitioned table
```sql
CREATE TABLE hadoop_catalog.iceberg_db.iceberg_table_pt (
  id BIGINT COMMENT 'unique id',
  data STRING,
  category STRING
) PARTITIONED BY (category);
```
:::info Iceberg supports hidden partitioning, but Flink does not support partitioning by a function over columns, so hidden partitioning cannot currently be expressed in Flink DDL. :::
Copy a table
```sql
USE hadoop_catalog.iceberg_db;
CREATE TABLE iceberg_table_copy LIKE iceberg_table;
```
The copied table has the same schema, partitioning, and table properties.
Alter a table
:::info
Note ⚠️
Currently only changing Iceberg table properties and renaming tables are supported.
:::
- Change table properties:
```sql
USE hadoop_catalog.iceberg_db;
ALTER TABLE iceberg_table_copy SET ('write.format.default'='avro');

SHOW CREATE TABLE iceberg_table_copy;
CREATE TABLE `hadoop_catalog`.`iceberg_db`.`iceberg_table_copy` (
  `id` BIGINT,
  `data` VARCHAR(2147483647)
) WITH (
  'write.format.default' = 'avro'
)
```
- Rename a table (tables in a Hadoop Catalog cannot be renamed):
```sql
USE hadoop_catalog.iceberg_db;
ALTER TABLE iceberg_table_copy RENAME TO iceberg_table_copy_new;

[ERROR] Could not execute SQL statement. Reason:
java.lang.UnsupportedOperationException: Cannot rename Hadoop tables
```
Drop a table
Drop the table:
```sql
USE hadoop_catalog.iceberg_db;
DROP TABLE iceberg_table_copy;
```
Dropping the table removes the iceberg_table_copy subdirectory from HDFS:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db
drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 1176 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
Writes
INSERT INTO
- Write data:
```sql
USE hadoop_catalog.iceberg_db;
INSERT INTO iceberg_table(id, data) VALUES (1, 'a');
INSERT INTO iceberg_table(id, data) VALUES (2, 'b');

INSERT INTO iceberg_table SELECT (id + 2), data FROM iceberg_table;
```
- Inspect the files: each SQL statement produced a new metadata.json plus a Parquet file:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/data
-rw-r--r-- 3 root supergroup 658 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-81dd1531-9c34-4108-b97d-0a0dedd2270f-00001.parquet
-rw-r--r-- 3 root supergroup 658 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-d92a76a2-cbf9-4b20-89af-87706ede571e-00001.parquet
-rw-r--r-- 3 root supergroup 657 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-e507ea82-15a5-44d5-a5c0-2f475c877599-00001.parquet
drwxr-xr-x - root supergroup 0 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 5798 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/21564066-ba86-46a2-944c-48a20171ddf4-m0.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/27a5e943-70f1-4c10-b52f-f75f9772c9db-m0.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/3ad437f2-0db8-4096-8419-6e9845253dc8-m0.avro
-rw-r--r-- 3 root supergroup 3890 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-1586352819259469942-1-21564066-ba86-46a2-944c-48a20171ddf4.avro
-rw-r--r-- 3 root supergroup 3842 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-4988714748194111917-1-3ad437f2-0db8-4096-8419-6e9845253dc8.avro
-rw-r--r-- 3 root supergroup 3776 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6065247658548389325-1-27a5e943-70f1-4c10-b52f-f75f9772c9db.avro
-rw-r--r-- 3 root supergroup 1176 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 2218 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
-rw-r--r-- 3 root supergroup 3295 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
-rw-r--r-- 3 root supergroup 4372 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
INSERT OVERWRITE
:::info Only Batch mode is supported, and the overwrite granularity is the partition: whole partitions are replaced, not individual rows. For a non-partitioned table, the entire table is replaced. :::
Non-partitioned tables
- iceberg_table is not partitioned, so the whole table is replaced:
```sql
SET sql-client.execution.result-mode = tableau;
SET 'execution.runtime-mode' = 'batch';

INSERT OVERWRITE iceberg_table VALUES (1, 'a');

SELECT * FROM iceberg_table;
+----+------+
| id | data |
+----+------+
|  1 |    a |
+----+------+
1 row in set
```
- Inspect the files: a new v5.metadata.json was added:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
-rw-r--r-- 3 root supergroup 5590 2022-04-22 17:43 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v5.metadata.json
```
Partitioned tables
- Insert data:
```sql
USE hadoop_catalog.iceberg_db;
INSERT INTO iceberg_table_pt(id, data, category) VALUES (1, 'a', 'C1'), (2, 'b', 'C1'), (3, 'c', 'C2'), (4, 'd', 'C2');

SELECT * FROM iceberg_table_pt;
+----+------+----------+
| id | data | category |
+----+------+----------+
|  1 |    a |       C1 |
|  2 |    b |       C1 |
|  3 |    c |       C2 |
|  4 |    d |       C2 |
+----+------+----------+
4 rows in set
```
- Rows in different partitions are stored under the corresponding partition directories (category=C1 and category=C2):
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table_pt
drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data
drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C1
-rw-r--r-- 3 root supergroup 955 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C1/00000-0-b14f8238-33ad-48e7-a98c-809229415adf-00001.parquet
drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C2
-rw-r--r-- 3 root supergroup 955 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C2/00000-0-b14f8238-33ad-48e7-a98c-809229415adf-00002.parquet
drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata
-rw-r--r-- 3 root supergroup 6127 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/8820a6f2-bd06-4765-9141-549a8387a621-m0.avro
-rw-r--r-- 3 root supergroup 3789 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/snap-8809797762112100800-1-8820a6f2-bd06-4765-9141-549a8387a621.avro
-rw-r--r-- 3 root supergroup 1602 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 2652 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/v2.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/version-hint.text
```
- The partitions corresponding to the incoming rows are replaced in full:
```sql
INSERT OVERWRITE iceberg_table_pt VALUES (11, 'aa', 'C1'), (44, 'dd', 'C2');

SELECT * FROM iceberg_table_pt;
+----+------+----------+
| id | data | category |
+----+------+----------+
| 11 |   aa |       C1 |
| 44 |   dd |       C2 |
+----+------+----------+
2 rows in set
```
- Overwrite a given partition via the PARTITION clause:
```sql
INSERT OVERWRITE iceberg_table_pt PARTITION(category='C1') SELECT 111, 'aaa';
SELECT * FROM iceberg_table_pt;
+-----+------+----------+
| id | data | category |
+-----+------+----------+
| 44 | dd | C2 |
| 111 | aaa | C1 |
+-----+------+----------+
2 rows in set
```
:::info For a partitioned table, when the PARTITION clause sets values for all partition columns, the insert goes to a static partition; when it sets values for only some of them (a prefix of the partition columns), the query result is written into dynamic partitions. :::
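The partition-granularity overwrite above can be simulated with a dict keyed by partition value: only partitions that appear in the incoming rows are swapped out wholesale. A Python sketch (illustrative, not Flink code; the column names follow iceberg_table_pt):

```python
def insert_overwrite(table, rows, part_col="category"):
    """Replace, in full, every partition that appears in the incoming rows."""
    incoming = {}
    for row in rows:
        incoming.setdefault(row[part_col], []).append(row)
    table.update(incoming)  # whole partitions are swapped, never single rows
    return table

table = {
    "C1": [{"id": 1, "data": "a"}, {"id": 2, "data": "b"}],
    "C2": [{"id": 3, "data": "c"}, {"id": 4, "data": "d"}],
}
insert_overwrite(table, [{"id": 11, "data": "aa", "category": "C1"}])
# C1 was replaced entirely; C2 still holds its two original rows.
```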
Queries
Batch reads
```sql
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM iceberg_table;
```
Streaming reads
Full-plus-incremental reads
- Enable streaming reads:
```sql
SET 'execution.runtime-mode' = 'streaming';

-- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options.
SET table.dynamic-table-options.enabled=true;
```
- Read all records from the current snapshot, then read incremental data starting from that snapshot (full data + incremental data):
```sql
SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ;
+----+----------------------+--------------------------------+
| op |                   id |                           data |
+----+----------------------+--------------------------------+
| +I |                    1 |                              a |
```
Because iceberg_table was overwritten earlier, only one row is visible.
- In another window, start a Flink SQL Client and insert 3 new rows:
```sql
INSERT INTO iceberg_table(id, data) VALUES (5, 'e');
INSERT INTO iceberg_table(id, data) VALUES (6, 'f');
INSERT INTO iceberg_table(id, data) VALUES (7, 'g');

SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */ ;
+----+----------------------+--------------------------------+
| op |                   id |                           data |
+----+----------------------+--------------------------------+
| +I |                    1 |                              a |
| +I |                    6 |                              f |
| +I |                    5 |                              e |
| +I |                    7 |                              g |
```
The newly inserted rows appear in the earlier query's output.
- Inspect the files: 3 new metadata.json files were added:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
-rw-r--r-- 3 root supergroup 6667 2022-04-24 14:05 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v6.metadata.json
-rw-r--r-- 3 root supergroup 7744 2022-04-24 14:05 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v7.metadata.json
-rw-r--r-- 3 root supergroup 8821 2022-04-24 14:10 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v8.metadata.json
```
Incremental reads
- Enable streaming reads:
```sql
SET 'execution.runtime-mode' = 'streaming';

-- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options.
SET table.dynamic-table-options.enabled=true;
```
- Look up the snapshot-id of the overwrite (v5.metadata.json):
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v5.metadata.json | grep current-snapshot-id
"current-snapshot-id" : 4024674471005358751,
```
Read all incremental data starting after snapshot '4024674471005358751'; the records of that snapshot itself are not read (incremental data only):
```sql
SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='4024674471005358751') */ ;
+----+----------------------+--------------------------------+
| op |                   id |                           data |
+----+----------------------+--------------------------------+
| +I |                    6 |                              f |
| +I |                    7 |                              g |
| +I |                    5 |                              e |
```
Only the 3 rows inserted after the overwrite are visible here.
You can only specify the snapshot id of the last insert overwrite operation, or a later snapshot id; otherwise the job throws an exception in the background and stays in a restarting state:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json | grep current-snapshot-id
"current-snapshot-id" : 1586352819259469942,
```
```sql
SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='1586352819259469942') */ ;
+----+----------------------+--------------------------------+
| op |                   id |                           data |
+----+----------------------+--------------------------------+
[ERROR] Could not execute SQL statement. Reason:
java.lang.UnsupportedOperationException: Found overwrite operation, cannot support incremental data in snapshots (1586352819259469942, 4024674471005358751]
```
:::info
Currently only data from append operations is picked up; replace, overwrite, and delete operations are not supported.
:::
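The restriction above, that an incremental read cannot cross an overwrite, can be checked ahead of time by scanning the snapshots array of the latest metadata.json. A hedged Python sketch (the field names follow the metadata files shown in this article; the function is illustrative, not Iceberg's actual validation code):

```python
def incremental_range_ok(snapshots, start_snapshot_id):
    """True if every snapshot after start_snapshot_id is a plain append.

    `snapshots` is the "snapshots" array from metadata.json, ordered
    oldest to newest.
    """
    ids = [s["snapshot-id"] for s in snapshots]
    start = ids.index(start_snapshot_id)
    return all(s["summary"]["operation"] == "append"
               for s in snapshots[start + 1:])

snaps = [
    {"snapshot-id": 1586352819259469942, "summary": {"operation": "append"}},
    {"snapshot-id": 4024674471005358751, "summary": {"operation": "overwrite"}},
    {"snapshot-id": 111, "summary": {"operation": "append"}},  # made-up id
]
# Starting before the overwrite would fail; starting at the overwrite is fine.
```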
# Iceberg Storage Layout
Common Iceberg terms
- Snapshot: the state of a table at some point in time, including the set of all its data files. Each snapshot corresponds to one manifest list.
- Manifest list: an Avro file listing manifest files, one manifest file per row.
- Manifest file: an Avro file listing the data files that make up a snapshot, one data file per row.
- Data file: a file that actually stores the data, normally under the table's data directory.
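These four terms form a chain: a snapshot points to one manifest list, each row of the manifest list points to a manifest file, and each row of a manifest file points to a data file. A toy Python model of that chain (the structures are simplified stand-ins for the Avro files, and the file names are hypothetical):

```python
# snapshot -> manifest list -> manifest files -> data files
snapshot = {
    "manifest_list": [            # one row per manifest file
        {
            "manifest_path": "m0.avro",
            "entries": [          # one row per data file
                {"status": 1, "file_path": "part-000.parquet"},
            ],
        },
    ],
}

def data_files(snapshot):
    """Collect every data file reachable from a snapshot."""
    return [entry["file_path"]
            for manifest in snapshot["manifest_list"]
            for entry in manifest["entries"]]
```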
## Create a Table
- Drop the previously created tables:
```sql
DROP TABLE iceberg_table;
DROP TABLE iceberg_table_pt;
```
- Create the table:
```sql
USE hadoop_catalog.iceberg_db;

CREATE TABLE iceberg_table (
  id BIGINT COMMENT 'unique id',
  data STRING
);
```
- Inspect the HDFS directory: two new files appear, v1.metadata.json and version-hint.text:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/
drwxr-xr-x - root supergroup 0 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
Look at version-hint.text; it stores the current version number:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
1
```
Look at v1.metadata.json:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
```
{
"format-version" : 1,
"table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
"location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
"last-updated-ms" : 1650786845358,
"last-column-id" : 2,
"schema" : {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "long"
}, {
"id" : 2,
"name" : "data",
"required" : false,
"type" : "string"
} ]
},
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "long"
}, {
"id" : 2,
"name" : "data",
"required" : false,
"type" : "string"
} ]
} ],
"partition-spec" : [ ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : { },
"current-snapshot-id" : -1,
"snapshots" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ ]
}
Note "current-snapshot-id" : -1, which means the table has just been created. There is not much useful information here yet; the schema is the part to focus on.
Insert one row
Write data:
```sql
INSERT INTO iceberg_table(id, data) VALUES (1, 'a');
```
Inspect the HDFS directory: a new v2.metadata.json, a snapshot file (`snap-*.avro`), a manifest file (`*-m0.avro`), and a data file (`*.parquet`) appear:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data
-rw-r--r-- 3 root supergroup 658 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet
drwxr-xr-x - root supergroup 0 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 5793 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
-rw-r--r-- 3 root supergroup 3776 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
-rw-r--r-- 3 root supergroup 1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 2218 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
Look at version-hint.text; it is now 2:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
2
```
Look at v2.metadata.json:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
```
{
  "format-version" : 1,
  "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
  "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
  "last-updated-ms" : 1650787052068,
  "last-column-id" : 2,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "partition-spec" : [ ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : { },
  "current-snapshot-id" : 6040080682987879495,
  "snapshots" : [ {
    "snapshot-id" : 6040080682987879495,
    "timestamp-ms" : 1650787052068,
    "summary" : {
      "operation" : "append",
      "flink.job-id" : "9546c176d5418b18ee19c7fc6905152e",
      "flink.max-committed-checkpoint-id" : "9223372036854775807",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "658",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "658",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1650787052068,
    "snapshot-id" : 6040080682987879495
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1650786845358,
    "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json"
  } ]
}
Note "current-snapshot-id" : 6040080682987879495, the id of the current snapshot. The snapshots and snapshot-log lists both contain this snapshot, and metadata-log contains the previous version of the metadata. The current snapshot's manifest list is snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro.
The current snapshot created one data file and added one record:
```json
"added-data-files" : "1",
"added-records" : "1",
```
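Reading a table in a Hadoop catalog always starts the same way: version-hint.text names the current version N, vN.metadata.json names the current snapshot, and that snapshot names its manifest list. A Python sketch of the last step, run against a dict shaped like the v2.metadata.json above (illustrative; in reality these files live on HDFS):

```python
def current_snapshot(metadata):
    """Return the snapshot entry named by current-snapshot-id."""
    sid = metadata["current-snapshot-id"]
    return next(s for s in metadata["snapshots"] if s["snapshot-id"] == sid)

meta = {
    "current-snapshot-id": 6040080682987879495,
    "snapshots": [{
        "snapshot-id": 6040080682987879495,
        "summary": {"operation": "append", "added-records": "1"},
        "manifest-list": "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro",
    }],
}
snap = current_snapshot(meta)
```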
- Inspect the snapshot file (the manifest list):
:::info
Viewing Avro files requires the external tool avro-tools-1.10.2.jar:
:::
```bash
curl -O https://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.10.2/avro-tools-1.10.2.jar
```
The snapshot file contains only one row:
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":6040080682987879495},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
```
Below is the same JSON pretty-printed; it points to the manifest file 7a381735-54d2-402d-be25-0c09c6f35328-m0.avro:
{
"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro",
"manifest_length":5793,
"partition_spec_id":0,
"added_snapshot_id":{
"long":6040080682987879495
},
"added_data_files_count":{
"int":1
},
"existing_data_files_count":{
"int":0
},
"deleted_data_files_count":{
"int":0
},
"partitions":{
"array":[
]
},
"added_rows_count":{
"long":1
},
"existing_rows_count":{
"long":0
},
"deleted_rows_count":{
"long":0
}
}
- Inspect the manifest file:
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
{"status":1,"snapshot_id":{"long":6040080682987879495},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":658,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":52},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
```
The manifest file also contains only one row.
Below is the same JSON pretty-printed; it references the data file 00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet, and status is set to 1 (0: EXISTING, 1: ADDED, 2: DELETED).
{
"status":1,
"snapshot_id":{
"long":6040080682987879495
},
"data_file":{
"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet",
"file_format":"PARQUET",
"partition":{
},
"record_count":1,
"file_size_in_bytes":658,
"block_size_in_bytes":67108864,
"column_sizes":{
"array":[
{
"key":1,
"value":52
},
{
"key":2,
"value":52
}
]
},
"value_counts":{
"array":[
{
"key":1,
"value":1
},
{
"key":2,
"value":1
}
]
},
"null_value_counts":{
"array":[
{
"key":1,
"value":0
},
{
"key":2,
"value":0
}
]
},
"nan_value_counts":{
"array":[
]
},
"lower_bounds":{
"array":[
{
"key":1,
"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"a"
}
]
},
"upper_bounds":{
"array":[
{
"key":1,
"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"a"
}
]
},
"key_metadata":null,
"split_offsets":{
"array":[
4
]
},
"sort_order_id":{
"int":0
}
}
}
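The status field is how a manifest tracks a data file's lifecycle: 0 = EXISTING, 1 = ADDED, 2 = DELETED. A reader only keeps entries that are not deleted. A Python sketch of that filter (the entries are simplified dicts with hypothetical file names, not real Avro records):

```python
EXISTING, ADDED, DELETED = 0, 1, 2

def live_files(entries):
    """Keep the data files of manifest entries that were not deleted."""
    return [e["data_file"]["file_path"]
            for e in entries if e["status"] != DELETED]

entries = [
    {"status": ADDED, "data_file": {"file_path": "a.parquet"}},    # hypothetical
    {"status": DELETED, "data_file": {"file_path": "b.parquet"}},  # hypothetical
]
# Only a.parquet survives the filter.
```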
Insert another row
Write data:
```sql
INSERT INTO iceberg_table(id, data) VALUES (2, 'b');
```
Inspect the HDFS directory: a new v3.metadata.json, a snapshot file (`snap-*.avro`), a manifest file (`*-m0.avro`), and a data file (`*.parquet`) appear:
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/data
-rw-r--r-- 3 root supergroup 657 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet
-rw-r--r-- 3 root supergroup 658 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet
drwxr-xr-x - root supergroup 0 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 5792 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
-rw-r--r-- 3 root supergroup 3848 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro
-rw-r--r-- 3 root supergroup 3776 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
-rw-r--r-- 3 root supergroup 1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 2218 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
-rw-r--r-- 3 root supergroup 3295 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
Look at version-hint.text; it is now 3:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
3
```
Look at v3.metadata.json:
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
```
{
  "format-version" : 1,
  "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
  "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
  "last-updated-ms" : 1650788733525,
  "last-column-id" : 2,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "partition-spec" : [ ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : { },
  "current-snapshot-id" : 2331249842714188418,
  "snapshots" : [ {
    "snapshot-id" : 6040080682987879495,
    "timestamp-ms" : 1650787052068,
    "summary" : {
      "operation" : "append",
      "flink.job-id" : "9546c176d5418b18ee19c7fc6905152e",
      "flink.max-committed-checkpoint-id" : "9223372036854775807",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "658",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "658",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro",
    "schema-id" : 0
  }, {
    "snapshot-id" : 2331249842714188418,
    "parent-snapshot-id" : 6040080682987879495,
    "timestamp-ms" : 1650788733525,
    "summary" : {
      "operation" : "append",
      "flink.job-id" : "f6d1cefdf0f8bea368d80eb9081f6649",
      "flink.max-committed-checkpoint-id" : "9223372036854775807",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "657",
      "changed-partition-count" : "1",
      "total-records" : "2",
      "total-files-size" : "1315",
      "total-data-files" : "2",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1650787052068,
    "snapshot-id" : 6040080682987879495
  }, {
    "timestamp-ms" : 1650788733525,
    "snapshot-id" : 2331249842714188418
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1650786845358,
    "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json"
  }, {
    "timestamp-ms" : 1650787052068,
    "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json"
  } ]
}
Very similar to the result after the first insert. Note "current-snapshot-id" : 2331249842714188418. The snapshots and snapshot-log lists both gained this new snapshot, and metadata-log gained v2.metadata.json.
The current snapshot created one data file and added one record:
```json
"added-data-files" : "1",
"added-records" : "1",
```
- Inspect the snapshot file (the manifest list):
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro","manifest_length":5792,"partition_spec_id":0,"added_snapshot_id":{"long":2331249842714188418},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":6040080682987879495},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
```
The snapshot file now contains two rows, corresponding to this insert and the previous one.
Below is the first row pretty-printed; it points to the manifest file 2664ed23-930f-4510-9bf6-94d456155312-m0.avro:
{
"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro",
"manifest_length":5792,
"partition_spec_id":0,
"added_snapshot_id":{
"long":2331249842714188418
},
"added_data_files_count":{
"int":1
},
"existing_data_files_count":{
"int":0
},
"deleted_data_files_count":{
"int":0
},
"partitions":{
"array":[
]
},
"added_rows_count":{
"long":1
},
"existing_rows_count":{
"long":0
},
"deleted_rows_count":{
"long":0
}
}
Below is the second row pretty-printed; it points to the manifest file 7a381735-54d2-402d-be25-0c09c6f35328-m0.avro:
{
"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro",
"manifest_length":5793,
"partition_spec_id":0,
"added_snapshot_id":{
"long":6040080682987879495
},
"added_data_files_count":{
"int":1
},
"existing_data_files_count":{
"int":0
},
"deleted_data_files_count":{
"int":0
},
"partitions":{
"array":[
]
},
"added_rows_count":{
"long":1
},
"existing_rows_count":{
"long":0
},
"deleted_rows_count":{
"long":0
}
}
```
- 查看清单文件
第二个清单文件是第一次插入生成的,所以这里我们查看第一个清单文件
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro
{"status":1,"snapshot_id":{"long":2331249842714188418},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":657,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":51},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
```
与第一次插入类似,可以看到清单文件只有一行数据。
下面是 json 格式化后的结果,对应的数据文件 00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet,status 设置成了 1(0: EXISTING 1: ADDED 2: DELETED)
```bash
{
"status":1,
"snapshot_id":{
"long":2331249842714188418
},
"data_file":{
"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet",
"file_format":"PARQUET",
"partition":{
},
"record_count":1,
"file_size_in_bytes":657,
"block_size_in_bytes":67108864,
"column_sizes":{
"array":[
{
"key":1,
"value":51
},
{
"key":2,
"value":52
}
]
},
"value_counts":{
"array":[
{
"key":1,
"value":1
},
{
"key":2,
"value":1
}
]
},
"null_value_counts":{
"array":[
{
"key":1,
"value":0
},
{
"key":2,
"value":0
}
]
},
"nan_value_counts":{
"array":[
]
},
"lower_bounds":{
"array":[
{
"key":1,
"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"b"
}
]
},
"upper_bounds":{
"array":[
{
"key":1,
"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"b"
}
]
},
"key_metadata":null,
"split_offsets":{
"array":[
4
]
},
"sort_order_id":{
"int":0
}
}
}
```
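清单文件里的 lower_bounds/upper_bounds 按 Iceberg 的单值二进制序列化规则存储,long 类型为 8 字节小端序;avro-tools 的 JSON 输出把每个字节显示成了 \u00XX 形式的字符。下面用一小段 Python 脚本示意如何把上面 id 列(key=1)的边界值还原成数字(仅为示意):

```python
import struct

# 示意:解码清单文件中 lower_bounds/upper_bounds 的二进制值
# avro-tools 输出的字符串里,每个字符的码点就是原始字节的值,
# 先按 latin-1 还原成 bytes,再按小端序解析成 8 字节有符号整数
raw = "\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"  # id 列(key=1)的下界
value = struct.unpack("<q", raw.encode("latin-1"))[0]
print(value)  # 2,即本次插入的 id 值
```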
INSERT OVERWRITE
- 覆盖数据
```bash
SET 'execution.runtime-mode' = 'batch';
INSERT OVERWRITE iceberg_table VALUES (1, 'a');
```
- 查看 HDFS 目录,此时新增了一个 v4.metadata.json,快照(snap-*.avro),3 个清单文件(*-m0.avro,*-m1.avro,*-m2.avro)和数据文件(*.parquet)
```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x - root supergroup 0 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/data
-rw-r--r-- 3 root supergroup 657 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet
-rw-r--r-- 3 root supergroup 658 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet
-rw-r--r-- 3 root supergroup 658 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet
drwxr-xr-x - root supergroup 0 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r-- 3 root supergroup 5792 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro
-rw-r--r-- 3 root supergroup 5793 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
-rw-r--r-- 3 root supergroup 3848 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro
-rw-r--r-- 3 root supergroup 3776 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
-rw-r--r-- 3 root supergroup 3808 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-8430489177598754103-1-70302424-0f36-41c1-9217-3da72ccc56dd.avro
-rw-r--r-- 3 root supergroup 1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r-- 3 root supergroup 2218 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
-rw-r--r-- 3 root supergroup 3295 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
-rw-r--r-- 3 root supergroup 4513 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
-rw-r--r-- 3 root supergroup 1 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
查看 version-hint.text,此时已经变成了 4
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
4
```
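Hadoop catalog 就是靠这个 version-hint.text 定位当前元数据文件的:读出版本号 N,再拼出 vN.metadata.json 的路径。下面用一小段 Python 脚本示意这个解析逻辑(仅为示意,真实实现在 Iceberg 的 HadoopTableOperations 中):

```python
import os

# 示意:模拟 Hadoop catalog 通过 version-hint.text 定位当前元数据文件
def current_metadata_path(table_location: str, hint_content: str) -> str:
    # hint_content 即 version-hint.text 的内容,这里用上文看到的 "4"
    version = hint_content.strip()
    return os.path.join(table_location, "metadata", f"v{version}.metadata.json")

path = current_metadata_path("/warehouse/hadoop/iceberg_db/iceberg_table", "4")
print(path)  # /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
```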
查看 v4.metadata.json
```bash
hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
{ "format-version" : 1, "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3", "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table", "last-updated-ms" : 1650790279485, "last-column-id" : 2, "schema" : { "type" : "struct", "schema-id" : 0, "fields" : [ { "id" : 1, "name" : "id", "required" : false, "type" : "long" }, { "id" : 2, "name" : "data", "required" : false, "type" : "string" } ] }, "current-schema-id" : 0, "schemas" : [ { "type" : "struct", "schema-id" : 0, "fields" : [ { "id" : 1, "name" : "id", "required" : false, "type" : "long" }, { "id" : 2, "name" : "data", "required" : false, "type" : "string" } ] } ], "partition-spec" : [ ], "default-spec-id" : 0, "partition-specs" : [ { "spec-id" : 0, "fields" : [ ] } ], "last-partition-id" : 999, "default-sort-order-id" : 0, "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ], "properties" : { }, "current-snapshot-id" : 8430489177598754103, "snapshots" : [ { "snapshot-id" : 6040080682987879495, "timestamp-ms" : 1650787052068, "summary" : { "operation" : "append", "flink.job-id" : "9546c176d5418b18ee19c7fc6905152e", "flink.max-committed-checkpoint-id" : "9223372036854775807", "added-data-files" : "1", "added-records" : "1", "added-files-size" : "658", "changed-partition-count" : "1", "total-records" : "1", "total-files-size" : "658", "total-data-files" : "1", "total-delete-files" : "0", "total-position-deletes" : "0", "total-equality-deletes" : "0" }, "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro", "schema-id" : 0 }, { "snapshot-id" : 2331249842714188418, "parent-snapshot-id" : 6040080682987879495, "timestamp-ms" : 1650788733525, "summary" : { "operation" : "append", "flink.job-id" : "f6d1cefdf0f8bea368d80eb9081f6649", "flink.max-committed-checkpoint-id" : "9223372036854775807", "added-data-files" : "1", "added-records" : "1", "added-files-size" : "657", "changed-partition-count" : "1", "total-records" : "2", "total-files-size" : "1315", "total-data-files" : "2", "total-delete-files" : "0", "total-position-deletes" : "0", "total-equality-deletes" : "0" }, "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro", "schema-id" : 0 }, { "snapshot-id" : 8430489177598754103, "parent-snapshot-id" : 2331249842714188418, "timestamp-ms" : 1650790279485, "summary" : { "operation" : "overwrite", "replace-partitions" : "true", "flink.job-id" : "27db50de8bf7d70905450e54300b3a41", "flink.max-committed-checkpoint-id" : "9223372036854775807", "added-data-files" : "1", "deleted-data-files" : "2", "added-records" : "1", "deleted-records" : "2", "added-files-size" : "658", "removed-files-size" : "1315", "changed-partition-count" : "1", "total-records" : "1", "total-files-size" : "658", "total-data-files" : "1", "total-delete-files" : "0", "total-position-deletes" : "0", "total-equality-deletes" : "0" }, "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-8430489177598754103-1-70302424-0f36-41c1-9217-3da72ccc56dd.avro", "schema-id" : 0 } ], "snapshot-log" : [ { "timestamp-ms" : 1650787052068, "snapshot-id" : 6040080682987879495 }, { "timestamp-ms" : 1650788733525, "snapshot-id" : 2331249842714188418 }, { "timestamp-ms" : 1650790279485, "snapshot-id" : 8430489177598754103 } ], "metadata-log" : [ { "timestamp-ms" : 1650786845358, "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json" }, { "timestamp-ms" : 1650787052068, "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json" }, { "timestamp-ms" : 1650788733525, "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json" } ] }
```
当前快照先删除了两个数据文件,然后创建了一个数据文件并增加了一条数据,当前快照操作类型为 overwrite。
```bash
"operation" : "overwrite",
"added-data-files" : "1",
"deleted-data-files" : "2",
"added-records" : "1",
```
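快照 summary 里的累计计数是可以相互印证的:本次 overwrite 之后的 total-records 等于上一个快照的 total-records 加上本次新增、减去本次删除的记录数。下面用一小段 Python 脚本示意这个验证(仅为示意,数值取自上面 v4.metadata.json 的 summary 字段):

```python
# 示意:用快照 summary 里的计数验证 overwrite 的净效果
prev_total_records = 2   # 上一个快照(2331...418)的 "total-records"
added_records = 1        # 本次 overwrite 的 "added-records"
deleted_records = 2      # 本次 overwrite 的 "deleted-records"

total_records = prev_total_records + added_records - deleted_records
print(total_records)  # 1,与当前快照的 "total-records" : "1" 一致
```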
- 查看快照文件
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-8430489177598754103-1-70302424-0f36-41c1-9217-3da72ccc56dd.avro
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":8430489177598754103},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":8430489177598754103},"added_data_files_count":{"int":0},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":1},"partitions":{"array":[]},"added_rows_count":{"long":0},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":1}}
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":8430489177598754103},"added_data_files_count":{"int":0},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":1},"partitions":{"array":[]},"added_rows_count":{"long":0},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":1}}
```
可以看到快照文件有三行数据:第一行对应本次新增的数据文件,后两行分别对应删除前两次插入生成的数据文件
下面是第一行数据 json 格式化后的结果,对应的清单文件 70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro
```bash
{
"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro",
"manifest_length":5793,
"partition_spec_id":0,
"added_snapshot_id":{
"long":8430489177598754103
},
"added_data_files_count":{
"int":1
},
"existing_data_files_count":{
"int":0
},
"deleted_data_files_count":{
"int":0
},
"partitions":{
"array":[
]
},
"added_rows_count":{
"long":1
},
"existing_rows_count":{
"long":0
},
"deleted_rows_count":{
"long":0
}
}
```
下面是第二行数据 json 格式化后的结果,对应的清单文件 70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro
```bash
{
"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro",
"manifest_length":5793,
"partition_spec_id":0,
"added_snapshot_id":{
"long":8430489177598754103
},
"added_data_files_count":{
"int":0
},
"existing_data_files_count":{
"int":0
},
"deleted_data_files_count":{
"int":1
},
"partitions":{
"array":[
]
},
"added_rows_count":{
"long":0
},
"existing_rows_count":{
"long":0
},
"deleted_rows_count":{
"long":1
}
}
```
下面是第三行数据 json 格式化后的结果,对应的清单文件 70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro
```bash
{
"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro",
"manifest_length":5793,
"partition_spec_id":0,
"added_snapshot_id":{
"long":8430489177598754103
},
"added_data_files_count":{
"int":0
},
"existing_data_files_count":{
"int":0
},
"deleted_data_files_count":{
"int":1
},
"partitions":{
"array":[
]
},
"added_rows_count":{
"long":0
},
"existing_rows_count":{
"long":0
},
"deleted_rows_count":{
"long":1
}
}
```
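上面三个清单列表项合起来就是这次 overwrite 的净效果:m2 新增 1 行,m1 和 m0 各删除 1 行。下面用一小段 Python 脚本示意这个汇总(仅为示意,数值取自上面三个清单列表项):

```python
# 示意:汇总 overwrite 快照的三个清单列表项,得到表行数的净变化
manifests = [
    {"name": "m2", "added_rows": 1, "deleted_rows": 0},  # 新增 (1, 'a')
    {"name": "m1", "added_rows": 0, "deleted_rows": 1},  # 删除第二次插入的数据文件
    {"name": "m0", "added_rows": 0, "deleted_rows": 1},  # 删除第一次插入的数据文件
]

net_rows = sum(m["added_rows"] - m["deleted_rows"] for m in manifests)
print(net_rows)  # -1,表从 2 行变为 1 行
```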
- 查看清单文件 70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro
{"status":2,"snapshot_id":{"long":8430489177598754103},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":658,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":52},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
```
下面是 json 格式化后的结果,对应的数据文件 00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet(第一次插入生成的数据文件),status 设置成了 2(0: EXISTING 1: ADDED 2: DELETED)
```bash
{
"status":2,
"snapshot_id":{
"long":8430489177598754103
},
"data_file":{
"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet",
"file_format":"PARQUET",
"partition":{
},
"record_count":1,
"file_size_in_bytes":658,
"block_size_in_bytes":67108864,
"column_sizes":{
"array":[
{
"key":1,
"value":52
},
{
"key":2,
"value":52
}
]
},
"value_counts":{
"array":[
{
"key":1,
"value":1
},
{
"key":2,
"value":1
}
]
},
"null_value_counts":{
"array":[
{
"key":1,
"value":0
},
{
"key":2,
"value":0
}
]
},
"nan_value_counts":{
"array":[
]
},
"lower_bounds":{
"array":[
{
"key":1,
"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"a"
}
]
},
"upper_bounds":{
"array":[
{
"key":1,
"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"a"
}
]
},
"key_metadata":null,
"split_offsets":{
"array":[
4
]
},
"sort_order_id":{
"int":0
}
}
}
```
- 查看清单文件 70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro
{"status":2,"snapshot_id":{"long":8430489177598754103},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":657,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":51},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
```
下面是 json 格式化后的结果,对应的数据文件 00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet(第二次插入生成的数据文件),status 设置成了 2(0: EXISTING 1: ADDED 2: DELETED)
```bash
{
"status":2,
"snapshot_id":{
"long":8430489177598754103
},
"data_file":{
"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet",
"file_format":"PARQUET",
"partition":{
},
"record_count":1,
"file_size_in_bytes":657,
"block_size_in_bytes":67108864,
"column_sizes":{
"array":[
{
"key":1,
"value":51
},
{
"key":2,
"value":52
}
]
},
"value_counts":{
"array":[
{
"key":1,
"value":1
},
{
"key":2,
"value":1
}
]
},
"null_value_counts":{
"array":[
{
"key":1,
"value":0
},
{
"key":2,
"value":0
}
]
},
"nan_value_counts":{
"array":[
]
},
"lower_bounds":{
"array":[
{
"key":1,
"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"b"
}
]
},
"upper_bounds":{
"array":[
{
"key":1,
"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"b"
}
]
},
"key_metadata":null,
"split_offsets":{
"array":[
4
]
},
"sort_order_id":{
"int":0
}
}
}
```
- 查看清单文件 70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro
```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro
{"status":1,"snapshot_id":{"long":8430489177598754103},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":658,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":52},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
```
下面是 json 格式化后的结果,对应的数据文件 00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet,status 设置成了 1(0: EXISTING 1: ADDED 2: DELETED)
```bash
{
"status":1,
"snapshot_id":{
"long":8430489177598754103
},
"data_file":{
"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet",
"file_format":"PARQUET",
"partition":{
},
"record_count":1,
"file_size_in_bytes":658,
"block_size_in_bytes":67108864,
"column_sizes":{
"array":[
{
"key":1,
"value":52
},
{
"key":2,
"value":52
}
]
},
"value_counts":{
"array":[
{
"key":1,
"value":1
},
{
"key":2,
"value":1
}
]
},
"null_value_counts":{
"array":[
{
"key":1,
"value":0
},
{
"key":2,
"value":0
}
]
},
"nan_value_counts":{
"array":[
]
},
"lower_bounds":{
"array":[
{
"key":1,
"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"a"
}
]
},
"upper_bounds":{
"array":[
{
"key":1,
"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
},
{
"key":2,
"value":"a"
}
]
},
"key_metadata":null,
"split_offsets":{
"array":[
4
]
},
"sort_order_id":{
"int":0
}
}
}
```
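把三个清单文件的条目放在一起看,就能按 status 重建当前快照可见的数据文件集合:状态为 DELETED(2)的文件被过滤掉,只剩本次 overwrite 写入的文件。下面用一小段 Python 脚本示意这个过滤过程(仅为示意,条目取自上文三个清单文件的输出,路径做了省略):

```python
# 示意:按清单条目的 status 重建当前快照可见的数据文件集合
# status 含义取自上文:0 EXISTING, 1 ADDED, 2 DELETED
DELETED = 2
entries = [
    {"status": 2, "file_path": "00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet"},  # m0
    {"status": 2, "file_path": "00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet"},  # m1
    {"status": 1, "file_path": "00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet"},  # m2
]

# 过滤掉被删除的文件,剩下的就是当前快照要扫描的数据文件
live_files = [e["file_path"] for e in entries if e["status"] != DELETED]
print(live_files)  # 只剩本次 overwrite 写入的一个数据文件
```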