:::tips Note ⚠️
This guide uses the Hadoop Catalog as the example.
The hostname is set to datalake. :::

Install Hadoop

Flink on YARN Deployment Guide

# Install and Start Flink

  • Download Flink 1.14.4 (the Scala 2.12 build) and extract it into the flink-1.14.4 directory

    ```bash
    wget https://dlcdn.apache.org/flink/flink-1.14.4/flink-1.14.4-bin-scala_2.12.tgz
    tar -xvf flink-1.14.4-bin-scala_2.12.tgz
    ```
  • Open the profile file

    ```bash
    vi /etc/profile
    ```
  • Add the HADOOP_CLASSPATH environment variable

    ```bash
    # HADOOP_HOME is your hadoop root directory after unpacking the binary package.
    export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
    ```
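  • A quick sanity check (a sketch, assuming a standard Hadoop layout): reload the profile and confirm the classpath resolves

    ```bash
    # Reload the profile and print the first few classpath entries
    source /etc/profile
    echo $HADOOP_CLASSPATH | tr ':' '\n' | head -n 5
    ```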
  • Start the Flink cluster

    ```bash
    cd flink-1.14.4
    bin/start-cluster.sh
    ```

# Start the Flink SQL Client

## Using the Hadoop catalog (supported by default)

  • Download the iceberg-flink-runtime dependency (into the flink-1.14.4 directory)

    ```bash
    wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.13.1/iceberg-flink-runtime-1.14-0.13.1.jar
    ```
  • Load iceberg-flink-runtime when starting the Flink SQL Client

    ```bash
    ./bin/sql-client.sh -j iceberg-flink-runtime-1.14-0.13.1.jar
    ```

## Using the Hive catalog

  • To use the Hive catalog, we also need to download the flink-sql-connector-hive dependency (into the flink-1.14.4 directory)

    ```bash
    wget https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.13.1/iceberg-flink-runtime-1.14-0.13.1.jar
    wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.6_2.12/1.14.4/flink-sql-connector-hive-2.3.6_2.12-1.14.4.jar
    ```
  • Load both flink-sql-connector-hive and iceberg-flink-runtime when starting the Flink SQL Client

    ```bash
    ./bin/sql-client.sh -j flink-sql-connector-hive-2.3.6_2.12-1.14.4.jar -j iceberg-flink-runtime-1.14-0.13.1.jar
    ```

:::info Note:
You can also download the Iceberg dependencies into Flink's lib directory and restart Flink; the Flink SQL Client can then be started without loading them via -j. :::
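A minimal sketch of that alternative, assuming the jars were downloaded into the current directory:

```bash
# Copy the dependency into Flink's lib directory (add the Hive connector jar too if needed)
cp iceberg-flink-runtime-1.14-0.13.1.jar flink-1.14.4/lib/
# Restart the cluster so the new jars are picked up
flink-1.14.4/bin/stop-cluster.sh && flink-1.14.4/bin/start-cluster.sh
# The client now starts without -j
flink-1.14.4/bin/sql-client.sh
```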

# Create and Use a Catalog

## Create a Hive catalog

Uses the Hive metastore to store metadata and HDFS to store the table data.

```sql
CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://datalake:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://datalake:9000/warehouse/hive'
);
```

:::info

  • type: must be iceberg (required).
  • catalog-type: hive or hadoop (default), or leave it unset and use catalog-impl for a custom catalog.
  • uri: the Hive metastore address.
  • clients: the Hive metastore connection pool size for clients of hive_catalog; the default is 2.
  • property-version: a version number describing the property format, usable for backward compatibility if the property format changes. The current property version is 1.
  • warehouse: an HDFS path reachable from the Flink cluster where databases and tables under hive_catalog store their data; takes precedence over hive-conf-dir.
  • cache-enabled: whether to enable catalog caching; the default is true.
  • hive-conf-dir: the configuration directory of the Hive cluster (must be a local path). The HDFS warehouse path parsed from its hive-site.xml is used. :::
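For example, a variant that resolves the metastore URI and warehouse path from a local hive-site.xml instead of spelling them out (a sketch; /etc/hive/conf is an assumed path):

```sql
CREATE CATALOG hive_catalog_conf WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  -- uri and warehouse are read from hive-site.xml in this (assumed) directory
  'hive-conf-dir'='/etc/hive/conf'
);
```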

## Create a Hadoop catalog

Uses HDFS to store both the metadata and the table data.

```sql
CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://datalake:9000/warehouse/hadoop',
  'property-version'='1'
);
```

:::info

  • type: must be iceberg (required).
  • catalog-type: hive or hadoop (default), or leave it unset and use catalog-impl for a custom catalog.
  • warehouse: an HDFS path reachable from the Flink cluster where databases and tables under hadoop_catalog store their metadata and data.
  • property-version: a version number describing the property format, usable for backward compatibility if the property format changes. The current property version is 1.
  • cache-enabled: whether to enable catalog caching; the default is true. :::

  • Creating the catalog creates a hadoop subdirectory in HDFS, along with a hadoop/default subdirectory (every catalog has a default database)

    ```bash
    hdfs dfs -ls -R /warehouse
    drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop
    drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop/default
    ```

:::info In theory, a catalog can be created at Flink SQL Client startup by configuring conf/sql-cli-defaults.yaml, but this did not take effect in my tests.

catalogs:
  - name: hadoop_catalog
    type: iceberg
    catalog-type: hadoop
    property-version: 1
    cache-enabled: true
    warehouse: hdfs://datalake:9000/warehouse/hadoop :::
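An alternative that does work on Flink 1.13+ is an initialization SQL file passed with -i (a sketch; init.sql is an assumed file name):

```sql
-- init.sql, executed at startup via: ./bin/sql-client.sh -i init.sql
CREATE CATALOG hadoop_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hadoop',
  'warehouse'='hdfs://datalake:9000/warehouse/hadoop',
  'property-version'='1'
);
USE CATALOG hadoop_catalog;
```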

## Use a catalog

```sql
USE CATALOG hadoop_catalog;
```

# DDL Commands

## Databases

### List databases

Every catalog has a default database.

```sql
SHOW DATABASES;
+---------------+
| database name |
+---------------+
|       default |
+---------------+
1 row in set
```

### Create a database

  • If we don't want to create tables under the default database, we can create a separate one

    ```sql
    CREATE DATABASE iceberg_db;

    SHOW DATABASES;
    +---------------+
    | database name |
    +---------------+
    |       default |
    |    iceberg_db |
    +---------------+
    2 rows in set
    ```
  • Creating the database creates an iceberg_db subdirectory in HDFS

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop
    drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop/default
    drwxr-xr-x - root supergroup 0 2022-04-22 16:58 /warehouse/hadoop/iceberg_db
    ```

### Drop a database

  • Drop the database

    ```sql
    DROP DATABASE iceberg_db;

    SHOW DATABASES;
    +---------------+
    | database name |
    +---------------+
    |       default |
    +---------------+
    1 row in set
    ```
  • Dropping the database removes the iceberg_db subdirectory from HDFS

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop
    drwxr-xr-x - root supergroup 0 2022-04-22 16:34 /warehouse/hadoop/default
    ```

### Use a database

```sql
USE CATALOG hadoop_catalog;
USE iceberg_db;
```

or

```sql
USE hadoop_catalog.iceberg_db;
```

## Create a table

  • Create a table

    ```sql
    USE hadoop_catalog.iceberg_db;

    CREATE TABLE iceberg_table (
      id BIGINT COMMENT 'unique id',
      data STRING
    );
    ```

    or

    ```sql
    CREATE TABLE hadoop_catalog.iceberg_db.iceberg_table (
      id BIGINT COMMENT 'unique id',
      data STRING
    );
    ```

:::info Features not yet supported on tables

  • Hidden partitioning
  • Computed columns
  • Watermarks
  • Adding, dropping, renaming, or changing columns (tracked by FLINK-19062) :::

:::info The format can be specified via format-version
Version 1: append table. All tested field types can be inserted, but querying TINYINT and SMALLINT columns fails; a primary key cannot be set, so no upsert.
Version 2: upsert table. With upsert enabled, TINYINT and SMALLINT cannot be inserted, and the primary key must include the partition columns, because the delete writer has to know which partition it is writing to; rows are then updated by key. :::
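A sketch of a format v2 upsert table under those constraints (the table name and columns are illustrative, and it assumes this Iceberg version supports the write.upsert.enabled table property):

```sql
CREATE TABLE iceberg_table_v2 (
  id BIGINT,
  category STRING,
  data STRING,
  -- the primary key must include the partition column (category)
  PRIMARY KEY (category, id) NOT ENFORCED
) PARTITIONED BY (category) WITH (
  'format-version'='2',
  'write.upsert.enabled'='true'
);
```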

  • Creating the table creates an iceberg_table subdirectory in HDFS

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db
    drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table
    drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
    -rw-r--r-- 3 root supergroup 1176 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    -rw-r--r-- 3 root supergroup 1 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    ```

## Create a partitioned table

```sql
CREATE TABLE hadoop_catalog.iceberg_db.iceberg_table_pt (
  id BIGINT COMMENT 'unique id',
  data STRING,
  category STRING
) PARTITIONED BY (category);
```

:::info Iceberg supports hidden partitioning, but Flink does not support partitioning by a function of a column, so hidden partitions cannot be expressed in Flink DDL for now. :::
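A common workaround (a sketch, not an Iceberg feature; the ts/dt columns and some_source table are illustrative) is to materialize the transform as an explicit column and partition on that:

```sql
-- Partition on an explicit derived column instead of a hidden transform
CREATE TABLE iceberg_table_days (
  id BIGINT,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt);

-- Compute the would-be hidden partition value on write
INSERT INTO iceberg_table_days
SELECT id, ts, DATE_FORMAT(ts, 'yyyy-MM-dd') FROM some_source;
```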

## Copy a table

```sql
USE hadoop_catalog.iceberg_db;
CREATE TABLE iceberg_table_copy LIKE iceberg_table;
```

The copied table has the same schema, partitioning, and table properties.

## Alter a table

:::info Note ⚠️
Only changing Iceberg table properties and renaming tables are currently supported. :::

  • Change table properties

    ```sql
    USE hadoop_catalog.iceberg_db;
    ALTER TABLE iceberg_table_copy SET ('write.format.default'='avro');

    SHOW CREATE TABLE iceberg_table_copy;
    CREATE TABLE hadoop_catalog.iceberg_db.iceberg_table_copy (
      id BIGINT,
      data VARCHAR(2147483647)
    ) WITH (
      'write.format.default' = 'avro'
    )
    ```
  • Rename the table (tables in a Hadoop Catalog cannot be renamed)

    ```sql
    USE hadoop_catalog.iceberg_db;
    ALTER TABLE iceberg_table_copy RENAME TO iceberg_table_copy_new;
    [ERROR] Could not execute SQL statement. Reason:
    java.lang.UnsupportedOperationException: Cannot rename Hadoop tables
    ```

## Drop a table

  • Drop the table

    ```sql
    USE hadoop_catalog.iceberg_db;
    DROP TABLE iceberg_table_copy;
    ```
  • Dropping the table removes the iceberg_table_copy subdirectory from HDFS

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db
    drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table
    drwxr-xr-x - root supergroup 0 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
    -rw-r--r-- 3 root supergroup 1176 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    -rw-r--r-- 3 root supergroup 1 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    ```

# Writes

## INSERT INTO

  • Write data

    ```sql
    USE hadoop_catalog.iceberg_db;
    INSERT INTO iceberg_table(id, data) VALUES (1, 'a');
    INSERT INTO iceberg_table(id, data) VALUES (2, 'b');

    INSERT INTO iceberg_table SELECT (id + 2), data FROM iceberg_table;
    ```
  • Inspect the files: each SQL statement generated a metadata.json and a parquet file

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
    drwxr-xr-x - root supergroup 0 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/data
    -rw-r--r-- 3 root supergroup 658 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-81dd1531-9c34-4108-b97d-0a0dedd2270f-00001.parquet
    -rw-r--r-- 3 root supergroup 658 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-d92a76a2-cbf9-4b20-89af-87706ede571e-00001.parquet
    -rw-r--r-- 3 root supergroup 657 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-e507ea82-15a5-44d5-a5c0-2f475c877599-00001.parquet
    drwxr-xr-x - root supergroup 0 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
    -rw-r--r-- 3 root supergroup 5798 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/21564066-ba86-46a2-944c-48a20171ddf4-m0.avro
    -rw-r--r-- 3 root supergroup 5793 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/27a5e943-70f1-4c10-b52f-f75f9772c9db-m0.avro
    -rw-r--r-- 3 root supergroup 5793 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/3ad437f2-0db8-4096-8419-6e9845253dc8-m0.avro
    -rw-r--r-- 3 root supergroup 3890 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-1586352819259469942-1-21564066-ba86-46a2-944c-48a20171ddf4.avro
    -rw-r--r-- 3 root supergroup 3842 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-4988714748194111917-1-3ad437f2-0db8-4096-8419-6e9845253dc8.avro
    -rw-r--r-- 3 root supergroup 3776 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6065247658548389325-1-27a5e943-70f1-4c10-b52f-f75f9772c9db.avro
    -rw-r--r-- 3 root supergroup 1176 2022-04-22 17:00 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    -rw-r--r-- 3 root supergroup 2218 2022-04-22 17:24 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
    -rw-r--r-- 3 root supergroup 3295 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
    -rw-r--r-- 3 root supergroup 4372 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
    -rw-r--r-- 3 root supergroup 1 2022-04-22 17:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    ```

## INSERT OVERWRITE

:::info Only supported in batch mode. The overwrite granularity is the partition: entire partitions are replaced, not individual rows. For an unpartitioned table, the entire table is replaced. :::

### Unpartitioned table

  • iceberg_table is not partitioned, so the whole table is replaced

    ```sql
    SET 'sql-client.execution.result-mode' = 'tableau';
    SET 'execution.runtime-mode' = 'batch';

    INSERT OVERWRITE iceberg_table VALUES (1, 'a');

    SELECT * FROM iceberg_table;
    +----+------+
    | id | data |
    +----+------+
    |  1 |    a |
    +----+------+
    1 row in set
    ```
  • Inspect the files: a new v5.metadata.json was added

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
    -rw-r--r-- 3 root supergroup 5590 2022-04-22 17:43 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v5.metadata.json
    ```

### Partitioned table

  • Insert data

    ```sql
    USE hadoop_catalog.iceberg_db;

    INSERT INTO iceberg_table_pt(id, data, category) VALUES (1, 'a', 'C1'), (2, 'b', 'C1'), (3, 'c', 'C2'), (4, 'd', 'C2');

    SELECT * FROM iceberg_table_pt;
    +----+------+----------+
    | id | data | category |
    +----+------+----------+
    |  1 |    a |       C1 |
    |  2 |    b |       C1 |
    |  3 |    c |       C2 |
    |  4 |    d |       C2 |
    +----+------+----------+
    4 rows in set
    ```
  • Data for different partitions is stored under the corresponding partition directories (category=C1 and category=C2)

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table_pt
    drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data
    drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C1
    -rw-r--r-- 3 root supergroup 955 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C1/00000-0-b14f8238-33ad-48e7-a98c-809229415adf-00001.parquet
    drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C2
    -rw-r--r-- 3 root supergroup 955 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/data/category=C2/00000-0-b14f8238-33ad-48e7-a98c-809229415adf-00002.parquet
    drwxr-xr-x - root supergroup 0 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata
    -rw-r--r-- 3 root supergroup 6127 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/8820a6f2-bd06-4765-9141-549a8387a621-m0.avro
    -rw-r--r-- 3 root supergroup 3789 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/snap-8809797762112100800-1-8820a6f2-bd06-4765-9141-549a8387a621.avro
    -rw-r--r-- 3 root supergroup 1602 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/v1.metadata.json
    -rw-r--r-- 3 root supergroup 2652 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/v2.metadata.json
    -rw-r--r-- 3 root supergroup 1 2022-04-24 15:21 /warehouse/hadoop/iceberg_db/iceberg_table_pt/metadata/version-hint.text
    ```
  • The partitions corresponding to all inserted rows are replaced

    ```sql
    INSERT OVERWRITE iceberg_table_pt VALUES (11, 'aa', 'C1'), (44, 'dd', 'C2');

    SELECT * FROM iceberg_table_pt;
    +----+------+----------+
    | id | data | category |
    +----+------+----------+
    | 11 |   aa |       C1 |
    | 44 |   dd |       C2 |
    +----+------+----------+
    2 rows in set
    ```
  • Overwrite a given partition via the PARTITION clause

    ```sql
    INSERT OVERWRITE iceberg_table_pt PARTITION(category='C1') SELECT 111, 'aaa';
    SELECT * FROM iceberg_table_pt;
    +-----+------+----------+
    |  id | data | category |
    +-----+------+----------+
    |  44 |   dd |       C2 |
    | 111 |  aaa |       C1 |
    +-----+------+----------+
    2 rows in set
    ```

:::info For a partitioned table, when all partition columns are given values in the PARTITION clause, the insert goes into a static partition; when only some partition columns (a prefix of all partition columns) are given values, the query result is written into dynamic partitions. :::
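For contrast, a fully dynamic overwrite of the same table (a sketch): with no PARTITION clause, the partitions to replace are derived from the query result itself.

```sql
-- No PARTITION clause: only the partitions that appear in the result (here C2) are replaced
INSERT OVERWRITE iceberg_table_pt SELECT 444, 'ddd', 'C2';
```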

# Queries

## Batch read

```sql
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM iceberg_table;
```
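Batch reads can also time-travel through the same hint mechanism used below for streaming reads (a sketch, assuming the snapshot-id read option is supported in this Iceberg version; the snapshot id is illustrative):

```sql
-- Requires SET table.dynamic-table-options.enabled=true;
SELECT * FROM iceberg_table /*+ OPTIONS('snapshot-id'='4024674471005358751') */;
```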

## Streaming read

### Full snapshot plus incremental read

  • Enable streaming reads

    ```sql
    SET 'execution.runtime-mode' = 'streaming';

    -- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options.
    SET table.dynamic-table-options.enabled=true;
    ```
  • Read all records from the current snapshot, then read incremental data starting from that snapshot (full data + incremental data)

    ```sql
    SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ;
    +----+----------------------+--------------------------------+
    | op |                   id |                           data |
    +----+----------------------+--------------------------------+
    | +I |                    1 |                              a |
    ```

    Because iceberg_table was overwritten earlier, only one row is visible.

  • In another window, start a Flink SQL Client and insert 3 new rows

    ```sql
    INSERT INTO iceberg_table(id, data) VALUES (5, 'e');
    INSERT INTO iceberg_table(id, data) VALUES (6, 'f');
    INSERT INTO iceberg_table(id, data) VALUES (7, 'g');

    SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ;
    +----+----------------------+--------------------------------+
    | op |                   id |                           data |
    +----+----------------------+--------------------------------+
    | +I |                    1 |                              a |
    | +I |                    6 |                              f |
    | +I |                    5 |                              e |
    | +I |                    7 |                              g |
    ```

    The newly inserted rows show up in the earlier running query.
  • Inspect the files: 3 new metadata.json files were added

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
    -rw-r--r-- 3 root supergroup 6667 2022-04-24 14:05 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v6.metadata.json
    -rw-r--r-- 3 root supergroup 7744 2022-04-24 14:05 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v7.metadata.json
    -rw-r--r-- 3 root supergroup 8821 2022-04-24 14:10 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v8.metadata.json
    ```


### Incremental read

  • Enable streaming reads

    ```sql
    SET 'execution.runtime-mode' = 'streaming';

    -- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options.
    SET table.dynamic-table-options.enabled=true;
    ```
  • Look up the snapshot-id of the overwrite in v5.metadata.json

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v5.metadata.json | grep current-snapshot-id
    "current-snapshot-id" : 4024674471005358751,
    ```
  • Read all incremental data starting from snapshot '4024674471005358751'; the records of that snapshot itself are not read (incremental data only)

    ```sql
    SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='4024674471005358751')*/ ;
    +----+----------------------+--------------------------------+
    | op |                   id |                           data |
    +----+----------------------+--------------------------------+
    | +I |                    6 |                              f |
    | +I |                    7 |                              g |
    | +I |                    5 |                              e |
    ```

    Only the 3 rows inserted after the overwrite are visible here.

  • Only the snapshot id of the last insert overwrite operation, or a later one, can be specified; otherwise the job throws an exception in the background and stays in a restarting state

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json | grep current-snapshot-id
    "current-snapshot-id" : 1586352819259469942,
    ```

    ```sql
    SELECT * FROM iceberg_table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='1586352819259469942')*/ ;

    +----+----------------------+--------------------------------+
    | op |                   id |                           data |
    +----+----------------------+--------------------------------+
    [ERROR] Could not execute SQL statement. Reason:
    java.lang.UnsupportedOperationException: Found overwrite operation, cannot support incremental data in snapshots (1586352819259469942, 4024674471005358751]
    ```

:::info
Streaming reads currently only pick up data from append operations; replace, overwrite, and delete operations are not supported. :::

# Iceberg Storage Layout

Common Iceberg terms:

  • Snapshot – the state of a table at a point in time, including the set of all data files; each snapshot references one manifest list.
  • Manifest list – an Avro file that lists manifest files, one manifest file per row.
  • Manifest file – an Avro file that lists the data files making up a snapshot, one data file per row.
  • Data file – a file that actually stores the data, normally under the table's data directory.
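How these fit together on HDFS, as a schematic derived from the directory listings later in this section (file names shortened):

```
iceberg_table/
├── metadata/
│   ├── version-hint.text          # current metadata version (Hadoop catalog only)
│   ├── v1.metadata.json ...       # table metadata, one file per version
│   ├── snap-<snapshot-id>-*.avro  # manifest lists, one per snapshot
│   └── <uuid>-m0.avro ...         # manifest files
└── data/
    └── *.parquet                  # data files
```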
## Create a table

  • Drop the tables created earlier

    ```sql
    DROP TABLE iceberg_table;
    DROP TABLE iceberg_table_pt;
    ```
  • Create the table

    ```sql
    USE hadoop_catalog.iceberg_db;

    CREATE TABLE iceberg_table (
      id BIGINT COMMENT 'unique id',
      data STRING
    );
    ```
  • Inspect the HDFS directory: two new files, v1.metadata.json and version-hint.text

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/
    drwxr-xr-x - root supergroup 0 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table
    drwxr-xr-x - root supergroup 0 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
    -rw-r--r-- 3 root supergroup 1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    -rw-r--r-- 3 root supergroup 1 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    ```
  • Look at version-hint.text, which stores the current version number

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    1
    ```
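This hint file is how a reader of a Hadoop catalog table finds the current metadata file; a sketch of that resolution done by hand:

```bash
# Resolve the current metadata file from the version hint (Hadoop catalog convention)
TABLE=/warehouse/hadoop/iceberg_db/iceberg_table
V=$(hdfs dfs -cat $TABLE/metadata/version-hint.text)
hdfs dfs -cat $TABLE/metadata/v${V}.metadata.json
```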
  • Look at v1.metadata.json

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    {
      "format-version" : 1,
      "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
      "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
      "last-updated-ms" : 1650786845358,
      "last-column-id" : 2,
      "schema" : {
        "type" : "struct",
        "schema-id" : 0,
        "fields" : [ {
          "id" : 1,
          "name" : "id",
          "required" : false,
          "type" : "long"
        }, {
          "id" : 2,
          "name" : "data",
          "required" : false,
          "type" : "string"
        } ]
      },
      "current-schema-id" : 0,
      "schemas" : [ {
        "type" : "struct",
        "schema-id" : 0,
        "fields" : [ {
          "id" : 1,
          "name" : "id",
          "required" : false,
          "type" : "long"
        }, {
          "id" : 2,
          "name" : "data",
          "required" : false,
          "type" : "string"
        } ]
      } ],
      "partition-spec" : [ ],
      "default-spec-id" : 0,
      "partition-specs" : [ {
        "spec-id" : 0,
        "fields" : [ ]
      } ],
      "last-partition-id" : 999,
      "default-sort-order-id" : 0,
      "sort-orders" : [ {
        "order-id" : 0,
        "fields" : [ ]
      } ],
      "properties" : { },
      "current-snapshot-id" : -1,
      "snapshots" : [ ],
      "snapshot-log" : [ ],
      "metadata-log" : [ ]
    }
    ```

    Note "current-snapshot-id" : -1, which means the table was just created. There is not much useful information here yet; the main thing to look at is the schema.

## Insert one row

  • Write data

    ```sql
    INSERT INTO iceberg_table(id, data) VALUES (1, 'a');
    ```
  • Inspect the HDFS directory: a new v2.metadata.json, a manifest list (snap-*.avro), a manifest file (*-m0.avro), and a data file (*.parquet) were added

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
    drwxr-xr-x - root supergroup 0 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data
    -rw-r--r-- 3 root supergroup 658 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet
    drwxr-xr-x - root supergroup 0 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
    -rw-r--r-- 3 root supergroup 5793 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
    -rw-r--r-- 3 root supergroup 3776 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
    -rw-r--r-- 3 root supergroup 1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    -rw-r--r-- 3 root supergroup 2218 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
    -rw-r--r-- 3 root supergroup 1 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    ```
  • Look at version-hint.text, which is now 2

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    2
    ```
    
  • Look at v2.metadata.json

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
    {
    "format-version" : 1,
    "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
    "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
    "last-updated-ms" : 1650787052068,
    "last-column-id" : 2,
    "schema" : {
      "type" : "struct",
      "schema-id" : 0,
      "fields" : [ {
        "id" : 1,
        "name" : "id",
        "required" : false,
        "type" : "long"
      }, {
        "id" : 2,
        "name" : "data",
        "required" : false,
        "type" : "string"
      } ]
    },
    "current-schema-id" : 0,
    "schemas" : [ {
      "type" : "struct",
      "schema-id" : 0,
      "fields" : [ {
        "id" : 1,
        "name" : "id",
        "required" : false,
        "type" : "long"
      }, {
        "id" : 2,
        "name" : "data",
        "required" : false,
        "type" : "string"
      } ]
    } ],
    "partition-spec" : [ ],
    "default-spec-id" : 0,
    "partition-specs" : [ {
      "spec-id" : 0,
      "fields" : [ ]
    } ],
    "last-partition-id" : 999,
    "default-sort-order-id" : 0,
    "sort-orders" : [ {
      "order-id" : 0,
      "fields" : [ ]
    } ],
    "properties" : { },
    "current-snapshot-id" : 6040080682987879495,
    "snapshots" : [ {
      "snapshot-id" : 6040080682987879495,
      "timestamp-ms" : 1650787052068,
      "summary" : {
        "operation" : "append",
        "flink.job-id" : "9546c176d5418b18ee19c7fc6905152e",
        "flink.max-committed-checkpoint-id" : "9223372036854775807",
        "added-data-files" : "1",
        "added-records" : "1",
        "added-files-size" : "658",
        "changed-partition-count" : "1",
        "total-records" : "1",
        "total-files-size" : "658",
        "total-data-files" : "1",
        "total-delete-files" : "0",
        "total-position-deletes" : "0",
        "total-equality-deletes" : "0"
      },
      "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro",
      "schema-id" : 0
    } ],
    "snapshot-log" : [ {
      "timestamp-ms" : 1650787052068,
      "snapshot-id" : 6040080682987879495
    } ],
    "metadata-log" : [ {
      "timestamp-ms" : 1650786845358,
      "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json"
    } ]
    }
    ```

    Note "current-snapshot-id" : 6040080682987879495, the current snapshot id. The snapshots and snapshot-log lists both contain this snapshot, and metadata-log contains the previous version's metadata. The current snapshot's manifest list is snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro.

The current snapshot created one data file and added one record:

```bash
"added-data-files" : "1",
"added-records" : "1",
```


  • Look at the manifest list (snapshot) file

    ```bash
    curl -O https://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.10.2/avro-tools-1.10.2.jar
    ```

    :::info Viewing Avro-format files requires an external tool, avro-tools-1.10.2.jar :::

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
    {"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":6040080682987879495},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
    ```

    The manifest list has only one row.

Below is the JSON pretty-printed; it references the manifest file 7a381735-54d2-402d-be25-0c09c6f35328-m0.avro:

```bash

{
    "manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro",
    "manifest_length":5793,
    "partition_spec_id":0,
    "added_snapshot_id":{
        "long":6040080682987879495
    },
    "added_data_files_count":{
        "int":1
    },
    "existing_data_files_count":{
        "int":0
    },
    "deleted_data_files_count":{
        "int":0
    },
    "partitions":{
        "array":[

        ]
    },
    "added_rows_count":{
        "long":1
    },
    "existing_rows_count":{
        "long":0
    },
    "deleted_rows_count":{
        "long":0
    }
}
```

  • Look at the manifest file

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro

    {"status":1,"snapshot_id":{"long":6040080682987879495},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":658,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":52},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
    ```

The manifest file has only one row.

Below is the JSON pretty-printed; it references the data file 00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet, with status set to 1 (0: EXISTING, 1: ADDED, 2: DELETED):

```bash
{
    "status":1,
    "snapshot_id":{
        "long":6040080682987879495
    },
    "data_file":{
        "file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet",
        "file_format":"PARQUET",
        "partition":{

        },
        "record_count":1,
        "file_size_in_bytes":658,
        "block_size_in_bytes":67108864,
        "column_sizes":{
            "array":[
                {
                    "key":1,
                    "value":52
                },
                {
                    "key":2,
                    "value":52
                }
            ]
        },
        "value_counts":{
            "array":[
                {
                    "key":1,
                    "value":1
                },
                {
                    "key":2,
                    "value":1
                }
            ]
        },
        "null_value_counts":{
            "array":[
                {
                    "key":1,
                    "value":0
                },
                {
                    "key":2,
                    "value":0
                }
            ]
        },
        "nan_value_counts":{
            "array":[

            ]
        },
        "lower_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"a"
                }
            ]
        },
        "upper_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"a"
                }
            ]
        },
        "key_metadata":null,
        "split_offsets":{
            "array":[
                4
            ]
        },
        "sort_order_id":{
            "int":0
        }
    }
}
```

## Insert another row

  • Write data

    ```sql
    INSERT INTO iceberg_table(id, data) VALUES (2, 'b');
    ```
  • Inspect the HDFS directory: a new v3.metadata.json, a manifest list (snap-*.avro), a manifest file (*-m0.avro), and a data file (*.parquet) were added

    ```bash
    hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
    drwxr-xr-x   - root supergroup          0 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/data
    -rw-r--r--   3 root supergroup        657 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet
    -rw-r--r--   3 root supergroup        658 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet
    drwxr-xr-x   - root supergroup          0 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
    -rw-r--r--   3 root supergroup       5792 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro
    -rw-r--r--   3 root supergroup       5793 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
    -rw-r--r--   3 root supergroup       3848 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro
    -rw-r--r--   3 root supergroup       3776 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
    -rw-r--r--   3 root supergroup       1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
    -rw-r--r--   3 root supergroup       2218 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
    -rw-r--r--   3 root supergroup       3295 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
    -rw-r--r--   3 root supergroup          1 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    ```
    
  • Look at version-hint.text, which is now 3

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    3
    ```
    
  • Look at v3.metadata.json

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
    {
    "format-version" : 1,
    "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
    "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
    "last-updated-ms" : 1650788733525,
    "last-column-id" : 2,
    "schema" : {
      "type" : "struct",
      "schema-id" : 0,
      "fields" : [ {
        "id" : 1,
        "name" : "id",
        "required" : false,
        "type" : "long"
      }, {
        "id" : 2,
        "name" : "data",
        "required" : false,
        "type" : "string"
      } ]
    },
    "current-schema-id" : 0,
    "schemas" : [ {
      "type" : "struct",
      "schema-id" : 0,
      "fields" : [ {
        "id" : 1,
        "name" : "id",
        "required" : false,
        "type" : "long"
      }, {
        "id" : 2,
        "name" : "data",
        "required" : false,
        "type" : "string"
      } ]
    } ],
    "partition-spec" : [ ],
    "default-spec-id" : 0,
    "partition-specs" : [ {
      "spec-id" : 0,
      "fields" : [ ]
    } ],
    "last-partition-id" : 999,
    "default-sort-order-id" : 0,
    "sort-orders" : [ {
      "order-id" : 0,
      "fields" : [ ]
    } ],
    "properties" : { },
    "current-snapshot-id" : 2331249842714188418,
    "snapshots" : [ {
      "snapshot-id" : 6040080682987879495,
      "timestamp-ms" : 1650787052068,
      "summary" : {
        "operation" : "append",
        "flink.job-id" : "9546c176d5418b18ee19c7fc6905152e",
        "flink.max-committed-checkpoint-id" : "9223372036854775807",
        "added-data-files" : "1",
        "added-records" : "1",
        "added-files-size" : "658",
        "changed-partition-count" : "1",
        "total-records" : "1",
        "total-files-size" : "658",
        "total-data-files" : "1",
        "total-delete-files" : "0",
        "total-position-deletes" : "0",
        "total-equality-deletes" : "0"
      },
      "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro",
      "schema-id" : 0
    }, {
      "snapshot-id" : 2331249842714188418,
      "parent-snapshot-id" : 6040080682987879495,
      "timestamp-ms" : 1650788733525,
      "summary" : {
        "operation" : "append",
        "flink.job-id" : "f6d1cefdf0f8bea368d80eb9081f6649",
        "flink.max-committed-checkpoint-id" : "9223372036854775807",
        "added-data-files" : "1",
        "added-records" : "1",
        "added-files-size" : "657",
        "changed-partition-count" : "1",
        "total-records" : "2",
        "total-files-size" : "1315",
        "total-data-files" : "2",
        "total-delete-files" : "0",
        "total-position-deletes" : "0",
        "total-equality-deletes" : "0"
      },
      "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro",
      "schema-id" : 0
    } ],
    "snapshot-log" : [ {
      "timestamp-ms" : 1650787052068,
      "snapshot-id" : 6040080682987879495
    }, {
      "timestamp-ms" : 1650788733525,
      "snapshot-id" : 2331249842714188418
    } ],
    "metadata-log" : [ {
      "timestamp-ms" : 1650786845358,
      "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json"
    }, {
      "timestamp-ms" : 1650787052068,
      "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json"
    } ]
    }
    ```

    Very similar to the state after the first insert. Note "current-snapshot-id" : 2331249842714188418. The snapshots and snapshot-log lists both gained this new snapshot, and metadata-log gained v2.metadata.json.

The current snapshot created one data file and added one record:

```bash
"added-data-files" : "1",
"added-records" : "1",
```
  • Look at the manifest list (snapshot) file

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro

    {"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro","manifest_length":5792,"partition_spec_id":0,"added_snapshot_id":{"long":2331249842714188418},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
    {"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":6040080682987879495},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
    ```

The manifest list now has two rows, corresponding to this insert and the previous one.

Below is the first row pretty-printed; it references the manifest file 2664ed23-930f-4510-9bf6-94d456155312-m0.avro:

```bash
{
    "manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro",
    "manifest_length":5792,
    "partition_spec_id":0,
    "added_snapshot_id":{
        "long":2331249842714188418
    },
    "added_data_files_count":{
        "int":1
    },
    "existing_data_files_count":{
        "int":0
    },
    "deleted_data_files_count":{
        "int":0
    },
    "partitions":{
        "array":[

        ]
    },
    "added_rows_count":{
        "long":1
    },
    "existing_rows_count":{
        "long":0
    },
    "deleted_rows_count":{
        "long":0
    }
}

```

Below is the second row pretty-printed; it references the manifest file 7a381735-54d2-402d-be25-0c09c6f35328-m0.avro:

```bash
{
    "manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro",
    "manifest_length":5793,
    "partition_spec_id":0,
    "added_snapshot_id":{
        "long":6040080682987879495
    },
    "added_data_files_count":{
        "int":1
    },
    "existing_data_files_count":{
        "int":0
    },
    "deleted_data_files_count":{
        "int":0
    },
    "partitions":{
        "array":[

        ]
    },
    "added_rows_count":{
        "long":1
    },
    "existing_rows_count":{
        "long":0
    },
    "deleted_rows_count":{
        "long":0
    }
}
```

  • Look at the manifest file

    The second manifest file was produced by the first insert, so here we inspect the first one (2664ed23-930f-4510-9bf6-94d456155312-m0.avro).

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro

    {"status":1,"snapshot_id":{"long":2331249842714188418},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":657,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":51},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
    ```

Like the first insert, the manifest file has only one row.

Below is the JSON pretty-printed; it references the data file 00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet, with status set to 1 (0: EXISTING, 1: ADDED, 2: DELETED):

```bash
{
    "status":1,
    "snapshot_id":{
        "long":2331249842714188418
    },
    "data_file":{
        "file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet",
        "file_format":"PARQUET",
        "partition":{

        },
        "record_count":1,
        "file_size_in_bytes":657,
        "block_size_in_bytes":67108864,
        "column_sizes":{
            "array":[
                {
                    "key":1,
                    "value":51
                },
                {
                    "key":2,
                    "value":52
                }
            ]
        },
        "value_counts":{
            "array":[
                {
                    "key":1,
                    "value":1
                },
                {
                    "key":2,
                    "value":1
                }
            ]
        },
        "null_value_counts":{
            "array":[
                {
                    "key":1,
                    "value":0
                },
                {
                    "key":2,
                    "value":0
                }
            ]
        },
        "nan_value_counts":{
            "array":[

            ]
        },
        "lower_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"b"
                }
            ]
        },
        "upper_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"b"
                }
            ]
        },
        "key_metadata":null,
        "split_offsets":{
            "array":[
                4
            ]
        },
        "sort_order_id":{
            "int":0
        }
    }
}
```

## INSERT OVERWRITE

  • Overwrite data

    ```sql
    SET 'execution.runtime-mode' = 'batch';

    INSERT OVERWRITE iceberg_table VALUES (1, 'a');
    ```
  • Inspect the HDFS directory: a new v4.metadata.json, a manifest list (snap-*.avro), 3 manifest files (*-m0.avro, *-m1.avro, *-m2.avro), and a data file (*.parquet) were added

    ```bash
hdfs dfs -ls -R /warehouse/hadoop/iceberg_db/iceberg_table
drwxr-xr-x   - root supergroup          0 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/data
-rw-r--r--   3 root supergroup        657 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet
-rw-r--r--   3 root supergroup        658 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet
-rw-r--r--   3 root supergroup        658 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet
drwxr-xr-x   - root supergroup          0 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata
-rw-r--r--   3 root supergroup       5792 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/2664ed23-930f-4510-9bf6-94d456155312-m0.avro
-rw-r--r--   3 root supergroup       5793 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro
-rw-r--r--   3 root supergroup       5793 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro
-rw-r--r--   3 root supergroup       5793 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro
-rw-r--r--   3 root supergroup       5793 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/7a381735-54d2-402d-be25-0c09c6f35328-m0.avro
-rw-r--r--   3 root supergroup       3848 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro
-rw-r--r--   3 root supergroup       3776 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro
-rw-r--r--   3 root supergroup       3808 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-8430489177598754103-1-70302424-0f36-41c1-9217-3da72ccc56dd.avro
-rw-r--r--   3 root supergroup       1176 2022-04-24 15:54 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json
-rw-r--r--   3 root supergroup       2218 2022-04-24 15:57 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json
-rw-r--r--   3 root supergroup       3295 2022-04-24 16:25 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json
-rw-r--r--   3 root supergroup       4513 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
-rw-r--r--   3 root supergroup          1 2022-04-24 16:51 /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
```
  • Look at version-hint.text, which is now 4

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/version-hint.text
    4
    ```
    
  • Look at v4.metadata.json

    ```bash
    hdfs dfs -cat /warehouse/hadoop/iceberg_db/iceberg_table/metadata/v4.metadata.json
    {
    "format-version" : 1,
    "table-uuid" : "d3afd3a9-f37c-44a8-9846-b8a0a59272e3",
    "location" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table",
    "last-updated-ms" : 1650790279485,
    "last-column-id" : 2,
    "schema" : {
      "type" : "struct",
      "schema-id" : 0,
      "fields" : [ {
        "id" : 1,
        "name" : "id",
        "required" : false,
        "type" : "long"
      }, {
        "id" : 2,
        "name" : "data",
        "required" : false,
        "type" : "string"
      } ]
    },
    "current-schema-id" : 0,
    "schemas" : [ {
      "type" : "struct",
      "schema-id" : 0,
      "fields" : [ {
        "id" : 1,
        "name" : "id",
        "required" : false,
        "type" : "long"
      }, {
        "id" : 2,
        "name" : "data",
        "required" : false,
        "type" : "string"
      } ]
    } ],
    "partition-spec" : [ ],
    "default-spec-id" : 0,
    "partition-specs" : [ {
      "spec-id" : 0,
      "fields" : [ ]
    } ],
    "last-partition-id" : 999,
    "default-sort-order-id" : 0,
    "sort-orders" : [ {
      "order-id" : 0,
      "fields" : [ ]
    } ],
    "properties" : { },
    "current-snapshot-id" : 8430489177598754103,
    "snapshots" : [ {
      "snapshot-id" : 6040080682987879495,
      "timestamp-ms" : 1650787052068,
      "summary" : {
        "operation" : "append",
        "flink.job-id" : "9546c176d5418b18ee19c7fc6905152e",
        "flink.max-committed-checkpoint-id" : "9223372036854775807",
        "added-data-files" : "1",
        "added-records" : "1",
        "added-files-size" : "658",
        "changed-partition-count" : "1",
        "total-records" : "1",
        "total-files-size" : "658",
        "total-data-files" : "1",
        "total-delete-files" : "0",
        "total-position-deletes" : "0",
        "total-equality-deletes" : "0"
      },
      "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-6040080682987879495-1-7a381735-54d2-402d-be25-0c09c6f35328.avro",
      "schema-id" : 0
    }, {
      "snapshot-id" : 2331249842714188418,
      "parent-snapshot-id" : 6040080682987879495,
      "timestamp-ms" : 1650788733525,
      "summary" : {
        "operation" : "append",
        "flink.job-id" : "f6d1cefdf0f8bea368d80eb9081f6649",
        "flink.max-committed-checkpoint-id" : "9223372036854775807",
        "added-data-files" : "1",
        "added-records" : "1",
        "added-files-size" : "657",
        "changed-partition-count" : "1",
        "total-records" : "2",
        "total-files-size" : "1315",
        "total-data-files" : "2",
        "total-delete-files" : "0",
        "total-position-deletes" : "0",
        "total-equality-deletes" : "0"
      },
      "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-2331249842714188418-1-2664ed23-930f-4510-9bf6-94d456155312.avro",
      "schema-id" : 0
    }, {
      "snapshot-id" : 8430489177598754103,
      "parent-snapshot-id" : 2331249842714188418,
      "timestamp-ms" : 1650790279485,
      "summary" : {
        "operation" : "overwrite",
        "replace-partitions" : "true",
        "flink.job-id" : "27db50de8bf7d70905450e54300b3a41",
        "flink.max-committed-checkpoint-id" : "9223372036854775807",
        "added-data-files" : "1",
        "deleted-data-files" : "2",
        "added-records" : "1",
        "deleted-records" : "2",
        "added-files-size" : "658",
        "removed-files-size" : "1315",
        "changed-partition-count" : "1",
        "total-records" : "1",
        "total-files-size" : "658",
        "total-data-files" : "1",
        "total-delete-files" : "0",
        "total-position-deletes" : "0",
        "total-equality-deletes" : "0"
      },
      "manifest-list" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-8430489177598754103-1-70302424-0f36-41c1-9217-3da72ccc56dd.avro",
      "schema-id" : 0
    } ],
    "snapshot-log" : [ {
      "timestamp-ms" : 1650787052068,
      "snapshot-id" : 6040080682987879495
    }, {
      "timestamp-ms" : 1650788733525,
      "snapshot-id" : 2331249842714188418
    }, {
      "timestamp-ms" : 1650790279485,
      "snapshot-id" : 8430489177598754103
    } ],
    "metadata-log" : [ {
      "timestamp-ms" : 1650786845358,
      "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v1.metadata.json"
    }, {
      "timestamp-ms" : 1650787052068,
      "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v2.metadata.json"
    }, {
      "timestamp-ms" : 1650788733525,
      "metadata-file" : "hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/v3.metadata.json"
    } ]
    }
    ```

    The current snapshot first deleted two data files, then created one data file and added one record; the snapshot's operation type is overwrite.

    ```bash
    "added-data-files" : "1",
    "deleted-data-files" : "2",
    "added-records" : "1",

    "operation" : "overwrite",
    ```

  • Look at the manifest list (snapshot) file

    ```bash
java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/snap-8430489177598754103-1-70302424-0f36-41c1-9217-3da72ccc56dd.avro

{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":8430489177598754103},"added_data_files_count":{"int":1},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":0},"partitions":{"array":[]},"added_rows_count":{"long":1},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":0}}
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":8430489177598754103},"added_data_files_count":{"int":0},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":1},"partitions":{"array":[]},"added_rows_count":{"long":0},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":1}}
{"manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro","manifest_length":5793,"partition_spec_id":0,"added_snapshot_id":{"long":8430489177598754103},"added_data_files_count":{"int":0},"existing_data_files_count":{"int":0},"deleted_data_files_count":{"int":1},"partitions":{"array":[]},"added_rows_count":{"long":0},"existing_rows_count":{"long":0},"deleted_rows_count":{"long":1}}

```

The manifest list has three rows, covering this overwrite and the previous inserts.

Below is the first row pretty-printed; it references the manifest file 70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro:

```bash
{
    "manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro",
    "manifest_length":5793,
    "partition_spec_id":0,
    "added_snapshot_id":{
        "long":8430489177598754103
    },
    "added_data_files_count":{
        "int":1
    },
    "existing_data_files_count":{
        "int":0
    },
    "deleted_data_files_count":{
        "int":0
    },
    "partitions":{
        "array":[

        ]
    },
    "added_rows_count":{
        "long":1
    },
    "existing_rows_count":{
        "long":0
    },
    "deleted_rows_count":{
        "long":0
    }
}
```

Below is the second row pretty-printed; it references the manifest file 70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro:

```bash
{
    "manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro",
    "manifest_length":5793,
    "partition_spec_id":0,
    "added_snapshot_id":{
        "long":8430489177598754103
    },
    "added_data_files_count":{
        "int":0
    },
    "existing_data_files_count":{
        "int":0
    },
    "deleted_data_files_count":{
        "int":1
    },
    "partitions":{
        "array":[

        ]
    },
    "added_rows_count":{
        "long":0
    },
    "existing_rows_count":{
        "long":0
    },
    "deleted_rows_count":{
        "long":1
    }
}
```

Below is the third row pretty-printed; it references the manifest file 70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro:

```bash
{
    "manifest_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro",
    "manifest_length":5793,
    "partition_spec_id":0,
    "added_snapshot_id":{
        "long":8430489177598754103
    },
    "added_data_files_count":{
        "int":0
    },
    "existing_data_files_count":{
        "int":0
    },
    "deleted_data_files_count":{
        "int":1
    },
    "partitions":{
        "array":[

        ]
    },
    "added_rows_count":{
        "long":0
    },
    "existing_rows_count":{
        "long":0
    },
    "deleted_rows_count":{
        "long":1
    }
}
```

  • Look at the manifest file 70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m0.avro

    {"status":2,"snapshot_id":{"long":8430489177598754103},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":658,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":52},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
    ```


Below is the JSON pretty-printed; it references the data file 00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet (the data file from the first insert), with status set to 2 (0: EXISTING, 1: ADDED, 2: DELETED):

```bash
{
    "status":2,
    "snapshot_id":{
        "long":8430489177598754103
    },
    "data_file":{
        "file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-6b3231f5-6fbf-4bfd-86c2-5fea9530cf97-00001.parquet",
        "file_format":"PARQUET",
        "partition":{

        },
        "record_count":1,
        "file_size_in_bytes":658,
        "block_size_in_bytes":67108864,
        "column_sizes":{
            "array":[
                {
                    "key":1,
                    "value":52
                },
                {
                    "key":2,
                    "value":52
                }
            ]
        },
        "value_counts":{
            "array":[
                {
                    "key":1,
                    "value":1
                },
                {
                    "key":2,
                    "value":1
                }
            ]
        },
        "null_value_counts":{
            "array":[
                {
                    "key":1,
                    "value":0
                },
                {
                    "key":2,
                    "value":0
                }
            ]
        },
        "nan_value_counts":{
            "array":[

            ]
        },
        "lower_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"a"
                }
            ]
        },
        "upper_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"a"
                }
            ]
        },
        "key_metadata":null,
        "split_offsets":{
            "array":[
                4
            ]
        },
        "sort_order_id":{
            "int":0
        }
    }
}
```

  • Look at the manifest file 70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m1.avro

    {"status":2,"snapshot_id":{"long":8430489177598754103},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":657,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":51},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"b"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
    ```

Below is the JSON pretty-printed; it references the data file 00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet (the data file from the second insert), with status set to 2 (0: EXISTING, 1: ADDED, 2: DELETED):

```bash
{
    "status":2,
    "snapshot_id":{
        "long":8430489177598754103
    },
    "data_file":{
        "file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-2378883f-c101-42dd-bb49-b207f2466234-00001.parquet",
        "file_format":"PARQUET",
        "partition":{

        },
        "record_count":1,
        "file_size_in_bytes":657,
        "block_size_in_bytes":67108864,
        "column_sizes":{
            "array":[
                {
                    "key":1,
                    "value":51
                },
                {
                    "key":2,
                    "value":52
                }
            ]
        },
        "value_counts":{
            "array":[
                {
                    "key":1,
                    "value":1
                },
                {
                    "key":2,
                    "value":1
                }
            ]
        },
        "null_value_counts":{
            "array":[
                {
                    "key":1,
                    "value":0
                },
                {
                    "key":2,
                    "value":0
                }
            ]
        },
        "nan_value_counts":{
            "array":[

            ]
        },
        "lower_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"b"
                }
            ]
        },
        "upper_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"b"
                }
            ]
        },
        "key_metadata":null,
        "split_offsets":{
            "array":[
                4
            ]
        },
        "sort_order_id":{
            "int":0
        }
    }
}
```

  • Look at the manifest file 70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro

    ```bash
    java -jar avro-tools-1.10.2.jar tojson hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/metadata/70302424-0f36-41c1-9217-3da72ccc56dd-m2.avro

    {"status":1,"snapshot_id":{"long":8430489177598754103},"data_file":{"file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet","file_format":"PARQUET","partition":{},"record_count":1,"file_size_in_bytes":658,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":52},{"key":2,"value":52}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"a"}]},"key_metadata":null,"split_offsets":{"array":[4]},"sort_order_id":{"int":0}}}
    ```


Below is the JSON pretty-printed; it references the data file 00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet, with status set to 1 (0: EXISTING, 1: ADDED, 2: DELETED):

```bash
{
    "status":1,
    "snapshot_id":{
        "long":8430489177598754103
    },
    "data_file":{
        "file_path":"hdfs://datalake:9000/warehouse/hadoop/iceberg_db/iceberg_table/data/00000-0-afe1b4df-33a1-49d3-904c-0ba96815aeb4-00001.parquet",
        "file_format":"PARQUET",
        "partition":{

        },
        "record_count":1,
        "file_size_in_bytes":658,
        "block_size_in_bytes":67108864,
        "column_sizes":{
            "array":[
                {
                    "key":1,
                    "value":52
                },
                {
                    "key":2,
                    "value":52
                }
            ]
        },
        "value_counts":{
            "array":[
                {
                    "key":1,
                    "value":1
                },
                {
                    "key":2,
                    "value":1
                }
            ]
        },
        "null_value_counts":{
            "array":[
                {
                    "key":1,
                    "value":0
                },
                {
                    "key":2,
                    "value":0
                }
            ]
        },
        "nan_value_counts":{
            "array":[

            ]
        },
        "lower_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"a"
                }
            ]
        },
        "upper_bounds":{
            "array":[
                {
                    "key":1,
                    "value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
                },
                {
                    "key":2,
                    "value":"a"
                }
            ]
        },
        "key_metadata":null,
        "split_offsets":{
            "array":[
                4
            ]
        },
        "sort_order_id":{
            "int":0
        }
    }
}
```