How to Build an Iceberg Data Lake on Aliyun EMR

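Test environment
Step 1: Build iceberg-spark3-runtime-*.jar from the community master branch
Step 2: Configure and launch the spark-sql client
Step 3: Create the database
Step 4: Create the test table
Step 5: Insert test records
Step 6: Query the test data
Step 7: Inspect the Iceberg data on OSS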

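Test environment

Aliyun EMR:
- Version 5.2.0;
- For the versions of the individual Hadoop components, see the documentation;
Apache Iceberg:
- The master branch of the https://github.com/apache/iceberg repository;
- Corresponding commit id: 40e626a168a3137b42d023fb8f8f3835f3cd0a78
Spark: 3.1.1
Hive: 3.1.2
Hadoop: 3.2.1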

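Step 1: Build iceberg-spark3-runtime-*.jar from the community master branch

Why build from the master branch?

The Spark shipped with Aliyun EMR 5.2.0 corresponds to the community 3.1.1 release, whereas the most recent Apache Iceberg release, 0.11.1, supports only Spark 3.0.1 and has some compatibility problems with Spark 3.1.1. Running Iceberg on Aliyun EMR 5.2.0 therefore requires Iceberg 0.12.0 or later. Since Iceberg 0.12.0 has not been released yet, the practical option is to build iceberg-spark3-runtime-*.jar yourself from the master branch.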

$ git clone git@github.com:apache/iceberg.git
$ cd iceberg
$ ./gradlew build -x test
...
$ ls -atlr spark3-runtime/build/libs/iceberg-spark3-runtime-*.jar      
-rw-r--r--  1 openinx  staff  26679923 Jul 12 17:42 spark3-runtime/build/libs/iceberg-spark3-runtime-25eaeba.jar
-rw-r--r--  1 openinx  staff      5660 Jul 12 17:42 spark3-runtime/build/libs/iceberg-spark3-runtime-25eaeba-javadoc.jar
-rw-r--r--  1 openinx  staff      5660 Jul 12 17:42 spark3-runtime/build/libs/iceberg-spark3-runtime-25eaeba-sources.jar
-rw-r--r--  1 openinx  staff      5660 Jul 12 17:42 spark3-runtime/build/libs/iceberg-spark3-runtime-25eaeba-tests.jar

Step 2: Configure and launch the spark-sql client

Upload the compiled iceberg-spark3-runtime-xx.jar to the Spark master node (the ~/jars directory must already exist there):

scp spark3-runtime/build/libs/iceberg-spark3-runtime-25eaeba.jar  root@114.55.65.83:~/jars
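
The build directory listed earlier contains four jars (runtime, javadoc, sources, and tests), and only the runtime jar needs to reach the cluster. The following is a minimal sketch of selecting it before the upload. The function pick_runtime_jar is a hypothetical helper introduced here for illustration; it is not part of Iceberg or EMR. The host address is the one used in the scp command above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Iceberg or EMR): print the path of the
# runtime jar in a build directory, skipping the -javadoc/-sources/-tests jars.
pick_runtime_jar() {
    local dir="$1" jar
    for jar in "$dir"/iceberg-spark3-runtime-*.jar; do
        case "$jar" in
            *-javadoc.jar|*-sources.jar|*-tests.jar) continue ;;
        esac
        # Skip the unexpanded glob pattern when the directory has no match.
        [ -f "$jar" ] || continue
        printf '%s\n' "$jar"
        return 0
    done
    return 1
}

# Example usage (commented out because it needs SSH access to the master node):
# jar=$(pick_runtime_jar spark3-runtime/build/libs) || exit 1
# ssh root@114.55.65.83 'mkdir -p ~/jars'
# scp "$jar" root@114.55.65.83:~/jars
```

Uploading only the runtime jar also keeps the iceberg-spark3-runtime*.jar glob passed to --jars from matching the auxiliary jars.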

Start spark-sql with the following command:

#!/bin/bash
spark-sql --jars /root/jars/iceberg-spark3-runtime*.jar \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_prod.type=hive \
    --conf spark.sql.catalog.hive_prod.uri=thrift://emr-header-1.cluster-236286:9083 \
    --conf spark.sql.catalog.hive_prod.warehouse=oss://emr-iceberg/warehouse \
    --conf spark.sql.shuffle.partitions=2000

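Configuration notes:

spark.sql.catalog.hive_prod: hive_prod is the name given to the Hive catalog here; you may choose a different name.
spark.sql.catalog.hive_prod.type: the catalog type. Iceberg currently supports three types: hive, hadoop, and custom.
spark.sql.catalog.hive_prod.warehouse: the OSS location where the Iceberg data is stored. The bucket must be in the same region as the OSS region configured for EMR. For example, my EMR cluster is in the Hangzhou region, and my bucket emr-iceberg is also in Hangzhou. (If the bucket region differs from the EMR region, you may find that the OSS path cannot be accessed correctly.)

spark.sql.catalog.hive_prod.uri: the thrift URI of the hive-metastore. It can be obtained with the following command: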

[root@emr-header-1 ~]# cat /etc/ecm/hive-conf/hive-site.xml  | grep -B 1 -A 2 hive.metastore.uris
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://emr-header-1.cluster-236286:9083</value>
</property>

spark.sql.shuffle.partitions: the number of partitions Spark uses when shuffling data.
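
The catalog settings described above can also be assembled by a small script instead of being typed by hand. The following is a minimal sketch under stated assumptions: metastore_uri and iceberg_catalog_confs are hypothetical helpers introduced here, and the URI extraction assumes that the <name> and <value> elements sit on consecutive lines, as in the grep output above. It is a line-based shortcut, not an XML parser.

```shell
#!/bin/bash
# Hypothetical helper: print the <value> that follows the
# hive.metastore.uris <name> element in a hive-site.xml file.
# Assumes <name> and <value> are on separate, consecutive lines.
metastore_uri() {
    sed -n '/<name>hive\.metastore\.uris<\/name>/{
        n
        s/.*<value>\(.*\)<\/value>.*/\1/p
    }' "$1"
}

# Hypothetical helper: print one spark-sql argument per line for an Iceberg
# catalog backed by the Hive metastore.
#   $1 = catalog name, $2 = metastore thrift URI, $3 = warehouse location
iceberg_catalog_confs() {
    local prefix="spark.sql.catalog.$1"
    printf '%s\n' \
        --conf "$prefix=org.apache.iceberg.spark.SparkCatalog" \
        --conf "$prefix.type=hive" \
        --conf "$prefix.uri=$2" \
        --conf "$prefix.warehouse=$3"
}

# Example usage in bash (commented out because it needs the EMR cluster):
# uri=$(metastore_uri /etc/ecm/hive-conf/hive-site.xml)
# mapfile -t confs < <(iceberg_catalog_confs hive_prod "$uri" oss://emr-iceberg/warehouse)
# spark-sql --jars /root/jars/iceberg-spark3-runtime*.jar \
#     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
#     "${confs[@]}"
```

Printing one argument per line lets the caller load the flags into an array, so the values stay intact even if a path ever contains spaces.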

Step 3: Create the database

CREATE DATABASE hive_prod.iceberg_db2;

Step 4: Create the test table

CREATE TABLE hive_prod.iceberg_db2.sample 
(
  id    BIGINT,
  data  STRING
) 
USING iceberg;

Step 5: Insert test records

INSERT INTO hive_prod.iceberg_db2.sample VALUES (1, 'aaa'), (2, 'bbb');

Step 6: Query the test data

SELECT data, count(id)
FROM hive_prod.iceberg_db2.sample
GROUP BY data;
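
With only the two rows inserted above, this query should return one row per value of data, each with a count of 1. To replay the SQL statements above without typing them interactively, they can be written to a file and passed to spark-sql with its -f option. The following is a minimal sketch: write_walkthrough_sql and the file path are hypothetical, and the IF NOT EXISTS clauses are an assumption added so that the DDL can be re-run.

```shell
#!/bin/bash
# Hypothetical helper: write the SQL statements of this walkthrough to a file.
# IF NOT EXISTS is added (an assumption, not in the original statements) so
# that the DDL can be re-run without failing.
write_walkthrough_sql() {
    cat > "$1" <<'EOF'
CREATE DATABASE IF NOT EXISTS hive_prod.iceberg_db2;
CREATE TABLE IF NOT EXISTS hive_prod.iceberg_db2.sample (
  id    BIGINT,
  data  STRING
) USING iceberg;
INSERT INTO hive_prod.iceberg_db2.sample VALUES (1, 'aaa'), (2, 'bbb');
SELECT data, count(id) FROM hive_prod.iceberg_db2.sample GROUP BY data;
EOF
}

# Example usage (commented out because it needs the cluster and the
# --jars/--conf options from the spark-sql launch command):
# write_walkthrough_sql /tmp/iceberg_walkthrough.sql
# spark-sql <same --jars and --conf options as before> -f /tmp/iceberg_walkthrough.sql
```

Note that every run appends the two rows again, so the counts grow by one per run.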