Hudi Integration and Usage Tutorial
The tutorial below is based on Hudi version 0.10.1.
Integrating with Hive
Download hudi-hadoop-mr-bundle-0.10.1.jar from the Maven repository, put it in the lib directory under the Hive installation path, and then restart Hive.
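As a minimal sketch of that step (the download URL follows the standard Maven Central layout, and the Hive install path /opt/hive is an assumption; substitute your own paths and restart commands):

# download the Hive bundle for Hudi 0.10.1 from Maven Central
wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-hadoop-mr-bundle/0.10.1/hudi-hadoop-mr-bundle-0.10.1.jar
# copy it into Hive's lib directory (assumed path) so HiveServer2 and the metastore can load it
cp hudi-hadoop-mr-bundle-0.10.1.jar /opt/hive/lib/
# restart the Hive services for your deployment (e.g. HiveServer2 and the metastore)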

Integrating with Spark
Download the matching Spark-Hudi bundle from Maven: hudi-spark{spark version}-bundle_{scala version}-{hudi version}.jar
In my case that is: hudi-spark3.1.2-bundle_2.12-0.10.1.jar
./bin/spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --jars /opt/spark3/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar \
  --packages org.apache.spark:spark-avro_2.12:3.1.2 \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
Parameter details
- --conf sets a Spark configuration property; spark.serializer specifies the serializer class
- --jars loads the given jar(s) when the job starts
- --packages does the same as --jars, except the artifacts are resolved and downloaded from the default Maven repository at run time
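If you would rather not repeat these options on every launch, the same settings can be placed in conf/spark-defaults.conf (a sketch assuming the jar path used above; the keys are standard Spark configuration properties):

spark.serializer       org.apache.spark.serializer.KryoSerializer
spark.sql.extensions   org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.jars             /opt/spark3/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar
spark.jars.packages    org.apache.spark:spark-avro_2.12:3.1.2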
Running a test
Use the Spark command above to enter the spark-sql CLI:
[root@cdh1 spark3]# ./bin/spark-sql \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
> --jars /opt/spark3/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar \
> --packages org.apache.spark:spark-avro_2.12:3.1.2 \
> --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
:: loading settings :: url = jar:file:/opt/spark3/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e30f3b1d-dacd-4844-a577-bebc7325d53a;1.0
	confs: [default]
	found org.apache.spark#spark-avro_2.12;3.1.2 in central
	found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 286ms :: artifacts dl 4ms
	:: modules in use:
	org.apache.spark#spark-avro_2.12;3.1.2 from central in [default]
	org.spark-project.spark#unused;1.0.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e30f3b1d-dacd-4844-a577-bebc7325d53a
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/12ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR, /opt/cloudera/hive2/conf/ivysettings.xml will be used
1300 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1648865248163
spark-sql>
- Create a test table
create table cs1 (
  id int,
  name string,
  price double,
  dt string
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'mor'
)
location '/user/hudi/cs1';
location specifies the HDFS storage path
- Insert data
insert into cs1 values(1,'张三',10.25,"detail");
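To verify the write, you can query the table back in the same spark-sql session (a quick sketch; selecting * would also show Hudi's _hoodie_* metadata columns):

select id, name, price, dt from cs1;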
Check the table structure from Hive
Three tables are generated: cs1, cs1_ro, and cs1_rt
- ro: read-optimized view, which queries only the data in the Parquet files
- rt: real-time view, which queries both the Parquet files and the contents of the log files
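Assuming the table has been synced to Hive with the naming convention above, the two views can be queried separately, for example (a sketch, not tied to any particular Hive setup):

-- read-optimized view: only the compacted Parquet data
select * from cs1_ro;
-- real-time view: Parquet data merged with any not-yet-compacted log files
select * from cs1_rt;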
At this point the data is on HDFS
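You can confirm this from the command line (a sketch; the path is the location given when the table was created, and the exact file layout depends on the table type and compaction state):

hadoop fs -ls -R /user/hudi/cs1
# expect a .hoodie/ metadata directory plus partition directories holding Parquet base files
# (and .log files once a MOR table has received updates that have not yet been compacted)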

