Hudi Integration Tutorial

The tutorial below uses Hudi version 0.10.1 throughout.

Integrating with Hive

Download hudi-hadoop-mr-bundle-0.10.1.jar from the Maven repository, place it in the lib directory under the Hive installation path, then restart Hive.

(Hudi Integration Tutorial - Figure 1)
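The install step above can be sketched as follows. The paths (`HIVE_HOME`, working directory) are assumptions for a typical layout; adjust them to your environment.

```shell
# Assumed Hive installation path -- change to match your cluster.
HIVE_HOME=/opt/hive
HUDI_VERSION=0.10.1

# Download the Hive bundle jar from Maven Central (standard repository layout).
wget "https://repo1.maven.org/maven2/org/apache/hudi/hudi-hadoop-mr-bundle/${HUDI_VERSION}/hudi-hadoop-mr-bundle-${HUDI_VERSION}.jar"

# Put the jar on Hive's classpath.
cp "hudi-hadoop-mr-bundle-${HUDI_VERSION}.jar" "${HIVE_HOME}/lib/"

# Restart the Hive services so the new jar is picked up
# (via your cluster manager, or by restarting HiveServer2/metastore manually).
```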

Integrating with Spark

Download the matching Spark bundle jar from Maven: hudi-spark{spark version}-bundle_{scala version}-{hudi version}.jar

For my environment, the matching jar is hudi-spark3.1.2-bundle_2.12-0.10.1.jar:

```shell
./bin/spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --jars /opt/spark3/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar \
  --packages org.apache.spark:spark-avro_2.12:3.1.2 \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```

Parameter notes

  • --conf sets a Spark configuration property; here spark.serializer specifies the serializer class (Hudi requires Kryo), and spark.sql.extensions registers the Hudi SQL extension.
  • --jars loads the listed jar(s) at runtime.
  • --packages does the same job as --jars, except the artifact is resolved and downloaded from the default Maven repository at launch time.
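Equivalently, these settings can be made permanent in conf/spark-defaults.conf so they do not have to be passed on every launch. This is a sketch; the jar path assumes the location used above.

```properties
spark.serializer       org.apache.spark.serializer.KryoSerializer
spark.sql.extensions   org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.jars             /opt/spark3/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar
spark.jars.packages    org.apache.spark:spark-avro_2.12:3.1.2
```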

Running a test

Use the Spark command above to enter the spark-sql CLI:

```shell
[root@cdh1 spark3]# ./bin/spark-sql \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
> --jars /opt/spark3/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar \
> --packages org.apache.spark:spark-avro_2.12:3.1.2 \
> --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
:: loading settings :: url = jar:file:/opt/spark3/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e30f3b1d-dacd-4844-a577-bebc7325d53a;1.0
    confs: [default]
    found org.apache.spark#spark-avro_2.12;3.1.2 in central
    found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 286ms :: artifacts dl 4ms
    :: modules in use:
    org.apache.spark#spark-avro_2.12;3.1.2 from central in [default]
    org.spark-project.spark#unused;1.0.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e30f3b1d-dacd-4844-a577-bebc7325d53a
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/12ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/opt/cloudera/hive2/conf/ivysettings.xml will be used
1300 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1648865248163
spark-sql>
```
Create a test table:
```sql
create table cs1 (
  id int,
  name string,
  price double,
  dt string
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'mor'
)
location '/user/hudi/cs1';
```

location specifies the HDFS storage path.
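For comparison, a copy-on-write (COW) table uses the same syntax with type = 'cow'. This is a sketch with an assumed table name and path:

```sql
create table cs1_cow (
  id int,
  name string,
  price double,
  dt string
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'cow'   -- copy-on-write: data lives only in Parquet, no log files
)
location '/user/hudi/cs1_cow';
```

Because a COW table has no log files to merge, it shows up in Hive as a single table rather than the _ro/_rt pair produced for MOR tables below.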

Insert data:

```sql
insert into cs1 values (1, '张三', 10.25, 'detail');
```
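Beyond insert, Spark SQL on Hudi 0.10.1 also supports update and delete against the table, routed through the primary key. A sketch against the cs1 table above:

```sql
-- update a column for a given primary key
update cs1 set price = 20.5 where id = 1;

-- delete a row
delete from cs1 where id = 1;

-- read the table back
select id, name, price, dt from cs1;
```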


Inspect the table structure from Hive:
(Hudi Integration Tutorial - Figure 2)
Three tables are generated: cs1, cs1_ro, and cs1_rt.

  • _ro — read-optimized view: queries only the data in the Parquet files
  • _rt — real-time view: queries both the Parquet files and the contents of the log files
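In Hive the two views are queried as ordinary tables; the trade-off is freshness versus cost:

```sql
-- read-optimized view: fast, but may lag behind the latest writes
-- because it skips the log files
select * from cs1_ro;

-- real-time view: merges Parquet with the log files at query time,
-- so it sees the freshest data at a higher query cost
select * from cs1_rt;
```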

At this point the data is on HDFS:

(Hudi Integration Tutorial - Figure 3)
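The layout on HDFS can be inspected directly; the path comes from the location clause above. In a MOR table you would expect to see partition directories (here dt=detail), Parquet base files, and a .hoodie metadata directory.

```shell
# Recursively list the Hudi table directory created by the DDL above.
hdfs dfs -ls -R /user/hudi/cs1
```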