Hive - 10. Hive UDFs - 《大数据成神之路》

I. UDF development
II. Using UDFs in a Sentry-enabled CDH cluster

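I. UDF development

1. UDF development steps

1. Extend the UDF class
2. Implement the evaluate function
3. The function's return type cannot be void (a return type is required; returning null is allowed, but the declared type must not be void)
4. Text/LongWritable are the recommended types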
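Rule 3 can be checked before a jar is ever deployed. The following is a minimal sketch, using only the Java standard library, that lists the non-void evaluate methods of a class through reflection. EvaluateRuleCheck, GoodUdf, and BadUdf are hypothetical names; the two nested classes stand in for real UDF classes (which would extend org.apache.hadoop.hive.ql.exec.UDF), and the check only illustrates the rule, it is not Hive's own method-resolution code.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class EvaluateRuleCheck {

    // Hypothetical stand-in for a well-formed UDF: evaluate returns a value.
    public static class GoodUdf {
        public String evaluate(String s) {
            return s == null ? null : s.trim();
        }
    }

    // Hypothetical stand-in for a broken UDF: evaluate is declared void.
    public static class BadUdf {
        public void evaluate(String s) {
            // nothing is returned, so this method breaks rule 3
        }
    }

    // Collects the public methods named "evaluate" whose return type is not void.
    public static List<Method> usableEvaluateMethods(Class<?> udfClass) {
        List<Method> found = new ArrayList<>();
        for (Method m : udfClass.getMethods()) {
            if (m.getName().equals("evaluate") && m.getReturnType() != void.class) {
                found.add(m);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("GoodUdf: " + usableEvaluateMethods(GoodUdf.class).size());
        System.out.println("BadUdf: " + usableEvaluateMethods(BadUdf.class).size());
    }
}
```

A class with no usable evaluate method, such as BadUdf here, yields an empty list, so a build step could refuse to package it.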

UDF (user-defined function)
Note: add jar /root/ry-dw-udf-1.0.0-SNAPSHOT.jar;
The jar path does not need to be quoted, but it must be the full path of the jar on the file system where the jar currently resides.

2. Create a Maven project

** Update the dependencies in pom.xml

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>0.13.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>0.13.1</version>
</dependency>

3. Write the UDF

Example: extract the required string from a URL:
http://cms.yhd.com/sale/IhSwTYNxnzS?tc=ad.0.0.15116-32638141.1&tp=1.1.708.0.3.LEHaQW1
Extract the string IhSwTYNxnzS that follows /sale; the string after sale is the sale ID.

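package com.myblue.myjava;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.exec.UDF;

// Extracts the sale ID that follows "sale/" in a URL.
public class GetSaleName extends UDF {
    // Compiled once, because evaluate() is called for every row.
    private static final Pattern SALE_ID = Pattern.compile("sale/([a-zA-Z0-9]+)");

    public String evaluate(String url) {
        // A NULL column value arrives as null; return null instead of throwing.
        if (url == null) {
            return null;
        }
        Matcher m = SALE_ID.matcher(url);
        // Note: the ID is lowercased, so IhSwTYNxnzS is returned as ihswtynxnzs.
        return m.find() ? m.group(1).toLowerCase() : null;
    }

    // Local smoke test that runs without Hive.
    public static void main(String[] args) {
        String url = "http://cms.yhd.com/sale/IhSwTYNxnzS?tc=ad.0.0.15116-32638141.1&tp=1.1.708.0.3.LEHaQW1";
        GetSaleName gs = new GetSaleName();
        System.out.println(gs.evaluate(url));
    }
}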
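The text asks for IhSwTYNxnzS, while the UDF above lowercases its result. As a sketch only, the same regex logic can be pulled into a plain Java helper that has no Hive dependency, which makes it easy to unit test and lets the caller choose whether to keep the original case. SaleIdExtractor, extract, and the lowerCase flag are hypothetical names introduced for this illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SaleIdExtractor {
    // Same pattern as the UDF, with the ID captured in group 1.
    private static final Pattern SALE_ID = Pattern.compile("sale/([a-zA-Z0-9]+)");

    // Returns the sale ID, or null when the URL is null or has no sale segment.
    public static String extract(String url, boolean lowerCase) {
        if (url == null) {
            return null;
        }
        Matcher m = SALE_ID.matcher(url);
        if (!m.find()) {
            return null;
        }
        String id = m.group(1);
        // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
        return lowerCase ? id.toLowerCase(Locale.ROOT) : id;
    }

    public static void main(String[] args) {
        String url = "http://cms.yhd.com/sale/IhSwTYNxnzS?tc=ad.0.0.15116-32638141.1&tp=1.1.708.0.3.LEHaQW1";
        System.out.println(extract(url, false));                   // IhSwTYNxnzS
        System.out.println(extract(url, true));                    // ihswtynxnzs
        System.out.println(extract("http://cms.yhd.com/", false)); // null
    }
}
```

A UDF's evaluate method could then delegate to such a helper, so the matching rules are tested once outside the cluster.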

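4. Use the UDF

-- 1. Add the jar to Hive:
hive> add jar /home/tom/myjars/lower.jar;
-- 2. Create a temporary function:
hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.myblue.myjava.Lower';
-- 3. Call the function:
hive> select my_lower(dname) from dept;
    ** Note: after Hive is restarted, the temporary function is no longer available. To use it again, add the jar and create the function again.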

Notes on UDFs:
A temporary UDF stops working as soon as you exit the current Hive session.
In general, we do not register UDFs as permanent functions.
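Because a temporary function disappears with the session, a client program has to repeat the registration every time it opens a new session. The sketch below only builds the two statements from the steps above; TempFunctionSetup and setupStatements are hypothetical names, and actually sending the statements to Hive is left out because it needs a running cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class TempFunctionSetup {

    // Builds the statements to run at the start of each new session.
    // The jar path is not quoted, matching the add jar example above.
    public static List<String> setupStatements(String jarPath, String functionName, String className) {
        Objects.requireNonNull(jarPath, "jarPath");
        Objects.requireNonNull(functionName, "functionName");
        Objects.requireNonNull(className, "className");
        return Arrays.asList(
                "ADD JAR " + jarPath,
                "CREATE TEMPORARY FUNCTION " + functionName + " AS '" + className + "'");
    }

    public static void main(String[] args) {
        // Values taken from the example session above.
        for (String sql : setupStatements("/home/tom/myjars/lower.jar", "my_lower", "com.myblue.myjava.Lower")) {
            System.out.println(sql);
        }
    }
}
```

In a real client, each string would typically be passed to java.sql.Statement.execute on a freshly opened Hive connection before any query that calls the function; whether ADD JAR is permitted depends on the cluster's security setup.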

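II. Using UDFs in a Sentry-enabled CDH cluster

1. Background

Prerequisites

1. Kerberos is enabled on the cluster
2. The Sentry service is installed on the cluster and working properly

Most enterprises that run CDH clusters enable the Sentry service to protect their data, and as a result UDFs that previously worked stop working. This article describes how to use custom UDFs in a Sentry environment.

2. Deploy the UDF JAR

1. Upload the built UDF JAR to the same directory on the servers that host HiveServer2 and the Metastore service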

[root@servernode101 ~]# mkdir -p /usr/lib/hive/udf
[root@servernode101 udf]# cp /usr/lib/hive-udf-jars/ry-dw-udf-1.0.0.jar   /usr/lib/hive/udf
[root@servernode101 udf]# ll  /usr/lib/hive/udf
-rwxr-xr-x 1 root root 9272 7月  14 15:48 ry-dw-udf-1.0.0.jar
[root@servernode101 hive]# chown -R hdfs:hdfs /usr/lib/hive/udf
[root@servernode101 hive]# ll /usr/lib/hive/udf
-rwxr-xr-x 1 hdfs hdfs 9272 7月  14 15:48 ry-dw-udf-1.0.0.jar

Note: the /usr/lib/hive/udf directory and the files in it are owned by hdfs; make sure the hdfs user can access them.

2. Upload the built UDF JAR to HDFS

[root@servernode101 udf]# hdfs dfs -ls /user/hive/lib/udf
-rwxrwxrwx   3 hdfs supergroup       9272 2018-07-12 14:36 /user/hive/lib/udf/ry-dw-udf-1.0.0.jar

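Note: /user/hive/lib/udf and the jar file must be owned by the hdfs user.

3. Hive configuration

1. Log in to the CM admin console and open the Hive service

Click Configuration, choose the advanced settings, and add the following to hive-site.xml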

<property>
    <name>hive.reloadable.aux.jars.path</name>
    <value>/usr/lib/hive/udf</value>
</property>

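Note: hive.reloadable.aux.jars.path points to the local /usr/lib/hive/udf directory.
10. Hive UDFs - Figure 1

2. Save the configuration, return to the CM home page, and restart the Hive service as prompted

10. Hive UDFs - Figure 2
10. Hive UDFs - Figure 3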

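4. Authorize the JAR file

1. Log in to the Hue console as the hdfs user to grant privileges

10. Hive UDFs - Figure 4

2. Open the Hive Tables management page and add the privilege to the hdfs role

10. Hive UDFs - Figure 5

5. Create a temporary function

Connect to HiveServer2 through beeline as the hive user to test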

[root@ip-172-31-22-86 ec2-user]# beeline 
Beeline version 1.1.0-cdh5.11.2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/;principal=hive/ip-172-31-22-86.ap-southeast-1.compute.internal@CLOUDERA.COM
scan complete in 2ms
...
0: jdbc:hive2://localhost:10000/> SELECT parse_date('2017-9-12 5:8:23', "yyyy-MM-dd HH:mm:ss")
. . . . . . . . . . . . . . . . > ;
Error: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'parse_date' (state=42000,code=10011)
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> 
0: jdbc:hive2://localhost:10000/> create temporary function parse_date as 'com.peach.date.DateUtils';
...
INFO  : OK
No rows affected (0.154 seconds)
0: jdbc:hive2://localhost:10000/> SELECT parse_date('2017-9-12 5:8:23', "yyyy-MM-dd HH:mm:ss")
. . . . . . . . . . . . . . . . > ;
...
INFO  : OK
+----------------------+--+
|         _c0          |
+----------------------+--+
| 2017-09-12 05:08:23  |
+----------------------+--+
1 row selected (0.229 seconds)
0: jdbc:hive2://localhost:10000/>
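The source of com.peach.date.DateUtils is not shown, so the following is only a guess at logic that would produce the output in the session above: parse the input with the given pattern, then format it again with the same pattern so that single-digit fields are zero-padded. ParseDateSketch and parseDate are hypothetical names, and the fixed UTC time zone is an assumption made to keep the result independent of the machine it runs on.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class ParseDateSketch {

    // Parses value with pattern and re-formats it with the same pattern.
    // Returns null for null input or for text that does not match the pattern.
    public static String parseDate(String value, String pattern) {
        if (value == null || pattern == null) {
            return null;
        }
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parsed = fmt.parse(value);
            return fmt.format(parsed);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Same arguments as the beeline query above.
        System.out.println(parseDate("2017-9-12 5:8:23", "yyyy-MM-dd HH:mm:ss"));
    }
}
```

SimpleDateFormat is used here because it accepts single-digit months, days, and hours when the fields are separated by literals. A new formatter is created per call because SimpleDateFormat is not thread-safe.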

6. Create a permanent function

Log in to Hue as the hdfs user and create a test function in the tpcds_text database

CREATE  FUNCTION decoding as 'com.raiyi.dw.udf.func.encry.EncryptionDecodingUDF'

10. Hive UDFs - Figure 6

7. Use Hive's custom UDF in Impala

[ip-172-31-26-80.ap-southeast-1.compute.internal:21000] > create function parse_date2(string, string) returns string location '/user/hive/udfjars/sql-udf-utils-1.0-SNAPSHOT.jar' symbol='com.peach.date.DateUtils';
Query: create function parse_date2(string, string) returns string location '/user/hive/udfjars/sql-udf-utils-1.0-SNAPSHOT.jar' symbol='com.peach.date.DateUtils'
[ip-172-31-26-80.ap-southeast-1.compute.internal:21000] > SELECT parse_date2('2017-9-12 5:8:23', "yyyy-MM-dd HH:mm:ss");
Query: select parse_date2('2017-9-12 5:8:23', "yyyy-MM-dd HH:mm:ss")
Query submitted at: 2017-11-01 08:58:54 (Coordinator: http://ip-172-31-26-80.ap-southeast-1.compute.internal:25000)
Query progress can be monitored at: http://ip-172-31-26-80.ap-southeast-1.compute.internal:25000/query_plan?query_id=154799fb3ae4df01:3032775a00000000
+-------------------------------------------------------------------+
| tpcds_text.parse_date2('2017-9-12 5:8:23', 'yyyy-mm-dd hh:mm:ss') |
+-------------------------------------------------------------------+
| 2017-09-12 05:08:23                                               |
+-------------------------------------------------------------------+
Fetched 1 row(s) in 0.03s
[ip-172-31-26-80.ap-southeast-1.compute.internal:21000] >

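Summary:

With Sentry enabled on the cluster, Hive cannot use USING JAR when creating a function, so the jar can only be loaded by configuring the hive.reloadable.aux.jars.path path.
A temporary function can only be used in the session that created it and is lost when that session closes. A temporary function created through Hue is still usable after logging out and back in, but it is lost when HiveServer2 restarts.
With Sentry enabled, Hive functions are created from local jars, so Impala cannot use Hive's functions directly; they have to be created again in the Impala shell.

Also note:
1. Hive

Once a user is granted the GRANT ALL ON URI privilege on the JAR file, that user can create functions in any database they have write access to (even without the GRANT ALL ON SERVER privilege).
Any user can DROP any function, regardless of their privileges. Even a user with no privileges on the database can drop a function in it by using the function's fully qualified name, for example:

DROP FUNCTION dbname.funcname
Any user can call an existing function, regardless of their privileges. Even a user with no privileges on the database can call it by using the function's fully qualified name, for example:

SELECT dbname.funcname()
2. Impala
Only users with the GRANT ALL ON SERVER privilege can CREATE/DROP functions.
Any user can call an existing function, regardless of their privileges. Even a user with no privileges on the database can call it by using the function's fully qualified name, for example:

SELECT dbname.funcname()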

References
1. How to use UDFs in a Sentry-enabled CDH cluster (如何在启用Sentry的CDH集群中使用UDF)