Apache Atlas Developer Guide (Chinese Edition)
12.3 Sqoop Hook
Apache Atlas Hook for Apache Sqoop
1. Sqoop Model
The Sqoop model includes the following types:
Entity types:
- sqoop_process
  - super-types: Process
  - attributes: qualifiedName, name, description, owner, inputs, outputs, operation, commandlineOpts, startTime, endTime, userName
- sqoop_dbdatastore
  - super-types: DataSet
  - attributes: qualifiedName, name, description, owner, dbStoreType, storeUse, storeUri, source
Enum types:
- sqoop_operation_type
  - values: IMPORT, EXPORT, EVAL
- sqoop_dbstore_usage
  - values: TABLE, QUERY, PROCEDURE, OTHER
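To make the model concrete, an entity of one of these types can be represented as a plain attribute map matching the lists above. The following is a hedged sketch: the type and attribute names come from the model, but the helper function and all sample values (owner, store type, URL) are made up for illustration and are not from the Atlas docs.

```python
# Sketch: build an attribute payload for a sqoop_dbdatastore entity.
# Type and attribute names follow the model above; the sample values
# (MySQL URL, owner, description) are hypothetical.
def make_sqoop_dbdatastore(name, store_uri, table):
    return {
        "typeName": "sqoop_dbdatastore",  # super-type: DataSet
        "attributes": {
            "qualifiedName": f"mysql --url {store_uri} --table {table}",
            "name": name,
            "description": "Sqoop DB data store (sample)",
            "owner": "etl",                # hypothetical owner
            "dbStoreType": "mysql",
            "storeUse": "TABLE",           # one of the sqoop_dbstore_usage values
            "storeUri": store_uri,
            "source": table,
        },
    }

entity = make_sqoop_dbdatastore("customers", "jdbc:mysql://db:3306/shop", "customers")
print(entity["typeName"])
```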
Sqoop entities are created and de-duplicated in Atlas using the unique attribute qualifiedName, whose value should be formatted as detailed below.
sqoop_process.qualifiedName: sqoop <operation> --connect <url> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>] --hive-<operation> --hive-database <databaseName> [--hive-table <tableName>] --hive-cluster <clusterName>
sqoop_dbdatastore.qualifiedName: <storeType> --url <storeUri> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>]
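The templates above can be sketched as simple string builders. This is an illustrative sketch, not Atlas code: the helper names are hypothetical, and only the leading, single-table (`--table`) variant of each template is covered; the `--database`/`--query` variants would follow the same pattern.

```python
# Sketch: compose qualifiedName values following the templates above
# (single-table case only; helper names are hypothetical).
def sqoop_process_qualified_name(operation, url, table):
    # Leading part of the template: sqoop <operation> --connect <url> --table <tableName>
    return f"sqoop {operation} --connect {url} --table {table}"

def sqoop_dbdatastore_qualified_name(store_type, store_uri, table):
    # <storeType> --url <storeUri> --table <tableName>
    return f"{store_type} --url {store_uri} --table {table}"

print(sqoop_process_qualified_name("import", "jdbc:mysql://db:3306/shop", "customers"))
print(sqoop_dbdatastore_qualified_name("mysql", "jdbc:mysql://db:3306/shop", "customers"))
```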
2. Sqoop Hook
Sqoop adds a SqoopJobDataPublisher that publishes data to Atlas after an import job completes. Today, SqoopHook only supports hiveImport; it uses the model described above to add entities in Atlas. Follow the steps below to enable the Atlas hook in Sqoop:
- Set up the Atlas hook in sqoop-site.xml (in the Sqoop conf directory) by setting the property sqoop.job.data.publish.class to org.apache.atlas.sqoop.hook.SqoopHook
- Untar apache-atlas-${project.version}-sqoop-hook.tar.gz
- cd apache-atlas-sqoop-hook-${project.version}
- Copy the entire contents of the folder apache-atlas-sqoop-hook-${project.version}/hook/sqoop to the hook/sqoop directory under the Atlas package
- Copy atlas-application.properties from the Atlas conf directory to the Sqoop conf directory
- Link the jars under hook/sqoop in the Atlas package into the Sqoop lib directory
The following properties in atlas-application.properties control the thread pool and notification details:
atlas.hook.sqoop.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Sqoop operation completion. Default: false
atlas.hook.sqoop.numRetries=3 # number of retries for notification failure. Default: 3
atlas.hook.sqoop.queueSize=10000 # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary
atlas.kafka.zookeeper.connect= # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20 # Zookeeper sync time. Default: 20
Additional configurations for the Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For the list of configurations supported by Kafka producers, see Kafka Producer Configs.
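The prefix convention can be illustrated with a small sketch: any property starting with "atlas.kafka." is passed through to the Kafka producer with the prefix stripped. The helper and the sample property set below are hypothetical; only the prefix rule itself comes from the text above.

```python
# Sketch: forward "atlas.kafka."-prefixed properties to the Kafka
# producer by stripping the prefix; other properties are not forwarded.
PREFIX = "atlas.kafka."

def kafka_producer_config(atlas_props):
    return {key[len(PREFIX):]: value
            for key, value in atlas_props.items()
            if key.startswith(PREFIX)}

props = {
    "atlas.kafka.zookeeper.connect": "localhost:2181",  # from the list above
    "atlas.kafka.acks": "1",                            # extra producer config (sample)
    "atlas.cluster.name": "primary",                    # not a Kafka property
}
print(kafka_producer_config(props))
# → {'zookeeper.connect': 'localhost:2181', 'acks': '1'}
```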
3. Note
Currently, the Sqoop hook captures only the following Sqoop operation:
- hiveImport