Commands
Top-level sqoop commands
$ bin/sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
The sqoop import subcommand
$ bin/sqoop help import    # or: bin/sqoop import --help
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                       Specify JDBC connect string
   --connection-manager <class-name>          Specify connection manager class name
   --connection-param-file <properties-file>  Specify connection parameters file
   --driver <class-name>                      Manually specify JDBC driver class to use
   --hadoop-home <hdir>                       Override $HADOOP_MAPRED_HOME_ARG
   --hadoop-mapred-home <dir>                 Override $HADOOP_MAPRED_HOME_ARG
   --help                                     Print usage instructions
-P                                            Read password from console
   --password <password>                      Set authentication password
   --password-alias <password-alias>          Credential provider password alias
   --password-file <password-file>            Set authentication password file path
   --relaxed-isolation                        Use read-uncommitted isolation for imports
   --skip-dist-cache                          Skip copying jars to distributed cache
   --username <username>                      Set authentication username
   --verbose                                  Print more information while working

Import control arguments:
   --append                      Imports data in append mode
                                 (each import's data is saved as an additional file)
   --as-avrodatafile             Imports data to Avro data files
   --as-parquetfile              Imports data to Parquet files
   --as-sequencefile             Imports data to SequenceFiles
   --as-textfile                 Imports data as plain text (default)
   --boundary-query <statement>  Set boundary query for retrieving max and min
                                 value of the primary key
   --columns <col,col,col...>    Columns to import from table
   --compression-codec <codec>   Compression codec to use for import
   --delete-target-dir           Imports data in delete mode
                                 (deletes the target directory before importing;
                                 --append and --delete-target-dir cannot be used together)
   --direct                      Use direct import fast path
   --direct-split-size <n>       Split the input stream every 'n' bytes when
                                 importing in direct mode
-e,--query <statement>           Import results of SQL 'statement'
   --fetch-size <n>              Set number 'n' of rows to fetch from the
                                 database when more rows are needed
   --inline-lob-limit <n>        Set the maximum size for an inline LOB
-m,--num-mappers <n>             Use 'n' map tasks to import in parallel
   --mapreduce-job-name <name>   Set name for generated mapreduce job
   --merge-key <column>          Key column to use to join results
                                 (the column used as the key when merging the table
                                 directory's part-m-00000, part-m-00001, ... files into one)
   --split-by <column-name>      Column of the table used to split work units
   --table <table-name>          Table to read
   --target-dir <dir>            HDFS plain table destination
                                 (the HDFS directory in which the table data is stored)
   --validate                    Validate the copy using the configured validator
   --validation-failurehandler <validation-failurehandler>
                                 Fully qualified class name for ValidationFailureHandler
   --validation-threshold <validation-threshold>
                                 Fully qualified class name for ValidationThreshold
   --validator <validator>       Fully qualified class name for the Validator
   --warehouse-dir <dir>         HDFS parent for table destination
   --where <where clause>        WHERE clause to use during import
-z,--compress                    Enable compression

Incremental import arguments:
   --check-column <column>       Source column to check for incremental change
   --incremental <import-type>   Define an incremental import of type 'append'
                                 or 'lastmodified'
                                 (append: new data is appended as additional files
                                 under the table directory)
                                 (lastmodified: requires --check-column to be of type
                                 timestamp or date, plus one extra import control
                                 argument: --append or --merge-key)
   --last-value <value>          Last imported value in the incremental check column
                                 (used as the lower bound of the range of values to import)

Output line formatting arguments:
   --enclosed-by <char>             Sets a required field enclosing character
   --escaped-by <char>              Sets the escape character
   --fields-terminated-by <char>    Sets the field separator character
   --lines-terminated-by <char>     Sets the end-of-line character
   --mysql-delimiters               Uses MySQL's default delimiter set:
                                    fields: ,  lines: \n  escaped-by: \
                                    optionally-enclosed-by: '
   --optionally-enclosed-by <char>  Sets a field enclosing character

Input parsing arguments:
   --input-enclosed-by <char>             Sets a required field encloser
   --input-escaped-by <char>              Sets the input escape character
   --input-fields-terminated-by <char>    Sets the input field separator
   --input-lines-terminated-by <char>     Sets the input end-of-line char
   --input-optionally-enclosed-by <char>  Sets a field enclosing character

Hive arguments:
   --create-hive-table                       Fail if the target hive table exists
   --hive-database <database-name>           Sets the database name to use when
                                             importing to hive
   --hive-delims-replacement <arg>           Replace Hive record \0x01 and row
                                             delimiters (\n\r) from imported string
                                             fields with user-defined string
   --hive-drop-import-delims                 Drop Hive record \0x01 and row
                                             delimiters (\n\r) from imported string
                                             fields (conflicts with
                                             --hive-delims-replacement)
   --hive-home <dir>                         Override $HIVE_HOME
   --hive-import                             Import tables into Hive (Uses Hive's
                                             default delimiters if none are set.)
   --hive-overwrite                          Overwrite existing data in the Hive table
   --hive-partition-key <partition-key>      Sets the partition key to use when
                                             importing to hive
   --hive-partition-value <partition-value>  Sets the partition value to use when
                                             importing to hive
   --hive-table <table-name>                 Sets the table name to use when
                                             importing to hive
   --map-column-hive <arg>                   Override mapping for specific column
                                             to hive types.

HBase arguments:
   --column-family <family>  Sets the target column family for the import
   --hbase-bulkload          Enables HBase bulk loading
   --hbase-create-table      If specified, create missing HBase tables
   --hbase-row-key <col>     Specifies which input column to use as the row key
   --hbase-table <table>     Import to <table> in HBase

HCatalog arguments:
   --hcatalog-database <arg>                      HCatalog database name
   --hcatalog-home <hdir>                         Override $HCAT_HOME
   --hcatalog-partition-keys <partition-key>      Sets the partition keys to use
                                                  when importing to hive
   --hcatalog-partition-values <partition-value>  Sets the partition values to use
                                                  when importing to hive
   --hcatalog-table <arg>                         HCatalog table name
   --hive-home <dir>                              Override $HIVE_HOME
   --hive-partition-key <partition-key>           Sets the partition key to use
                                                  when importing to hive
   --hive-partition-value <partition-value>       Sets the partition value to use
                                                  when importing to hive
   --map-column-hive <arg>                        Override mapping for specific
                                                  column to hive types.

HCatalog import specific options:
   --create-hcatalog-table          Create HCatalog before import
   --hcatalog-storage-stanza <arg>  HCatalog storage stanza for table creation

Accumulo arguments:
   --accumulo-batch-size <size>        Batch size in bytes
   --accumulo-column-family <family>   Sets the target column family for the import
   --accumulo-create-table             If specified, create missing Accumulo tables
   --accumulo-instance <instance>      Accumulo instance name.
   --accumulo-max-latency <latency>    Max write latency in milliseconds
   --accumulo-password <password>      Accumulo password.
   --accumulo-row-key <col>            Specifies which input column to use as the row key
   --accumulo-table <table>            Import to <table> in Accumulo
   --accumulo-user <user>              Accumulo user name.
   --accumulo-visibility <vis>         Visibility token to be applied to all rows imported
   --accumulo-zookeepers <zookeepers>  Comma-separated list of zookeepers (host:port)

Code generation arguments:
   --bindir <dir>                      Output directory for compiled objects
   --class-name <name>                 Sets the generated class name. This overrides
                                       --package-name. When combined with --jar-file,
                                       sets the input class.
   --input-null-non-string <null-str>  Input null non-string representation
   --input-null-string <null-str>      Input null string representation
   --jar-file <file>                   Disable code generation; use specified jar
   --map-column-java <arg>             Override mapping for specific columns to java types
   --null-non-string <null-str>        Null non-string representation
   --null-string <null-str>            Null string representation
   --outdir <dir>                      Output directory for generated code
   --package-name <name>               Put auto-generated classes in this package

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>                    specify an application configuration file
-D <property=value>                           use value for given property
-fs <local|namenode:port>                     specify a namenode
-jt <local|jobtracker:port>                   specify a job tracker
-files <comma separated list of files>        specify comma separated files to be
                                              copied to the map reduce cluster
-libjars <comma separated list of jars>       specify comma separated jar files to
                                              include in the classpath.
-archives <comma separated list of archives>  specify comma separated archives to be
                                              unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.
Analysis of the data import workflow
Importing into Hive is a staged process: the data is first imported into the user's HDFS directory (/user/jack/), then the table directory is created and the data is written into the Hive table.
Step 1: import the data of mysql.help_keyword into the default HDFS path
Step 2: automatically create a Hive table modeled on mysql.help_keyword, in the default `default` database
Step 3: load the data from the temporary directory into the Hive table
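The three steps above can be sketched with plain filesystem commands. This is a local simulation only: tmp directories stand in for HDFS, and the table name, paths, and sample rows are all hypothetical.

```shell
# Step 1 stand-in: mappers write comma-delimited part files into a per-user
# staging directory (Sqoop's default is /user/<user>/<table> on HDFS)
STAGING="$(mktemp -d)/help_keyword"
mkdir -p "$STAGING"
printf '1,MIN\n2,MAX\n' > "$STAGING/part-m-00000"   # hypothetical sample rows

# Step 2 stand-in: creating the Hive table amounts to creating its
# warehouse directory (plus metastore metadata, not modeled here)
WAREHOUSE="$(mktemp -d)/default.db/help_keyword"
mkdir -p "$WAREHOUSE"

# Step 3 stand-in: loading moves the staged files into the table directory,
# leaving the staging area empty (like Hive's LOAD DATA INPATH)
mv "$STAGING"/part-m-* "$WAREHOUSE"/

cat "$WAREHOUSE/part-m-00000"
```

If the job dies between step 1 and step 3, the staged files are left behind in the user directory, which is why a failed Hive import must be cleaned up by hand (see the notes at the end).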
Append-mode import (--append)
Import the data in two runs. With the --append flag, each run's data is saved as an additional file.
$ mysql -uroot -p123456
mysql> select * from user;
+----+-------+
| id | name  |
+----+-------+
|  1 | jack  |
|  2 | tom   |
|  3 | white |
|  4 | black |
+----+-------+

# Remove any existing data in the target directory
$ ~/Documents/hadoop/bin/hadoop fs -rm -r -f /user/sqoop/user

# First import
$ bin/sqoop import \
--connect jdbc:mysql://master:3306/mydb \
--username root \
--password 123456 \
--table user \
--append \
--target-dir /user/sqoop/user \
--mysql-delimiters \
--where 'id < 3' \
--m 1

# Second import
$ bin/sqoop import \
--connect jdbc:mysql://master:3306/mydb \
--username root \
--password 123456 \
--table user \
--append \
--target-dir /user/sqoop/user \
--mysql-delimiters \
--where 'id > 2 and id < 5' \
--m 1

# Check the result
$ ~/Documents/hadoop/bin/hadoop fs -ls /user/sqoop/user
Found 2 items
-rw-r--r--   1 jack supergroup   13 2020-05-18 22:52 /user/sqoop/user/part-m-00000
-rw-r--r--   1 jack supergroup   16 2020-05-18 22:53 /user/sqoop/user/part-m-00001
$ ~/Documents/hadoop/bin/hadoop fs -cat /user/sqoop/user/part-m-00000
1,jack
2,tom
$ ~/Documents/hadoop/bin/hadoop fs -cat /user/sqoop/user/part-m-00001
3,white
4,black
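The append behavior in the session above can be reproduced locally without a cluster. This is a minimal sketch: a tmp directory stands in for the HDFS target directory, `import_append` is a hypothetical helper, and the rows mirror the `--where` splits above.

```shell
# Local stand-in for --target-dir /user/sqoop/user
DIR="$(mktemp -d)/user"
mkdir -p "$DIR"

# Hypothetical helper: each "import" with --append writes a NEW
# part-m-NNNNN file next to the existing ones, never overwriting them
import_append() {    # $1 = rows selected by this run's --where clause
    n=$(ls "$DIR" | wc -l)
    printf '%s\n' "$1" > "$DIR/part-m-$(printf '%05d' "$n")"
}

import_append '1,jack
2,tom'              # first run:  --where 'id < 3'   -> part-m-00000
import_append '3,white
4,black'            # second run: --where 'id > 2 and id < 5' -> part-m-00001

ls "$DIR"
cat "$DIR"/part-m-*
```

Without --append, Sqoop refuses to write into an existing target directory, which is exactly why the transcript deletes /user/sqoop/user first.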
Incremental import by id (int)
--last-value must be set according to the actual data. When a time column (date, timestamp) or numeric column (such as id) is used as the incremental criterion, Sqoop goes strictly by the specified lower bound rather than by what was actually updated, which can cause rows to be imported twice. When a time-typed column is the criterion, rows whose timestamp is ahead of the machine's current time may be skipped.
--incremental selects the incremental mode. In append mode (--incremental append, or --incremental lastmodified with --append), the incremental data is saved as an additional file in the same directory; in merge mode (--incremental lastmodified with --merge-key), the specified column serves as the key and, after the incremental import, all files are merged into one.
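The duplicate-import pitfall described above is easy to see in isolation. A minimal sketch with hypothetical data: Sqoop's append mode effectively selects rows with `check-column > last-value`, so rerunning with a stale --last-value selects the same rows again.

```shell
# Hypothetical source table as comma-delimited rows: id,name
src=$(mktemp)
printf '1,jack\n2,tom\n3,white\n4,black\n' > "$src"

# Stand-in for the incremental selection Sqoop performs
# (roughly: WHERE id > <last-value>)
select_incr() {    # $1 = last-value
    awk -F, -v last="$1" '$1 > last' "$src"
}

select_incr 2    # first incremental run: picks 3,white and 4,black
select_incr 2    # rerun WITHOUT updating last-value: SAME rows again -> duplicates
select_incr 4    # after updating last-value to the max imported id: nothing new
```

This is why --last-value must be advanced after every run; Sqoop's saved-job mechanism (see "Sqoop jobs" below) does this bookkeeping automatically.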
Importing the first batch of data
$ mysql -uroot -p123456
mysql> select * from user;
+----+-------+
| id | name  |
+----+-------+
|  1 | jack  |
|  2 | tom   |
|  3 | white |
|  4 | black |
+----+-------+

# Import part of the data first
$ vi opts/mysql_import.opt
import
--connect
jdbc:mysql://master:3306/mydb
--username
root
--password
123456
--table
user
--delete-target-dir
--hive-import
--hive-overwrite
--mysql-delimiters
--where
'id < 3'
--m
1
# --delete-target-dir: if a previous Hive import failed, removes the data
# left behind in the temporary directory
# --mysql-delimiters: use MySQL's default delimiter set:
# [fields: ,] [lines: \n] [escaped-by: \] [optionally-enclosed-by: ']

# Run sqoop
$ bin/sqoop --options-file opts/mysql_import.opt

# Check the data
$ ~/Documents/hive/bin/hive
hive> select * from user;
OK
1 jack
2 tom
hive> dfs -cat /user/hive/warehouse/user/*;
1,jack
2,tom
Incremental import (append)
# Edit the sqoop options file for an incremental import
$ vi opts/mysql_import_incremental.opt
import
--connect
jdbc:mysql://master:3306/mydb
--username
root
--password
123456
--table
user
--m
1
--target-dir
/user/hive/warehouse/user
--mysql-delimiters
--incremental
append
--check-column
id
--last-value
2
# --hive-import cannot be combined with append mode, so --target-dir
# must be given explicitly
# incremental import, append mode: rows are selected according to --last-value

# Run sqoop
$ bin/sqoop --options-file opts/mysql_import_incremental.opt

# Check the data: a new file was appended
$ ~/Documents/hive/bin/hive
hive> dfs -ls /user/hive/warehouse/user;
Found 2 items
-rwxrwxr-x   1 jack supergroup   13 2020-05-14 18:30 /user/hive/warehouse/user/part-m-00000
-rw-r--r--   1 jack supergroup   37 2020-05-14 18:36 /user/hive/warehouse/user/part-m-00001
hive> dfs -cat /user/hive/warehouse/user/*;
1,jack
2,tom
3,white
4,black
hive> select * from user;
OK
1 jack
2 tom
3 white
4 black

# Running again without updating --last-value (still 2) gives the result below:
# the check column's lower bound is still 2, so the selected range is again
# 2..4 and rows 3 and 4 are imported a second time.
hive> select * from user;
OK
1 jack
2 tom
3 white
4 black
3 white
4 black
Incremental import (lastmodified)
mysql> CREATE TABLE `log` (
    -> `log_id` char(12) PRIMARY KEY,
    -> `create_time` timestamp,
    -> `content` varchar(255));
mysql> INSERT INTO `log` VALUES ('abcabcabcabc', '2020-05-15 12:56:28', 'aaaaaaa');
mysql> INSERT INTO `log` VALUES ('abcabcabcab1', '2020-05-20 12:56:32', 'bbbbbbb');
mysql> INSERT INTO `log` VALUES ('abcabcabcab2', '2020-05-30 12:56:37', 'ccccccc');

# Initialize the Hive table and data
$ vi opts/mysql_import_time.opt
import
--connect
jdbc:mysql://master:3306/mydb
--username
root
--password
123456
--table
log
--m
1
--delete-target-dir
--hive-import
--mysql-delimiters

# Run it, then check the data
$ bin/sqoop --options-file opts/mysql_import_time.opt
$ ~/Documents/hive/bin/hive
hive> select * from log;
OK
abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
Time taken: 0.078 seconds, Fetched: 3 row(s)
hive> dfs -ls /user/hive/warehouse/log;
Found 1 items
-rwxrwxr-x   1 jack supergroup   129 2020-05-14 22:51 /user/hive/warehouse/log/part-m-00000

# --hive-import cannot be combined with append mode, so --target-dir
# must be given explicitly
# incremental import, lastmodified type: a time-typed column decides what to import
# Omitting --append and --merge-key (which choose how the data is written) fails:
# --merge-key or --append is required when using --incremental lastmodified
# and the output directory exists.
# Error: Column type is neither timestamp nor date!
# Cause: the column given by --check-column must be of type timestamp or date.
# After switching the check column, change --last-value from 4 to a timestamp,
# e.g. "2020-01-02 22:20:38".
# Rows whose timestamp is ahead of the machine's current time are not read
# and appended. The machine's actual time here is 2020-05-14 22:52:28, so
# appending such future rows will not succeed.
# The initial import above is unaffected.

# append mode
# Insert new rows: in theory abcabcabcab3 is imported and abcabcabcab4 is not,
# because its timestamp is still in the future for this machine
# (inferred from behavior; check the source for the exact rule)
mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
    -> ('abcabcabcab3', '2020-05-07 13:44:38', 'dddddd');
mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
    -> ('abcabcabcab4', '2020-05-31 13:19:58', 'dddddddd');

# Edit the sqoop options file for an incremental import
$ vi opts/mysql_import_time_incremental.opt
import
--connect
jdbc:mysql://master:3306/mydb
--username
root
--password
123456
--table
log
--m
1
--target-dir
/user/hive/warehouse/log
--mysql-delimiters
--incremental
lastmodified
--check-column
create_time
--last-value
"2020-01-14 22:34:53.0"
--append
# Excludes what was already imported and selects the updated rows whose
# timestamp is not in the future; only abcabcabcab3 qualifies.

# Run sqoop
$ bin/sqoop --options-file opts/mysql_import_time_incremental.opt

# Check the data: a new file was appended
$ ~/Documents/hive/bin/hive
hive> dfs -ls /user/hive/warehouse/log;
Found 2 items
-rwxrwxr-x   1 jack supergroup   129 2020-05-14 22:51 /user/hive/warehouse/log/part-m-00000
-rw-r--r--   1 jack supergroup    42 2020-05-14 22:53 /user/hive/warehouse/log/part-m-00001
hive> select * from log;
OK
abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
abcabcabcab3 2020-05-07 13:44:38.0 dddddd

# merge-key mode
# In this mode all files are merged into one.
# Insert new rows: in theory both abcabcabcab5 and abcabcabcab6 are imported
mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
    -> ('abcabcabcab5', '2020-05-08 13:44:38', 'dddddd');
mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
    -> ('abcabcabcab6', '2020-05-09 13:19:58', 'dddddddd');

# Edit the sqoop options file for an incremental import
$ vi opts/mysql_import_time_incremental.opt
import
--connect
jdbc:mysql://master:3306/mydb
--username
root
--password
123456
--table
log
--m
1
--target-dir
/user/hive/warehouse/log
--mysql-delimiters
--incremental
lastmodified
--check-column
create_time
--last-value
"2020-01-02 22:20:38"
--merge-key
log_id

# Run sqoop
$ bin/sqoop --options-file opts/mysql_import_time_incremental.opt

# Check the data: the files were merged
$ ~/Documents/hive/bin/hive
hive> dfs -ls /user/hive/warehouse/log;
Found 2 items
-rw-r--r--   1 jack supergroup     0 2020-05-14 22:56 /user/hive/warehouse/log/_SUCCESS
-rw-r--r--   1 jack supergroup   257 2020-05-14 22:56 /user/hive/warehouse/log/part-r-00000
hive> select * from log;
OK
abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
abcabcabcab3 2020-05-07 13:44:38.0 dddddd
abcabcabcab5 2020-05-08 13:44:38.0 dddddd
abcabcabcab6 2020-05-09 13:19:58.0 dddddddd
abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa

# Duplicate import
# Run again in append mode (change --merge-key back to --append):
hive> dfs -ls /user/hive/warehouse/log;
Found 3 items
-rw-r--r--   1 jack supergroup     0 2020-05-14 22:56 /user/hive/warehouse/log/_SUCCESS
-rw-r--r--   1 jack supergroup   128 2020-05-14 23:17 /user/hive/warehouse/log/part-m-00001
-rw-r--r--   1 jack supergroup   257 2020-05-14 22:56 /user/hive/warehouse/log/part-r-00000
hive> dfs -cat /user/hive/warehouse/log/*;
abcabcabcab3,2020-05-07 13:44:38.0,dddddd
abcabcabcab5,2020-05-08 13:44:38.0,dddddd
abcabcabcab6,2020-05-09 13:19:58.0,dddddddd
abcabcabcab1,2020-05-20 12:56:32.0,bbbbbbb
abcabcabcab2,2020-05-30 12:56:37.0,ccccccc
abcabcabcab3,2020-05-07 13:44:38.0,dddddd
abcabcabcab5,2020-05-08 13:44:38.0,dddddd
abcabcabcab6,2020-05-09 13:19:58.0,dddddddd
abcabcabcabc,2020-05-15 12:56:28.0,aaaaaaa

# Merge and deduplicate
# Run again in merge mode (change --append back to --merge-key):
hive> select * from log;
OK
abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
abcabcabcab3 2020-05-07 13:44:38.0 dddddd
abcabcabcab5 2020-05-08 13:44:38.0 dddddd
abcabcabcab6 2020-05-09 13:19:58.0 dddddddd
abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
Time taken: 0.055 seconds, Fetched: 6 row(s)
hive> dfs -ls /user/hive/warehouse/log;
Found 2 items
-rw-r--r--   1 jack supergroup     0 2020-05-14 23:23 /user/hive/warehouse/log/_SUCCESS
-rw-r--r--   1 jack supergroup   257 2020-05-14 23:23 /user/hive/warehouse/log/part-r-00000
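The merge step can be approximated locally. A minimal sketch with hypothetical keys and timestamps: concatenate every part file, keep one record per merge key (the newest `create_time` wins), and write a single merged `part-r-00000`, which is roughly what the merge job in the session above produces.

```shell
# Local stand-in for the table directory, with two "imports" worth of data;
# k1 appears in both files with different timestamps (hypothetical rows)
DIR=$(mktemp -d)
printf 'k1,2020-05-15 12:00:00,old\nk2,2020-05-16 12:00:00,b\n' > "$DIR/part-m-00000"
printf 'k1,2020-05-20 12:00:00,new\nk3,2020-05-21 12:00:00,c\n' > "$DIR/part-m-00001"

# Merge on column 1 (the key): for each key keep the record with the
# latest timestamp in column 2, then emit one sorted output file
cat "$DIR"/part-m-* |
  awk -F, '$2 >= t[$1] { t[$1] = $2; r[$1] = $0 } END { for (k in r) print r[k] }' |
  sort > "$DIR/part-r-00000"
rm "$DIR"/part-m-*

cat "$DIR/part-r-00000"   # k1 keeps only the newer record; k2 and k3 pass through
```

This mirrors why the final `select * from log` in the session shows six deduplicated rows even after a duplicate append: the reducer output replaces all the per-mapper part files.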
Sqoop jobs
Creating a job
Before an incremental import into Hive, either create the Hive table beforehand, or import part of the data first so that the Hive table is created automatically.
# Initialize the data
$ mysql -uroot -p123456
DROP TABLE IF EXISTS `user`;
CREATE TABLE `user` (
`id` int(11) NOT NULL,
`name` varchar(20) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `user` VALUES ('1', 'jack');
INSERT INTO `user` VALUES ('2', 'tom');
INSERT INTO `user` VALUES ('3', 'white');
INSERT INTO `user` VALUES ('4', 'black');

$ bin/sqoop job \
--create user_import_incr \
--meta-connect jdbc:mysql://master:3306/sqoop \
-- import \
--connect jdbc:mysql://node03:3306/userdb \
--username root --password 123456 \
--table emp \
--incremental append \
--check-column id \
--last-value 1202 \
--target-dir /sqoop/increment/emp \
-m 1
Notes
Loading data into Hive fails when the Hive metastore service is not running
# Start the Hive metastore service (only needed when a metastore service is configured)
$ ~/Documents/hive/bin/hive --service metastore

# If the metastore service was not running, the import fails;
# remove the temporary directory left over from step 1
$ ~/Documents/hadoop/bin/hdfs dfs -rm -r -f /user/jack/user

# Then manually load the temporary data files into the Hive table
