Commands

Top-level sqoop commands

  1. $ bin/sqoop help
  2. usage: sqoop COMMAND [ARGS]
  3. Available commands:
  4. codegen Generate code to interact with database records
  5. create-hive-table Import a table definition into Hive
  6. eval Evaluate a SQL statement and display the results
  7. export Export an HDFS directory to a database table
  8. help List available commands
  9. import Import a table from a database to HDFS
  10. import-all-tables Import tables from a database to HDFS
  11. import-mainframe Import datasets from a mainframe server to HDFS
  12. job Work with saved jobs
  13. list-databases List available databases on a server
  14. list-tables List available tables in a database
  15. merge Merge results of incremental imports
  16. metastore Run a standalone Sqoop metastore
  17. version Display version information

The sqoop import sub-command

  1. $ bin/sqoop help import # or: import --help
  2. usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
  3. Common arguments:
  4. --connect <jdbc-uri> Specify JDBC connect string
  5. --connection-manager <class-name> Specify connection manager class name
  6. --connection-param-file <properties-file> Specify connection parameters file
  7. --driver <class-name> Manually specify JDBC driver class to use
  8. --hadoop-home <hdir> Override $HADOOP_MAPRED_HOME_ARG
  9. --hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME_ARG
  10. --help Print usage instructions
  11. -P Read password from console
  12. --password <password> Set authentication password
  13. --password-alias <password-alias> Credential provider password alias
  14. --password-file <password-file> Set authentication password file path
  15. --relaxed-isolation Use read-uncommitted isolation for imports
  16. --skip-dist-cache Skip copying jars to distributed cache
  17. --username <username> Set authentication username
  18. --verbose Print more information while working
  19. Import control arguments:
  20. --append Imports data in append mode
  21. <append: each import's data is saved as an additional file>
  22. --as-avrodatafile Imports data to Avro data files
  23. --as-parquetfile Imports data to Parquet files
  24. --as-sequencefile Imports data to SequenceFiles
  25. --as-textfile Imports data as plain text (default)
  26. --boundary-query <statement> Set boundary query for retrieving max and min value of the primary key
  27. --columns <col,col,col...> Columns to import from table
  28. --compression-codec <codec> Compression codec to use for import
  29. --delete-target-dir Imports data in delete mode
  30. <delete mode: the target directory is deleted before importing; "--append" and "--delete-target-dir" cannot be used together>
  31. --direct Use direct import fast path
  32. --direct-split-size <n> Split the input stream every 'n' bytes when importing in direct mode
  33. -e,--query <statement> Import results of SQL 'statement'
  34. --fetch-size <n> Set number 'n' of rows to fetch from the database when more rows are needed
  35. --inline-lob-limit <n> Set the maximum size for an inline LOB
  36. -m,--num-mappers <n> Use 'n' map tasks to import in parallel
  37. --mapreduce-job-name <name> Set name for generated mapreduce job
  38. --merge-key <column> Key column to use to join results
  39. <designates a column as the join key; merges the data files under the table directory (part-m-00000, part-m-00001, etc.) into one>
  40. --split-by <column-name> Column of the table used to split work units
  41. --table <table-name> Table to read
  42. --target-dir <dir> HDFS plain table destination
  43. <directory in HDFS where the table data is stored on import>
  44. --validate Validate the copy using the configured validator
  45. --validation-failurehandler <validation-failurehandler> Fully qualified class name for ValidationFailureHandler
  46. --validation-threshold <validation-threshold> Fully qualified class name for ValidationThreshold
  47. --validator <validator> Fully qualified class name for the Validator
  48. --warehouse-dir <dir> HDFS parent for table destination
  49. --where <where clause> WHERE clause to use during import
  50. -z,--compress Enable compression
  51. Incremental import arguments:
  52. --check-column <column> Source column to check for incremental change
  53. --incremental <import-type> Define an incremental import of type 'append' or 'lastmodified'
  54. <incremental import type>
  55. <append: new data is placed under the table directory as additional files>
  56. <lastmodified: requires a few conditions>
  57. <the column given to --check-column must be of type timestamp or date>
  58. <an extra import control argument is required: --append or --merge-key>
  59. --last-value <value> Last imported value in the incremental check column
  60. <the boundary value for the incremental scan; used as the lower bound>
  61. Output line formatting arguments:
  62. --enclosed-by <char> Sets a required field enclosing character
  63. --escaped-by <char> Sets the escape character
  64. --fields-terminated-by <char> Sets the field separator character
  65. --lines-terminated-by <char> Sets the end-of-line character
  66. --mysql-delimiters Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
  67. --optionally-enclosed-by <char> Sets a field enclosing character
  68. Input parsing arguments:
  69. --input-enclosed-by <char> Sets a required field encloser
  70. --input-escaped-by <char> Sets the input escape character
  71. --input-fields-terminated-by <char> Sets the input field separator
  72. --input-lines-terminated-by <char> Sets the input end-of-line char
  73. --input-optionally-enclosed-by <char> Sets a field enclosing character
  74. Hive arguments:
  75. --create-hive-table Fail if the target hive table exists
  76. --hive-database <database-name> Sets the database name to use when importing to hive
  77. --hive-delims-replacement <arg> Replace Hive record \0x01 and row delimiters (\n\r) from imported string fields with user-defined string
  78. <replaces the delimiter "\0x01" and the row delimiters "\n\r" in imported string fields>
  79. --hive-drop-import-delims Drop Hive record \0x01 and row delimiters (\n\r) from imported string fields
  80. <conflicts with --hive-delims-replacement; strips "\0x01" and "\n\r" instead of replacing them>
  81. --hive-home <dir> Override $HIVE_HOME
  82. --hive-import Import tables into Hive (Uses Hive's default delimiters if none are set.)
  83. --hive-overwrite Overwrite existing data in the Hive table
  84. --hive-partition-key <partition-key> Sets the partition key to use when importing to hive
  85. --hive-partition-value <partition-value> Sets the partition value to use when importing to hive
  86. --hive-table <table-name> Sets the table name to use when importing to hive
  87. --map-column-hive <arg> Override mapping for specific column to hive types.
  88. HBase arguments:
  89. --column-family <family> Sets the target column family for the import
  90. --hbase-bulkload Enables HBase bulk loading
  91. --hbase-create-table If specified, create missing HBase tables
  92. --hbase-row-key <col> Specifies which input column to use as the row key
  93. --hbase-table <table> Import to <table> in HBase
  94. HCatalog arguments:
  95. --hcatalog-database <arg> HCatalog database name
  96. --hcatalog-home <hdir> Override $HCAT_HOME
  97. --hcatalog-partition-keys <partition-key> Sets the partition keys to use when importing to hive
  98. --hcatalog-partition-values <partition-value> Sets the partition values to use when importing to hive
  99. --hcatalog-table <arg> HCatalog table name
  100. --hive-home <dir> Override $HIVE_HOME
  101. --hive-partition-key <partition-key> Sets the partition key to use when importing to hive
  102. --hive-partition-value <partition-value> Sets the partition value to use when importing to hive
  103. --map-column-hive <arg> Override mapping for specific column to hive types.
  104. HCatalog import specific options:
  105. --create-hcatalog-table Create HCatalog before import
  106. --hcatalog-storage-stanza <arg> HCatalog storage stanza for table creation
  107. Accumulo arguments:
  108. --accumulo-batch-size <size> Batch size in bytes
  109. --accumulo-column-family <family> Sets the target column family for the import
  110. --accumulo-create-table If specified, create missing Accumulo tables
  111. --accumulo-instance <instance> Accumulo instance name.
  112. --accumulo-max-latency <latency> Max write latency in milliseconds
  113. --accumulo-password <password> Accumulo password.
  114. --accumulo-row-key <col> Specifies which input column to use as the row key
  115. --accumulo-table <table> Import to <table> in Accumulo
  116. --accumulo-user <user> Accumulo user name.
  117. --accumulo-visibility <vis> Visibility token to be applied to all rows imported
  118. --accumulo-zookeepers <zookeepers> Comma-separated list of zookeepers (host:port)
  119. Code generation arguments:
  120. --bindir <dir> Output directory for compiled objects
  121. --class-name <name> Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
  122. --input-null-non-string <null-str> Input null non-string representation
  123. --input-null-string <null-str> Input null string representation
  124. --jar-file <file> Disable code generation; use specified jar
  125. --map-column-java <arg> Override mapping for specific columns to java types
  126. --null-non-string <null-str> Null non-string representation
  127. --null-string <null-str> Null string representation
  128. --outdir <dir> Output directory for generated code
  129. --package-name <name> Put auto-generated classes in this package
  130. Generic Hadoop command-line arguments:
  131. (must preceed any tool-specific arguments)
  132. Generic options supported are
  133. -conf <configuration file> specify an application configuration file
  134. -D <property=value> use value for given property
  135. -fs <local|namenode:port> specify a namenode
  136. -jt <local|jobtracker:port> specify a job tracker
  137. -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
  138. -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
  139. -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
  140. The general command line syntax is
  141. bin/hadoop command [genericOptions] [commandOptions]
  142. At minimum, you must specify --connect and --table
  143. Arguments to mysqldump and other subprograms may be supplied after a '--' on the command line.

Import workflow analysis

Importing into Hive happens in stages: the data is first imported under the user's home directory (/user/jack/), then the table directory is created and the data is written into the Hive table.

Step 1: import the data of mysql.help_keyword to the default path on HDFS
Step 2: automatically create a Hive table modeled on mysql.help_keyword, in the default database
Step 3: move the data from the temporary directory into the Hive table

Database connection
Import directory
Delimiters
Import mode
Number of map tasks
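These five elements are essentially all it takes to assemble an import command. As a minimal sketch (host, database, and paths are hypothetical placeholders), they map onto sqoop flags like this:

```python
# Sketch: how the five elements above map onto sqoop import flags.
# The connection string, directory, and table are made-up examples.
def build_import_command(connect, target_dir, delimiters_flag, mode_flags, num_mappers):
    args = ["bin/sqoop", "import",
            "--connect", connect,           # database connection
            "--target-dir", target_dir]     # import directory
    args.append(delimiters_flag)            # delimiters
    args.extend(mode_flags)                 # import mode (e.g. append)
    args.extend(["--m", str(num_mappers)])  # number of map tasks
    return " ".join(args)

cmd = build_import_command(
    "jdbc:mysql://master:3306/mydb",
    "/user/sqoop/user",
    "--mysql-delimiters",
    ["--append"],
    1,
)
print(cmd)
```

The option files used later in this document list exactly the same flags, one token per line.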

Append-mode import (--append)

Import the data twice. With the --append flag, each import's data file is saved alongside the existing files.

  1. $ mysql -uroot -p123456
  2. mysql> select * from user;
  3. +----+-------+
  4. | id | name |
  5. +----+-------+
  6. | 1 | jack |
  7. | 2 | tom |
  8. | 3 | white |
  9. | 4 | black |
  10. +----+-------+
  11. # remove any existing data in the target directory
  12. $ ~/Documents/hadoop/bin/hadoop fs -rm -r -f /user/sqoop/user
  13. # first import
  14. $ bin/sqoop import \
  15. --connect jdbc:mysql://master:3306/mydb \
  16. --username root \
  17. --password 123456 \
  18. --table user \
  19. --append \
  20. --target-dir /user/sqoop/user \
  21. --mysql-delimiters \
  22. --where 'id < 3' \
  23. --m 1
  24. # second import
  25. $ bin/sqoop import \
  26. --connect jdbc:mysql://master:3306/mydb \
  27. --username root \
  28. --password 123456 \
  29. --table user \
  30. --append \
  31. --target-dir /user/sqoop/user \
  32. --mysql-delimiters \
  33. --where 'id > 2 and id < 5' \
  34. --m 1
  35. # inspect the results
  36. $ ~/Documents/hadoop/bin/hadoop fs -ls /user/sqoop/user
  37. Found 2 items
  38. -rw-r--r-- 1 jack supergroup 13 2020-05-18 22:52 /user/sqoop/user/part-m-00000
  39. -rw-r--r-- 1 jack supergroup 16 2020-05-18 22:53 /user/sqoop/user/part-m-00001
  40. $ ~/Documents/hadoop/bin/hadoop fs -cat /user/sqoop/user/part-m-00000
  41. 1,jack
  42. 2,tom
  43. $ ~/Documents/hadoop/bin/hadoop fs -cat /user/sqoop/user/part-m-00001
  44. 3,white
  45. 4,black

Incremental import on an id (int) column

--last-value must be updated according to the actual data. When a time (date, timestamp) or numeric (id) column is used as the incremental criterion, rows are selected relative to the given lower bound rather than what was actually imported, so a stale value can cause duplicate imports. When a time column is used, rows whose timestamp is ahead of the machine's current time (time zone included) may be skipped.
--incremental selects the incremental mode. In append mode (--incremental append, or --incremental lastmodified with --append), the incremental data is saved as an additional file in the same location; in merge mode (--incremental lastmodified with --merge-key), the given column serves as the key, and after the incremental import all files are merged into one.
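The duplicate-import pitfall can be sketched in a few lines. This is only a simulation of the selection rule (check-column value greater than --last-value) with made-up rows, not sqoop's actual implementation:

```python
# Simulated source table: (id, name) rows, with id as the check column.
rows = [(1, "jack"), (2, "tom"), (3, "white"), (4, "black")]

def incremental_append(imported, last_value):
    # sqoop selects rows whose check-column value is > --last-value
    new_rows = [r for r in rows if r[0] > last_value]
    # append mode: new rows land in an extra file; nothing is deduplicated
    return imported + new_rows

part0 = incremental_append([], 2)     # first run with --last-value 2: rows 3 and 4
part1 = incremental_append(part0, 2)  # rerun with a stale --last-value: rows 3 and 4 again
print(part1)  # [(3, 'white'), (4, 'black'), (3, 'white'), (4, 'black')]
```

Because the lower bound never advanced, the second run re-selects the same rows and appends them, which is exactly why --last-value must track the real data.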

Import the first batch of data

  1. $ mysql -uroot -p123456
  2. mysql> select * from user;
  3. +----+-------+
  4. | id | name |
  5. +----+-------+
  6. | 1 | jack |
  7. | 2 | tom |
  8. | 3 | white |
  9. | 4 | black |
  10. +----+-------+
  11. # import part of the data first, as preparation
  12. $ vi opts/mysql_import.opt
  13. import
  14. --connect
  15. jdbc:mysql://master:3306/mydb
  16. --username
  17. root
  18. --password
  19. 123456
  20. --table
  21. user
  22. --delete-target-dir
  23. --hive-import
  24. --hive-overwrite
  25. --mysql-delimiters
  26. --where
  27. 'id < 3'
  28. --m
  29. 1
  30. # --delete-target-dir: removes the data written to the temporary directory by a failed hive import
  31. # --mysql-delimiters: uses MySQL's default delimiter set:
  32. # [fields: ,] [lines: \n] [escaped-by: \] [optionally-enclosed-by: ']
  33. #
  34. # run sqoop
  35. $ bin/sqoop --options-file opts/mysql_import.opt
  36. # inspect the data
  37. $ ~/Documents/hive/bin/hive
  38. hive> select * from user;
  39. OK
  40. 1 jack
  41. 2 tom
  42. hive> dfs -cat /user/hive/warehouse/user/*;
  43. 1,jack
  44. 2,tom

Incremental import (append)

  1. # edit the sqoop options file for an incremental import
  2. $ vi opts/mysql_import_incremental.opt
  3. import
  4. --connect
  5. jdbc:mysql://master:3306/mydb
  6. --username
  7. root
  8. --password
  9. 123456
  10. --table
  11. user
  12. --m
  13. 1
  14. --target-dir
  15. /user/hive/warehouse/user
  16. --mysql-delimiters
  17. --incremental
  18. append
  19. --check-column
  20. id
  21. --last-value
  22. 2
  23. # --hive-import cannot be combined with append, so --target-dir must be specified
  24. # incremental import, append mode: rows are selected according to --last-value
  25. # run sqoop
  26. $ bin/sqoop --options-file opts/mysql_import_incremental.opt
  27. # inspect the data; a file was appended
  28. $ ~/Documents/hive/bin/hive
  29. hive> dfs -ls /user/hive/warehouse/user;
  30. Found 2 items
  31. -rwxrwxr-x 1 jack supergroup 13 2020-05-14 18:30 /user/hive/warehouse/user/part-m-00000
  32. -rw-r--r-- 1 jack supergroup 37 2020-05-14 18:36 /user/hive/warehouse/user/part-m-00001
  33. hive> dfs -cat /user/hive/warehouse/user/*;
  34. 1,jack
  35. 2,tom
  36. 3,white
  37. 4,black
  38. hive> select * from user;
  39. OK
  40. 1 jack
  41. 2 tom
  42. 3 white
  43. 4 black
  44. # run again without changing --last-value (still 2); the result is shown below
  45. # the check column's lower bound is still 2, so the 2~4 range is scanned again and rows 3 and 4 are imported twice.
  46. hive> select * from user;
  47. OK
  48. 1 jack
  49. 2 tom
  50. 3 white
  51. 4 black
  52. 3 white
  53. 4 black

Incremental import (lastmodified)

  1. mysql> CREATE TABLE `log` (
  2. -> `log_id` char(12) PRIMARY KEY,
  3. -> `create_time` timestamp,
  4. -> `content` varchar(255))
  5. mysql> INSERT INTO `log` VALUES ('abcabcabcabc', '2020-05-15 12:56:28', 'aaaaaaa');
  6. mysql> INSERT INTO `log` VALUES ('abcabcabcab1', '2020-05-20 12:56:32', 'bbbbbbb');
  7. mysql> INSERT INTO `log` VALUES ('abcabcabcab2', '2020-05-30 12:56:37', 'ccccccc');
  8. # initialize the hive table and data
  9. $ vi opts/mysql_import_time.opt
  10. import
  11. --connect
  12. jdbc:mysql://master:3306/mydb
  13. --username
  14. root
  15. --password
  16. 123456
  17. --table
  18. log
  19. --m
  20. 1
  21. --delete-target-dir
  22. --hive-import
  23. --mysql-delimiters
  24. # run it, then inspect the data
  25. $ bin/sqoop --options-file opts/mysql_import_time.opt
  26. $ ~/Documents/hive/bin/hive
  27. hive> select * from log;
  28. OK
  29. abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
  30. abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
  31. abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
  32. Time taken: 0.078 seconds, Fetched: 3 row(s)
  33. hive> dfs -ls /user/hive/warehouse/log;
  34. Found 1 items
  35. -rwxrwxr-x 1 jack supergroup 129 2020-05-14 22:51 /user/hive/warehouse/log/part-m-00000
  36. # --hive-import cannot be combined with append, so --target-dir must be specified
  37. # incremental import mode,
  38. # lastmodified type: a "time"-typed column is compared to find incremental rows
  39. # --append or --merge-key must also be given, to specify how the data is written
  40. # --append: append (files); new data is added as another part-m-00000-style file
  41. # --merge-key: merge (files); new data is merged with the existing part-m-00000-style files
  42. # --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
  43. # error: Column type is neither timestamp nor date!
  44. # cause: the column given to --check-column must be of type timestamp or date
  45. # wrong column type, so change --last-value: 4 -> "2020-01-02 22:20:38"
  46. # rows whose timestamp is beyond the machine's current time are not read or appended;
  47. # the machine's actual time is 2020-05-14 22:52:28, so rows dated after that will not be imported
  48. # the data from the initial import is unaffected.
  49. # append mode
  50. # insert new rows: in theory abcabcabcab3 is imported and abcabcabcab4 is not, because its time has not yet arrived on this machine (check the source for details; I have not, this is a guess)
  51. mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
  52. -> ('abcabcabcab3', '2020-05-07 13:44:38', 'dddddd');
  53. mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
  54. -> ('abcabcabcab4', '2020-05-31 13:19:58', 'dddddddd');
  55. # edit the sqoop options file for an incremental import
  56. $ vi opts/mysql_import_time_incremental.opt
  57. import
  58. --connect
  59. jdbc:mysql://master:3306/mydb
  60. --username
  61. root
  62. --password
  63. 123456
  64. --table
  65. log
  66. --m
  67. 1
  68. --target-dir
  69. /user/hive/warehouse/log
  70. --mysql-delimiters
  71. --incremental
  72. lastmodified
  73. --check-column
  74. create_time
  75. --last-value
  76. "2020-01-14 22:34:53.0"
  77. --append
  78. # excluding rows already present, the only updated and valid row (timestamp <= the actual time) is abcabcabcab3.
  79. # run sqoop
  80. $ bin/sqoop --options-file opts/mysql_import_time_incremental.opt
  81. # inspect the data; a file was appended
  82. $ ~/Documents/hive/bin/hive
  83. hive> dfs -ls /user/hive/warehouse/log;
  84. Found 2 items
  85. -rwxrwxr-x 1 jack supergroup 129 2020-05-14 22:51 /user/hive/warehouse/log/part-m-00000
  86. -rw-r--r-- 1 jack supergroup 42 2020-05-14 22:53 /user/hive/warehouse/log/part-m-00001
  87. hive> select * from log;
  88. OK
  89. abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
  90. abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
  91. abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
  92. abcabcabcab3 2020-05-07 13:44:38.0 dddddd
  93. # merge-key mode
  94. # in this mode, all files are merged into a single file.
  95. # insert new rows: in theory abcabcabcab5 and abcabcabcab6 are both imported
  96. mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
  97. -> ('abcabcabcab5', '2020-05-08 13:44:38', 'dddddd');
  98. mysql> INSERT INTO `log` (`log_id`, `create_time`, `content`) VALUES
  99. -> ('abcabcabcab6', '2020-05-09 13:19:58', 'dddddddd');
  100. # edit the sqoop options file for an incremental import
  101. $ vi opts/mysql_import_time_incremental.opt
  102. import
  103. --connect
  104. jdbc:mysql://master:3306/mydb
  105. --username
  106. root
  107. --password
  108. 123456
  109. --table
  110. log
  111. --m
  112. 1
  113. --target-dir
  114. /user/hive/warehouse/log
  115. --mysql-delimiters
  116. --incremental
  117. lastmodified
  118. --check-column
  119. create_time
  120. --last-value
  121. "2020-01-02 22:20:38"
  122. --merge-key
  123. log_id
  124. # run sqoop
  125. $ bin/sqoop --options-file opts/mysql_import_time_incremental.opt
  126. # inspect the data; the files were merged
  127. $ ~/Documents/hive/bin/hive
  128. hive> dfs -ls /user/hive/warehouse/log;
  129. Found 2 items
  130. -rw-r--r-- 1 jack supergroup 0 2020-05-14 22:56 /user/hive/warehouse/log/_SUCCESS
  131. -rw-r--r-- 1 jack supergroup 257 2020-05-14 22:56 /user/hive/warehouse/log/part-r-00000
  132. hive> select * from log;
  133. OK
  134. abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
  135. abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
  136. abcabcabcab3 2020-05-07 13:44:38.0 dddddd
  137. abcabcabcab5 2020-05-08 13:44:38.0 dddddd
  138. abcabcabcab6 2020-05-09 13:19:58.0 dddddddd
  139. abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
  140. # duplicate import
  141. # import again in append mode, changing --merge-key to --append
  142. hive> dfs -ls /user/hive/warehouse/log;
  143. Found 3 items
  144. -rw-r--r-- 1 jack supergroup 0 2020-05-14 22:56 /user/hive/warehouse/log/_SUCCESS
  145. -rw-r--r-- 1 jack supergroup 128 2020-05-14 23:17 /user/hive/warehouse/log/part-m-00001
  146. -rw-r--r-- 1 jack supergroup 257 2020-05-14 22:56 /user/hive/warehouse/log/part-r-00000
  147. hive> dfs -cat /user/hive/warehouse/log/*;
  148. abcabcabcab3,2020-05-07 13:44:38.0,dddddd
  149. abcabcabcab5,2020-05-08 13:44:38.0,dddddd
  150. abcabcabcab6,2020-05-09 13:19:58.0,dddddddd
  151. abcabcabcab1,2020-05-20 12:56:32.0,bbbbbbb
  152. abcabcabcab2,2020-05-30 12:56:37.0,ccccccc
  153. abcabcabcab3,2020-05-07 13:44:38.0,dddddd
  154. abcabcabcab5,2020-05-08 13:44:38.0,dddddd
  155. abcabcabcab6,2020-05-09 13:19:58.0,dddddddd
  156. abcabcabcabc,2020-05-15 12:56:28.0,aaaaaaa
  157. # merge, removing duplicates
  158. # import again in merge mode, changing --append back to --merge-key
  159. hive> select * from log;
  160. OK
  161. abcabcabcab1 2020-05-20 12:56:32.0 bbbbbbb
  162. abcabcabcab2 2020-05-30 12:56:37.0 ccccccc
  163. abcabcabcab3 2020-05-07 13:44:38.0 dddddd
  164. abcabcabcab5 2020-05-08 13:44:38.0 dddddd
  165. abcabcabcab6 2020-05-09 13:19:58.0 dddddddd
  166. abcabcabcabc 2020-05-15 12:56:28.0 aaaaaaa
  167. Time taken: 0.055 seconds, Fetched: 6 row(s)
  168. hive> dfs -ls /user/hive/warehouse/log;
  169. Found 2 items
  170. -rw-r--r-- 1 jack supergroup 0 2020-05-14 23:23 /user/hive/warehouse/log/_SUCCESS
  171. -rw-r--r-- 1 jack supergroup 257 2020-05-14 23:23 /user/hive/warehouse/log/part-r-00000
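The merge step can be sketched as follows. This simulates how --merge-key deduplicates by keeping one record per key; the values are made up, and sqoop's real merge runs as a MapReduce job producing part-r-00000:

```python
# Existing records (from earlier part files), keyed by log_id.
old = {"abcabcabcabc": ("2020-05-15 12:56:28", "aaaaaaa"),
       "abcabcabcab3": ("2020-05-07 13:44:38", "dddddd")}
# Incremental batch: one re-imported row and one genuinely new row.
increment = {"abcabcabcab3": ("2020-05-07 13:44:38", "dddddd"),
             "abcabcabcab5": ("2020-05-08 13:44:38", "dddddd")}

def merge_by_key(existing, new_part):
    merged = dict(existing)
    merged.update(new_part)  # the incremental record wins for duplicate keys
    return merged            # one deduplicated result set

result = merge_by_key(old, increment)
print(sorted(result))  # each log_id appears exactly once
```

This is why switching from --append back to --merge-key collapses the duplicated rows into a single copy per key, as the listing above shows.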

Sqoop jobs

Create a job

Before incrementally importing into Hive, create the Hive table first, or import part of the data in advance so that the Hive table is created automatically.

  1. # initialize the data
  2. $ mysql -uroot -p123456
  3. DROP TABLE IF EXISTS `user`;
  4. CREATE TABLE `user` (
  5. `id` int(11) NOT NULL,
  6. `name` varchar(20) DEFAULT NULL,
  7. PRIMARY KEY (`id`)
  8. ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  9. INSERT INTO `user` VALUES ('1', 'jack');
  10. INSERT INTO `user` VALUES ('2', 'tom');
  11. INSERT INTO `user` VALUES ('3', 'white');
  12. INSERT INTO `user` VALUES ('4', 'black');
  13. $ bin/sqoop job \
  14. --create user_import_incr \
  15. --meta-connect jdbc:mysql://master:3306/sqoop \
  16. -- import \
  17. --connect jdbc:mysql://master:3306/mydb \
  18. --username root --password 123456 \
  19. --table user \
  20. --incremental append \
  21. --check-column id \
  22. --last-value 0 \
  23. --target-dir /sqoop/increment/user \
  24. -m 1

Notes

If the Hive metastore service is not running, loading data into Hive fails

  1. # start the hive metastore service (only needed when a metastore service is configured)
  2. $ ~/Documents/hive/bin/hive --service metastore
  3. # if the metastore service was not running and the import failed, delete the temporary directory from step one
  4. $ ~/Documents/hadoop/bin/hdfs dfs -rm -r -f /user/jack/user
  5. # then manually load the temporary data file into the hive table