Sqoop: Importing Data from MySQL to HDFS

2019-03-03 01:40 | Original | Sqoop
Author: dave

In an earlier post we covered installing and configuring Sqoop:

Sqoop Installation and Configuration
https://www.cndba.cn/dave/article/3290

This post focuses on how Sqoop imports data. We use a MySQL database for testing, importing from MySQL into HDFS.

1 Preparing the Test Environment

Install MySQL and load the sample database from the official site; the download page is:
https://dev.mysql.com/doc/index-other.html

We use the employee data sample here; the installation instructions are on GitHub.
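
If you prefer the command line, a minimal sketch of fetching and loading the sample data (assuming git is installed; the employee data is hosted in the datacharmer/test_db repository on GitHub):

# Fetch the sample database and load it; employees_partitioned.sql is the
# variant used below, employees.sql is the non-partitioned one.
git clone https://github.com/datacharmer/test_db.git
cd test_db
mysql -u root -p < employees_partitioned.sql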

C:/Users/zhixin/Desktop/test_db-master>mysql -u root -p < employees_partitioned.sql
Enter password: ****
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
00:01:56

C:/Users/zhixin/Desktop/test_db-master>mysql -u root -p < test_employees_md5.sql
Enter password: ****
INFO
TESTING INSTALLATION
table_name      expected_records        expected_crc
employees       300024  4ec56ab5ba37218d187cf6ab09ce1aa1
departments     9       d1af5e170d2d1591d776d5638d71fc5f
dept_manager    24      8720e2f0853ac9096b689c14664f847e
dept_emp        331603  ccf6fe516f990bdaa49713fc478701b7
titles  443308  bfa016c472df68e70a03facafa1bc0a8
salaries        2844047 fd220654e95aea1b169624ffe3fca934
table_name      found_records           found_crc
employees       300024  4ec56ab5ba37218d187cf6ab09ce1aa1
departments     9       d1af5e170d2d1591d776d5638d71fc5f
dept_manager    24      8720e2f0853ac9096b689c14664f847e
dept_emp        331603  ccf6fe516f990bdaa49713fc478701b7
titles  443308  bfa016c472df68e70a03facafa1bc0a8
salaries        2844047 fd220654e95aea1b169624ffe3fca934
table_name      records_match   crc_match
employees       OK      ok
departments     OK      ok
dept_manager    OK      ok
dept_emp        OK      ok
titles  OK      ok
salaries        OK      ok
computation_time
00:00:16
summary result
CRC     OK
count   OK

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cndba              |
| employees          |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
6 rows in set (0.00 sec)

mysql> use employees
Database changed
mysql> show tables
    -> ;
+----------------------+
| Tables_in_employees  |
+----------------------+
| current_dept_emp     |
| departments          |
| dept_emp             |
| dept_emp_latest_date |
| dept_manager         |
| employees            |
| salaries             |
| titles               |
+----------------------+
8 rows in set (0.00 sec)
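
One practical note before importing: Sqoop connects to MySQL over the network (the examples below use jdbc:mysql://192.168.56.2:3306), so the MySQL account must accept remote connections. The rest of this post simply connects as root; a minimal sketch of a dedicated account instead (the sqoop user name, host pattern, and password here are assumptions, adjust them to your environment):

# Create a least-privilege account for remote imports from the Hadoop subnet.
mysql -u root -p -e "
  CREATE USER 'sqoop'@'192.168.56.%' IDENTIFIED BY 'changeme';
  GRANT SELECT ON employees.* TO 'sqoop'@'192.168.56.%';
  FLUSH PRIVILEGES;"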

2 Sqoop Command Help

You can run help to see how the sqoop command is used:

[dave@www.cndba.cn lib]$ sqoop help
2019-03-03 00:36:24,638 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

To see detailed usage for a specific command, drill down further, e.g.:
[dave@www.cndba.cn lib]$ sqoop help import
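
Before the first import, the list-databases and list-tables commands from the help output above are a quick way to confirm that Sqoop can actually reach MySQL. A sketch using the same connection details as the imports below (-P prompts for the password):

# Confirm connectivity and enumerate what is visible to the account.
sqoop list-databases --connect jdbc:mysql://192.168.56.2:3306 --username root -P
sqoop list-tables --connect jdbc:mysql://192.168.56.2:3306/employees --username root -P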

3 Importing a Table

As the help output shows, Sqoop's import command copies data from a table into the Hadoop file system, as text files or as binary files (more on file formats at the end of this section).

The following command imports the employees table from the MySQL server into HDFS.

[dave@www.cndba.cn lib]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 1 -P
Warning: /home/hadoop/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2019-03-03 00:42:42,688 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Enter password: 
2019-03-03 00:42:44,686 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-03 00:42:44,686 INFO tool.CodeGenTool: Beginning code generation
Sun Mar 03 00:42:45 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
2019-03-03 00:42:45,216 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
2019-03-03 00:42:45,285 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
2019-03-03 00:42:45,326 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop
Note: /tmp/sqoop-hadoop/compile/623ca13a40b7cfe612b7162b6a5c50bd/employees.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-03 00:42:47,865 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/623ca13a40b7cfe612b7162b6a5c50bd/employees.jar
2019-03-03 00:42:47,885 WARN manager.MySQLManager: It looks like you are importing from mysql.
2019-03-03 00:42:47,885 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2019-03-03 00:42:47,885 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2019-03-03 00:42:47,885 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2019-03-03 00:42:47,892 INFO mapreduce.ImportJobBase: Beginning import of employees
2019-03-03 00:42:47,893 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-03 00:42:48,128 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-03 00:42:48,189 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-03 00:42:49,206 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-03 00:42:49,344 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.56.100:8032
2019-03-03 00:42:50,076 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1551513927152_0001
Sun Mar 03 00:42:55 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
2019-03-03 00:42:55,817 INFO db.DBInputFormat: Using read commited transaction isolation
2019-03-03 00:42:56,410 INFO mapreduce.JobSubmitter: number of splits:1
2019-03-03 00:42:56,454 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-03-03 00:42:56,683 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1551513927152_0001
2019-03-03 00:42:56,685 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-03 00:42:57,010 INFO conf.Configuration: resource-types.xml not found
2019-03-03 00:42:57,010 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-03 00:42:57,541 INFO impl.YarnClientImpl: Submitted application application_1551513927152_0001
2019-03-03 00:42:57,599 INFO mapreduce.Job: The url to track the job: http://Master:8088/proxy/application_1551513927152_0001/
2019-03-03 00:42:57,600 INFO mapreduce.Job: Running job: job_1551513927152_0001
2019-03-03 00:43:12,879 INFO mapreduce.Job: Job job_1551513927152_0001 running in uber mode : false
2019-03-03 00:43:12,880 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-03 00:43:34,149 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-03 00:43:34,169 INFO mapreduce.Job: Job job_1551513927152_0001 completed successfully
2019-03-03 00:43:34,353 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=230472
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=13821993
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters 
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=17970
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=17970
        Total vcore-milliseconds taken by all map tasks=17970
        Total megabyte-milliseconds taken by all map tasks=18401280
    Map-Reduce Framework
        Map input records=300024
        Map output records=300024
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=129
        CPU time spent (ms)=4010
        Physical memory (bytes) snapshot=151552000
        Virtual memory (bytes) snapshot=2764148736
        Total committed heap usage (bytes)=121700352
        Peak Map Physical memory (bytes)=151552000
        Peak Map Virtual memory (bytes)=2764148736
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=13821993
2019-03-03 00:43:34,372 INFO mapreduce.ImportJobBase: Transferred 13.1817 MB in 45.1494 seconds (298.9637 KB/sec)
2019-03-03 00:43:34,384 INFO mapreduce.ImportJobBase: Retrieved 300024 records.
[dave@www.cndba.cn lib]$

Verify the data imported into HDFS. With no --target-dir specified, Sqoop writes to a table-named directory under the user's HDFS home, /user/hadoop/employees here:

[dave@www.cndba.cn lib]$ hdfs dfs -ls -R /
2019-03-03 00:53:22,011 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwx------   - hadoop supergroup          0 2019-03-03 00:42 /tmp
drwx------   - hadoop supergroup          0 2019-03-03 00:42 /tmp/hadoop-yarn
drwx------   - hadoop supergroup          0 2019-03-03 00:43 /tmp/hadoop-yarn/staging
drwx------   - hadoop supergroup          0 2019-03-03 00:42 /tmp/hadoop-yarn/staging/hadoop
drwx------   - hadoop supergroup          0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/hadoop/.staging
drwxr-xr-x   - hadoop supergroup          0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history
drwxrwxrwt   - hadoop supergroup          0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxrwx---   - hadoop supergroup          0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop
-rwxrwx---   2 hadoop supergroup      15590 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1551513927152_0001-1551544977097-hadoop-employees.jar-1551545011876-1-0-SUCCEEDED-default-1551544990417.jhist
-rwxrwx---   2 hadoop supergroup        430 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1551513927152_0001.summary
-rwxrwx---   2 hadoop supergroup     227448 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1551513927152_0001_conf.xml
drwxr-xr-x   - hadoop supergroup          0 2019-03-03 00:43 /user
drwxr-xr-x   - hadoop supergroup          0 2019-03-03 00:43 /user/hadoop
drwxr-xr-x   - hadoop supergroup          0 2019-03-03 00:43 /user/hadoop/employees
-rw-r--r--   2 hadoop supergroup          0 2019-03-03 00:43 /user/hadoop/employees/_SUCCESS
-rw-r--r--   2 hadoop supergroup   13821993 2019-03-03 00:43 /user/hadoop/employees/part-m-00000
[dave@www.cndba.cn lib]$ hdfs dfs -cat /user/hadoop/employees/part-m-*|more
2019-03-03 00:53:27,913 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
10009,1952-04-19,Sumant,Peac,F,1985-02-18
10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
…
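
The import above produced comma-delimited text files, which is Sqoop's default. For binary output, the import command also accepts format flags such as --as-sequencefile and --as-avrodatafile; a sketch of the same import as Avro:

# Same table, written as Avro data files instead of plain text;
# --as-sequencefile works the same way.
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees \
  --username root -P --table employees --m 1 \
  --as-avrodatafile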

4 Importing into a Target Directory

When importing table data into HDFS, Sqoop can also write to an explicit target directory.
The syntax of the option is:
--target-dir <dir>

For managing HDFS directories, see this earlier post:

Hadoop HDFS Common Commands Summary
https://www.cndba.cn/dave/article/3258

Import data into a target directory:

[dave@www.cndba.cn ~]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 2 --target-dir /dave -P
…
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
        HDFS: Number of bytes read erasure-coded=0
    Job Counters 
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=26526
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=26526
        Total vcore-milliseconds taken by all map tasks=26526
        Total megabyte-milliseconds taken by all map tasks=27162624
    Map-Reduce Framework
        Map input records=300024
        Map output records=300024
        Input split bytes=232
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=194
        CPU time spent (ms)=6490
        Physical memory (bytes) snapshot=308015104
        Virtual memory (bytes) snapshot=5528297472
        Total committed heap usage (bytes)=243400704
        Peak Map Physical memory (bytes)=155586560
        Peak Map Virtual memory (bytes)=2764148736
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=13821993
2019-03-03 01:11:09,948 INFO mapreduce.ImportJobBase: Transferred 13.1817 MB in 40.7114 seconds (331.5546 KB/sec)
2019-03-03 01:11:09,954 INFO mapreduce.ImportJobBase: Retrieved 300024 records.

Note that the target directory must not already exist, or the import fails with an error like:
Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://Master:9000/cndba already exists
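
Two common ways around this: remove the old output first, or pass --delete-target-dir and let Sqoop remove it. A sketch:

# Either clear the directory by hand ...
hdfs dfs -rm -r /dave
# ... or have Sqoop drop it before writing:
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees \
  --username root -P --table employees --m 2 \
  --target-dir /dave --delete-target-dir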

View the HDFS files:

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /dave
-rw-r--r--   2 hadoop supergroup          0 2019-03-03 01:11 /dave/_SUCCESS
-rw-r--r--   2 hadoop supergroup    7098552 2019-03-03 01:11 /dave/part-m-00000
-rw-r--r--   2 hadoop supergroup    6723441 2019-03-03 01:10 /dave/part-m-00001
[dave@www.cndba.cn ~]$
[dave@www.cndba.cn ~]$ hdfs dfs -cat /dave/part*|more
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
…

5 Importing a Subset of a Table

When importing table data, the Sqoop import tool can filter rows with a where clause, so that only a subset of the table is imported.

The syntax of the where option is:
--where <condition>

The following command imports the employees with an employee number below 10014:

[dave@www.cndba.cn ~]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 3  --where "emp_no<10014" --target-dir /subdave -P

Verify the data:

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /subdave
-rw-r--r--   2 hadoop supergroup          0 2019-03-03 01:21 /subdave/_SUCCESS
-rw-r--r--   2 hadoop supergroup        182 2019-03-03 01:21 /subdave/part-m-00000
-rw-r--r--   2 hadoop supergroup        186 2019-03-03 01:21 /subdave/part-m-00001
-rw-r--r--   2 hadoop supergroup        228 2019-03-03 01:21 /subdave/part-m-00002
[dave@www.cndba.cn ~]$ hdfs dfs -cat /subdave/part*
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
10009,1952-04-19,Sumant,Peac,F,1985-02-18
10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
10011,1953-11-07,Mary,Sluis,F,1990-01-22
10012,1960-10-04,Patricio,Bridgland,M,1992-12-18
10013,1963-06-07,Eberhardt,Terkki,M,1985-10-20
[dave@www.cndba.cn ~]$
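
--where handles simple predicates; for joins or more elaborate filters, Sqoop also accepts a free-form query via --query. The query must contain the literal $CONDITIONS token (Sqoop replaces it with its split predicates), and with more than one mapper --split-by is required. A sketch (the /subdave_query directory is just an example):

# Free-form query import; single quotes keep the shell from expanding
# $CONDITIONS, and --split-by tells Sqoop how to partition the work.
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees \
  --username root -P --m 3 \
  --query 'SELECT emp_no, first_name, last_name FROM employees WHERE emp_no < 10014 AND $CONDITIONS' \
  --split-by emp_no \
  --target-dir /subdave_query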

6 Incremental Import

An incremental import loads only rows newly added to the table. It requires the --incremental, --check-column, and --last-value options.

The help describes these three options as follows:

Incremental import arguments:
   --check-column <column>      Source column to check for incremental change
   --incremental <import-type>    Define an incremental import of type  'append' or 'lastmodified'
   --last-value <value>           Last imported value in the incremental  check column

In the previous section we imported the employees with employee numbers below 10014; here we append the rest. Note that --last-value is exclusive, so only rows with emp_no greater than 10014 are fetched. The command is:

[dave@www.cndba.cn data]$  sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 3 --incremental append --check-column emp_no --last-value 10014 --target-dir /subdave -P
…
2019-03-03 01:31:14,179 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/070c46a0c36f7854895b347470cb6ace/employees.jar
2019-03-03 01:31:14,198 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(`emp_no`) FROM `employees`
2019-03-03 01:31:14,199 INFO tool.ImportTool: Incremental import based on column `emp_no`
2019-03-03 01:31:14,200 INFO tool.ImportTool: Lower bound value: 10014
2019-03-03 01:31:14,200 INFO tool.ImportTool: Upper bound value: 499999
…

Verify the data:

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /subdave
-rw-r--r--   2 hadoop supergroup          0 2019-03-03 01:21 /subdave/_SUCCESS
-rw-r--r--   2 hadoop supergroup        182 2019-03-03 01:21 /subdave/part-m-00000
-rw-r--r--   2 hadoop supergroup        186 2019-03-03 01:21 /subdave/part-m-00001
-rw-r--r--   2 hadoop supergroup        228 2019-03-03 01:21 /subdave/part-m-00002
-rw-r--r--   2 hadoop supergroup    4547403 2019-03-03 01:31 /subdave/part-m-00003
-rw-r--r--   2 hadoop supergroup    4636921 2019-03-03 01:31 /subdave/part-m-00004
-rw-r--r--   2 hadoop supergroup    4637031 2019-03-03 01:31 /subdave/part-m-00005
[dave@www.cndba.cn ~]$ hdfs dfs -cat /subdave/part*|wc -l
300023
[dave@www.cndba.cn ~]$

The count is 300023 rather than 300024 because of the exclusive bound: the first import took emp_no < 10014 and the append took emp_no > 10014, so the row with emp_no = 10014 was never imported. Starting the append with --last-value 10013 would have included it.
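
Passing --last-value by hand is error-prone for recurring loads. The sqoop job command from the help output in section 2 can save the import definition, and the Sqoop metastore then remembers the last value between runs. A sketch with a hypothetical job name:

# Save the incremental import as a job; note the bare '--' before 'import'.
sqoop job --create emp_append -- import \
  --connect jdbc:mysql://192.168.56.2:3306/employees \
  --username root -P --table employees --m 3 \
  --incremental append --check-column emp_no --last-value 0 \
  --target-dir /subdave

# Each execution resumes from the last value recorded by the previous run.
sqoop job --exec emp_append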

Copyright notice: This is an original post by the author; reproduction without the author's permission is prohibited.
