In a previous post we covered the installation and configuration of Sqoop:
Sqoop Installation and Configuration
https://www.cndba.cn/dave/article/3290
In this post we focus on how Sqoop imports data. We use a MySQL database for the tests, i.e. we import from MySQL into HDFS.
1 Prepare the Test Environment
Install MySQL and load the official sample database, which can be downloaded from:
https://dev.mysql.com/doc/index-other.html
We use the employee data set here; the installation instructions are on GitHub.
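If you do not yet have the SQL scripts locally, one way to fetch and load them (a sketch, assuming the datacharmer/test_db repository on GitHub that hosts this sample database; the non-partitioned employees.sql variant works just as well as the partitioned one used below):
# clone the sample database repository and load it into MySQL
git clone https://github.com/datacharmer/test_db.git
cd test_db
mysql -u root -p < employees.sql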
C:/Users/zhixin/Desktop/test_db-master>mysql -u root -p < employees_partitioned.sql
Enter password: ****
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
00:01:56
C:/Users/zhixin/Desktop/test_db-master>mysql -u root -p < test_employees_md5.sql
Enter password: ****
INFO
TESTING INSTALLATION
table_name expected_records expected_crc
employees 300024 4ec56ab5ba37218d187cf6ab09ce1aa1
departments 9 d1af5e170d2d1591d776d5638d71fc5f
dept_manager 24 8720e2f0853ac9096b689c14664f847e
dept_emp 331603 ccf6fe516f990bdaa49713fc478701b7
titles 443308 bfa016c472df68e70a03facafa1bc0a8
salaries 2844047 fd220654e95aea1b169624ffe3fca934
table_name found_records found_crc
employees 300024 4ec56ab5ba37218d187cf6ab09ce1aa1
departments 9 d1af5e170d2d1591d776d5638d71fc5f
dept_manager 24 8720e2f0853ac9096b689c14664f847e
dept_emp 331603 ccf6fe516f990bdaa49713fc478701b7
titles 443308 bfa016c472df68e70a03facafa1bc0a8
salaries 2844047 fd220654e95aea1b169624ffe3fca934
table_name records_match crc_match
employees OK ok
departments OK ok
dept_manager OK ok
dept_emp OK ok
titles OK ok
salaries OK ok
computation_time
00:00:16
summary result
CRC OK
count OK
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| cndba |
| employees |
| mysql |
| performance_schema |
| sys |
+--------------------+
6 rows in set (0.00 sec)
mysql> use employees
Database changed
mysql> show tables
-> ;
+----------------------+
| Tables_in_employees |
+----------------------+
| current_dept_emp |
| departments |
| dept_emp |
| dept_emp_latest_date |
| dept_manager |
| employees |
| salaries |
| titles |
+----------------------+
8 rows in set (0.00 sec)
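As an additional sanity check, the row count of the main table can be compared against the 300024 expected_records reported by the md5 test above, for example straight from the shell:
# should print 300024
mysql -u root -p -e "SELECT COUNT(*) FROM employees;" employees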
2 Sqoop Command Help
You can use help to see how the sqoop command is used, as follows:
[dave@www.cndba.cn lib]$ sqoop help
2019-03-03 00:36:24,638 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
To see the detailed usage of a specific command, drill down further, for example:
[dave@www.cndba.cn lib]$ sqoop help import
3 Importing a Table
As the help output above shows, Sqoop's import command imports table data from a database into the Hadoop file system as text files or binary files.
The following command imports the employees table from the MySQL database server into HDFS.
[dave@www.cndba.cn lib]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 1 -P
Warning: /home/hadoop/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2019-03-03 00:42:42,688 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Enter password:
2019-03-03 00:42:44,686 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-03 00:42:44,686 INFO tool.CodeGenTool: Beginning code generation
Sun Mar 03 00:42:45 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
2019-03-03 00:42:45,216 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
2019-03-03 00:42:45,285 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
2019-03-03 00:42:45,326 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop
Note: /tmp/sqoop-hadoop/compile/623ca13a40b7cfe612b7162b6a5c50bd/employees.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-03 00:42:47,865 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/623ca13a40b7cfe612b7162b6a5c50bd/employees.jar
2019-03-03 00:42:47,885 WARN manager.MySQLManager: It looks like you are importing from mysql.
2019-03-03 00:42:47,885 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2019-03-03 00:42:47,885 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2019-03-03 00:42:47,885 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2019-03-03 00:42:47,892 INFO mapreduce.ImportJobBase: Beginning import of employees
2019-03-03 00:42:47,893 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-03 00:42:48,128 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-03 00:42:48,189 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-03 00:42:49,206 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-03 00:42:49,344 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.56.100:8032
2019-03-03 00:42:50,076 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1551513927152_0001
Sun Mar 03 00:42:55 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
2019-03-03 00:42:55,817 INFO db.DBInputFormat: Using read commited transaction isolation
2019-03-03 00:42:56,410 INFO mapreduce.JobSubmitter: number of splits:1
2019-03-03 00:42:56,454 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-03-03 00:42:56,683 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1551513927152_0001
2019-03-03 00:42:56,685 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-03 00:42:57,010 INFO conf.Configuration: resource-types.xml not found
2019-03-03 00:42:57,010 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-03 00:42:57,541 INFO impl.YarnClientImpl: Submitted application application_1551513927152_0001
2019-03-03 00:42:57,599 INFO mapreduce.Job: The url to track the job: http://Master:8088/proxy/application_1551513927152_0001/
2019-03-03 00:42:57,600 INFO mapreduce.Job: Running job: job_1551513927152_0001
2019-03-03 00:43:12,879 INFO mapreduce.Job: Job job_1551513927152_0001 running in uber mode : false
2019-03-03 00:43:12,880 INFO mapreduce.Job: map 0% reduce 0%
2019-03-03 00:43:34,149 INFO mapreduce.Job: map 100% reduce 0%
2019-03-03 00:43:34,169 INFO mapreduce.Job: Job job_1551513927152_0001 completed successfully
2019-03-03 00:43:34,353 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=230472
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=13821993
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=17970
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=17970
Total vcore-milliseconds taken by all map tasks=17970
Total megabyte-milliseconds taken by all map tasks=18401280
Map-Reduce Framework
Map input records=300024
Map output records=300024
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=129
CPU time spent (ms)=4010
Physical memory (bytes) snapshot=151552000
Virtual memory (bytes) snapshot=2764148736
Total committed heap usage (bytes)=121700352
Peak Map Physical memory (bytes)=151552000
Peak Map Virtual memory (bytes)=2764148736
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=13821993
2019-03-03 00:43:34,372 INFO mapreduce.ImportJobBase: Transferred 13.1817 MB in 45.1494 seconds (298.9637 KB/sec)
2019-03-03 00:43:34,384 INFO mapreduce.ImportJobBase: Retrieved 300024 records.
[dave@www.cndba.cn lib]$
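The -P flag used above prompts for the MySQL password interactively. For non-interactive runs, Sqoop also accepts --password-file, which reads the password from a file; a sketch, assuming a hypothetical /user/hadoop/mysql.pwd file in HDFS (the file should contain only the password, with no trailing newline):
# store the password in HDFS once and restrict its permissions
echo -n 'MyMySQLPassword' | hdfs dfs -put - /user/hadoop/mysql.pwd
hdfs dfs -chmod 400 /user/hadoop/mysql.pwd
# then replace -P with --password-file in the import command
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 1 --password-file /user/hadoop/mysql.pwd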
Verify the data imported into HDFS:
[dave@www.cndba.cn lib]$ hdfs dfs -ls -R /
2019-03-03 00:53:22,011 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwx------ - hadoop supergroup 0 2019-03-03 00:42 /tmp
drwx------ - hadoop supergroup 0 2019-03-03 00:42 /tmp/hadoop-yarn
drwx------ - hadoop supergroup 0 2019-03-03 00:43 /tmp/hadoop-yarn/staging
drwx------ - hadoop supergroup 0 2019-03-03 00:42 /tmp/hadoop-yarn/staging/hadoop
drwx------ - hadoop supergroup 0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/hadoop/.staging
drwxr-xr-x - hadoop supergroup 0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - hadoop supergroup 0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxrwx--- - hadoop supergroup 0 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop
-rwxrwx--- 2 hadoop supergroup 15590 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1551513927152_0001-1551544977097-hadoop-employees.jar-1551545011876-1-0-SUCCEEDED-default-1551544990417.jhist
-rwxrwx--- 2 hadoop supergroup 430 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1551513927152_0001.summary
-rwxrwx--- 2 hadoop supergroup 227448 2019-03-03 00:43 /tmp/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1551513927152_0001_conf.xml
drwxr-xr-x - hadoop supergroup 0 2019-03-03 00:43 /user
drwxr-xr-x - hadoop supergroup 0 2019-03-03 00:43 /user/hadoop
drwxr-xr-x - hadoop supergroup 0 2019-03-03 00:43 /user/hadoop/employees
-rw-r--r-- 2 hadoop supergroup 0 2019-03-03 00:43 /user/hadoop/employees/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 13821993 2019-03-03 00:43 /user/hadoop/employees/part-m-00000
[dave@www.cndba.cn lib]$ hdfs dfs -cat /user/hadoop/employees/part-m-*|more
2019-03-03 00:53:27,913 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
10009,1952-04-19,Sumant,Peac,F,1985-02-18
10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
…
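The record count in HDFS can also be checked against the 300024 records the import job reported, using the same pattern as later in this post:
# should print 300024
hdfs dfs -cat /user/hadoop/employees/part-m-* | wc -l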
4 Importing into a Target Directory
When importing table data into HDFS, Sqoop can also write to a specified target directory.
The option for specifying the target directory in the Sqoop import command is:
--target-dir <dir>
For managing HDFS directories, see the following post:
Hadoop HDFS Common Commands Summary
https://www.cndba.cn/dave/article/3258
Import data into the directory:
[dave@www.cndba.cn ~]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 2 --target-dir /dave -P
…
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=2
Other local map tasks=2
Total time spent by all maps in occupied slots (ms)=26526
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=26526
Total vcore-milliseconds taken by all map tasks=26526
Total megabyte-milliseconds taken by all map tasks=27162624
Map-Reduce Framework
Map input records=300024
Map output records=300024
Input split bytes=232
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=194
CPU time spent (ms)=6490
Physical memory (bytes) snapshot=308015104
Virtual memory (bytes) snapshot=5528297472
Total committed heap usage (bytes)=243400704
Peak Map Physical memory (bytes)=155586560
Peak Map Virtual memory (bytes)=2764148736
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=13821993
2019-03-03 01:11:09,948 INFO mapreduce.ImportJobBase: Transferred 13.1817 MB in 40.7114 seconds (331.5546 KB/sec)
2019-03-03 01:11:09,954 INFO mapreduce.ImportJobBase: Retrieved 300024 records.
Note that the target directory must not already exist, otherwise you get an error like the following:
Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://Master:9000/cndba already exists
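If you want to overwrite a previous run instead, you can either remove the directory first with hdfs dfs -rm -r, or add the --delete-target-dir option so Sqoop deletes it before importing; a sketch based on the command above:
# delete /dave (if it exists) before writing the new import
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 2 --target-dir /dave --delete-target-dir -P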
Check the HDFS files:
[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /dave
-rw-r--r-- 2 hadoop supergroup 0 2019-03-03 01:11 /dave/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 7098552 2019-03-03 01:11 /dave/part-m-00000
-rw-r--r-- 2 hadoop supergroup 6723441 2019-03-03 01:10 /dave/part-m-00001
[dave@www.cndba.cn ~]$
[dave@www.cndba.cn ~]$ hdfs dfs -cat /dave/part*|more
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
…
5 Importing a Subset of Table Data
When importing table data with the Sqoop import tool, you can filter rows with a where clause, i.e. import only part of the table.
The syntax of the where option is:
--where "<condition>"
The following command imports the rows of the employees table for employees with an employee number below 10014:
[dave@www.cndba.cn ~]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 3 --where "emp_no<10014" --target-dir /subdave -P
Verify the data:
[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /subdave
-rw-r--r-- 2 hadoop supergroup 0 2019-03-03 01:21 /subdave/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 182 2019-03-03 01:21 /subdave/part-m-00000
-rw-r--r-- 2 hadoop supergroup 186 2019-03-03 01:21 /subdave/part-m-00001
-rw-r--r-- 2 hadoop supergroup 228 2019-03-03 01:21 /subdave/part-m-00002
[dave@www.cndba.cn ~]$ hdfs dfs -cat /subdave/part*
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
10009,1952-04-19,Sumant,Peac,F,1985-02-18
10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
10011,1953-11-07,Mary,Sluis,F,1990-01-22
10012,1960-10-04,Patricio,Bridgland,M,1992-12-18
10013,1963-06-07,Eberhardt,Terkki,M,1985-10-20
[dave@www.cndba.cn ~]$
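For filters that go beyond a simple condition on one table, Sqoop also supports free-form query imports with --query. The query must contain the literal $CONDITIONS token, and --split-by is required when more than one mapper is used; a sketch under those assumptions, with /subdave2 as a hypothetical target directory:
# free-form query import; $CONDITIONS is replaced by Sqoop's split predicates
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --query 'SELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE emp_no < 10014 AND $CONDITIONS' --split-by emp_no --m 3 --target-dir /subdave2 -P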
6 Incremental Import
An incremental import only imports the records newly added to the table. It requires the --incremental, --check-column and --last-value options.
The help output describes these three options as follows:
Incremental import arguments:
--check-column <column> Source column to check for incremental change
--incremental <import-type> Define an incremental import of type 'append' or 'lastmodified'
--last-value <value> Last imported value in the incremental check column
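The example below uses the append mode, which fits a monotonically increasing key such as emp_no. For tables whose existing rows also get updated, Sqoop additionally offers the lastmodified mode, which tracks a timestamp column; a minimal sketch, assuming a hypothetical last_update timestamp column (the employees sample table does not actually have one):
# re-import rows changed since the given timestamp and merge them on emp_no
sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 1 --incremental lastmodified --check-column last_update --last-value "2019-03-01 00:00:00" --merge-key emp_no --target-dir /subdave -P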
In the previous section we imported the employees with numbers below 10014; here we append the remaining employees with the append mode. The command is as follows:
[dave@www.cndba.cn data]$ sqoop import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 3 --incremental append --check-column emp_no --last-value 10014 --target-dir /subdave -P
…
2019-03-03 01:31:14,179 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/070c46a0c36f7854895b347470cb6ace/employees.jar
2019-03-03 01:31:14,198 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(`emp_no`) FROM `employees`
2019-03-03 01:31:14,199 INFO tool.ImportTool: Incremental import based on column `emp_no`
2019-03-03 01:31:14,200 INFO tool.ImportTool: Lower bound value: 10014
2019-03-03 01:31:14,200 INFO tool.ImportTool: Upper bound value: 499999
…
Verify the data:
[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /subdave
-rw-r--r-- 2 hadoop supergroup 0 2019-03-03 01:21 /subdave/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 182 2019-03-03 01:21 /subdave/part-m-00000
-rw-r--r-- 2 hadoop supergroup 186 2019-03-03 01:21 /subdave/part-m-00001
-rw-r--r-- 2 hadoop supergroup 228 2019-03-03 01:21 /subdave/part-m-00002
-rw-r--r-- 2 hadoop supergroup 4547403 2019-03-03 01:31 /subdave/part-m-00003
-rw-r--r-- 2 hadoop supergroup 4636921 2019-03-03 01:31 /subdave/part-m-00004
-rw-r--r-- 2 hadoop supergroup 4637031 2019-03-03 01:31 /subdave/part-m-00005
[dave@www.cndba.cn ~]$ hdfs dfs -cat /subdave/part*|wc -l
300023
[dave@www.cndba.cn ~]$
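Manually tracking --last-value quickly becomes tedious. Sqoop's saved jobs (the job command listed in the help output above) store the last imported value in the Sqoop metastore and update it after each run; a sketch, with emp_incr as a hypothetical job name:
# create a saved incremental import job (note the bare -- before import)
sqoop job --create emp_incr -- import --connect jdbc:mysql://192.168.56.2:3306/employees --username root --table employees --m 3 --incremental append --check-column emp_no --last-value 10014 --target-dir /subdave -P
# execute it; later executions resume from the stored last-value
sqoop job --exec emp_incr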
Copyright notice: this is an original post by the author and may not be reproduced without permission.