Hadoop computations are implemented with MapReduce: you write a Java program, package it as a jar, and submit it for execution, so a solid Java background makes writing MR programs much easier.
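The core logic of WordCount — the mapper emits (word, 1) for every token, then the combiner/reducer sums the counts per key — can be sketched locally in Python. This is a simplified illustration of the model, not the actual Java MR code shipped with Hadoop:

```python
from collections import Counter

def map_phase(lines):
    # Mapper: emit (word, 1) for every whitespace-separated token
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Combiner/Reducer: sum the counts for each distinct word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

text = ["hello hadoop", "hello world"]
print(reduce_phase(map_phase(text)))  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

In a real job the framework shuffles the mapper output so that all pairs with the same key reach the same reducer; here the single `Counter` plays both the combiner and reducer roles.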
Here we run wordcount from the bundled example jar to demonstrate this workflow:
/home/cndba/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
Upload a test file to HDFS:
[cndba@hadoopmaster hadoop]$ ls
bin dfs etc include lib libexec LICENSE.txt logs NOTICE.txt README.txt sbin share tmp var
[cndba@hadoopmaster hadoop]$ hdfs dfs -put LICENSE.txt /dave
[cndba@hadoopmaster hadoop]$
[cndba@hadoopmaster hadoop]$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:16 /dave
drwxr-xr-x - cndba supergroup 0 2019-01-23 21:33 /oracle
drwxr-xr-x - cndba supergroup 0 2019-01-23 22:36 /system
[cndba@hadoopmaster hadoop]$ hdfs dfs -ls -R /
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:16 /dave
-rw-r--r-- 2 cndba supergroup 147144 2019-01-23 23:16 /dave/LICENSE.txt
-rw-r--r-- 2 cndba supergroup 0 2019-01-23 21:51 /dave/www.cndba.cn.txt
drwxr-xr-x - cndba supergroup 0 2019-01-23 21:33 /oracle
drwxr-xr-x - cndba supergroup 0 2019-01-23 21:33 /oracle/mysql
drwxr-xr-x - cndba supergroup 0 2019-01-23 22:36 /system
The example jar lives in the following directory:
[cndba@hadoopmaster mapreduce]$ pwd
/home/cndba/hadoop/share/hadoop/mapreduce
Run the Hadoop MR job:
[cndba@hadoopmaster mapreduce]$ hadoop jar hadoop-mapreduce-examples-3.1.1.jar wordcount /dave/LICENSE.txt output
2019-01-23 23:55:14,527 INFO client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
2019-01-23 23:55:14,944 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/cndba/.staging/job_1548242934753_0003
2019-01-23 23:55:15,344 INFO input.FileInputFormat: Total input files to process : 1
2019-01-23 23:55:15,461 INFO mapreduce.JobSubmitter: number of splits:1
2019-01-23 23:55:15,538 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-01-23 23:55:15,749 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1548242934753_0003
2019-01-23 23:55:15,751 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-01-23 23:55:15,967 INFO conf.Configuration: resource-types.xml not found
2019-01-23 23:55:15,967 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-01-23 23:55:16,050 INFO impl.YarnClientImpl: Submitted application application_1548242934753_0003
2019-01-23 23:55:16,106 INFO mapreduce.Job: The url to track the job: http://hadoopmaster:8088/proxy/application_1548242934753_0003/
2019-01-23 23:55:16,107 INFO mapreduce.Job: Running job: job_1548242934753_0003
2019-01-23 23:55:23,242 INFO mapreduce.Job: Job job_1548242934753_0003 running in uber mode : false
2019-01-23 23:55:23,244 INFO mapreduce.Job: map 0% reduce 0%
2019-01-23 23:55:28,328 INFO mapreduce.Job: map 100% reduce 0%
2019-01-23 23:55:34,369 INFO mapreduce.Job: map 100% reduce 100%
2019-01-23 23:55:34,380 INFO mapreduce.Job: Job job_1548242934753_0003 completed successfully
2019-01-23 23:55:34,524 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=46271
FILE: Number of bytes written=521743
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=147250
HDFS: Number of bytes written=34795
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3346
Total time spent by all reduces in occupied slots (ms)=3103
Total time spent by all map tasks (ms)=3346
Total time spent by all reduce tasks (ms)=3103
Total vcore-milliseconds taken by all map tasks=3346
Total vcore-milliseconds taken by all reduce tasks=3103
Total megabyte-milliseconds taken by all map tasks=3426304
Total megabyte-milliseconds taken by all reduce tasks=3177472
Map-Reduce Framework
Map input records=2746
Map output records=21463
Map output bytes=228869
Map output materialized bytes=46271
Input split bytes=106
Combine input records=21463
Combine output records=2965
Reduce input groups=2965
Reduce shuffle bytes=46271
Reduce input records=2965
Reduce output records=2965
Spilled Records=5930
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=102
CPU time spent (ms)=2300
Physical memory (bytes) snapshot=518160384
Virtual memory (bytes) snapshot=5637390336
Total committed heap usage (bytes)=431489024
Peak Map Physical memory (bytes)=314851328
Peak Map Virtual memory (bytes)=2815950848
Peak Reduce Physical memory (bytes)=203309056
Peak Reduce Virtual memory (bytes)=2821439488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=147144
File Output Format Counters
Bytes Written=34795
[cndba@hadoopmaster mapreduce]$
Note: you may hit the following error at this point:
Hadoop 3.1.1 "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster" — for the fix, see:
https://www.cndba.cn/dave/article/3259
The results are written to the output directory in HDFS (a relative path, which resolves to /user/cndba/output):
[cndba@hadoopmaster mapreduce]$ hdfs dfs -ls -R /
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:16 /dave
-rw-r--r-- 2 cndba supergroup 147144 2019-01-23 23:16 /dave/LICENSE.txt
-rw-r--r-- 2 cndba supergroup 0 2019-01-23 21:51 /dave/www.cndba.cn.txt
drwxr-xr-x - cndba supergroup 0 2019-01-23 21:33 /oracle
drwxr-xr-x - cndba supergroup 0 2019-01-23 21:33 /oracle/mysql
drwxr-xr-x - cndba supergroup 0 2019-01-23 22:36 /system
drwx------ - cndba supergroup 0 2019-01-23 23:25 /tmp
drwx------ - cndba supergroup 0 2019-01-23 23:25 /tmp/hadoop-yarn
drwx------ - cndba supergroup 0 2019-01-23 23:39 /tmp/hadoop-yarn/staging
drwx------ - cndba supergroup 0 2019-01-23 23:25 /tmp/hadoop-yarn/staging/cndba
drwx------ - cndba supergroup 0 2019-01-23 23:55 /tmp/hadoop-yarn/staging/cndba/.staging
drwx------ - cndba supergroup 0 2019-01-23 23:25 /tmp/hadoop-yarn/staging/cndba/.staging/job_1548242934753_0001
-rw-r--r-- 10 cndba supergroup 316297 2019-01-23 23:25 /tmp/hadoop-yarn/staging/cndba/.staging/job_1548242934753_0001/job.jar
-rw-r--r-- 10 cndba supergroup 113 2019-01-23 23:25 /tmp/hadoop-yarn/staging/cndba/.staging/job_1548242934753_0001/job.split
-rw-r--r-- 2 cndba supergroup 42 2019-01-23 23:25 /tmp/hadoop-yarn/staging/cndba/.staging/job_1548242934753_0001/job.splitmetainfo
-rw-r--r-- 2 cndba supergroup 182479 2019-01-23 23:25 /tmp/hadoop-yarn/staging/cndba/.staging/job_1548242934753_0001/job.xml
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:39 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - cndba supergroup 0 2019-01-23 23:39 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxrwx--- - cndba supergroup 0 2019-01-23 23:55 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba
-rwxrwx--- 2 cndba supergroup 22444 2019-01-23 23:40 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba/job_1548242934753_0002-1548257983080-cndba-word+count-1548258000513-1-1-SUCCEEDED-default-1548257988877.jhist
-rwxrwx--- 2 cndba supergroup 440 2019-01-23 23:40 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba/job_1548242934753_0002.summary
-rwxrwx--- 2 cndba supergroup 211968 2019-01-23 23:40 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba/job_1548242934753_0002_conf.xml
-rwxrwx--- 2 cndba supergroup 22442 2019-01-23 23:55 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba/job_1548242934753_0003-1548258916005-cndba-word+count-1548258932719-1-1-SUCCEEDED-default-1548258921387.jhist
-rwxrwx--- 2 cndba supergroup 440 2019-01-23 23:55 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba/job_1548242934753_0003.summary
-rwxrwx--- 2 cndba supergroup 211968 2019-01-23 23:55 /tmp/hadoop-yarn/staging/history/done_intermediate/cndba/job_1548242934753_0003_conf.xml
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:39 /user
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:55 /user/cndba
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:55 /user/cndba/output
-rw-r--r-- 2 cndba supergroup 0 2019-01-23 23:55 /user/cndba/output/_SUCCESS
-rw-r--r-- 2 cndba supergroup 34795 2019-01-23 23:55 /user/cndba/output/part-r-00000
[cndba@hadoopmaster mapreduce]$
[cndba@hadoopmaster mapreduce]$ hdfs dfs -cat /user/cndba/output/part-r-00000|more
""AS 2
"AS 22
"AS-IS" 1
"Adaptation" 1
"COPYRIGHTS 1
"Collection" 1
"Collective 1
"Contribution" 2
"Contributor" 2
"Creative 1
"Derivative 2
"Distribute" 1
"French 2
"JDOM" 2
"JDOM", 1
"Java 1
"LICENSE"). 2
"Legal 1
"License" 1
"License"); 2
"Licensed 1
"Licensor" 3
"Losses") 1
"NOTICE" 1
"Not 1
"Object" 1
"Original 2
"Program" 1
"Publicly 1
……
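Each line of part-r-00000 above is a tab-separated `word<TAB>count` pair, so the results are easy to post-process once fetched locally (e.g. with `hdfs dfs -get`). A minimal Python sketch, using hypothetical sample lines in the same format:

```python
def top_words(lines, n=3):
    # Parse "word\tcount" lines and return the n most frequent words
    pairs = []
    for line in lines:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        pairs.append((word, int(count)))
    return sorted(pairs, key=lambda p: -p[1])[:n]

sample = ['"AS\t22', '"Licensor"\t3', '"AS-IS"\t1']
print(top_words(sample, 2))  # [('"AS', 22), ('"Licensor"', 3)]
```

`rsplit("\t", 1)` splits on the last tab only, which keeps words intact even if a token itself contained a tab-like character.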
Note that the output directory must not already exist; if it does, the job fails with the following error:
[cndba@hadoopmaster mapreduce]$ hadoop jar hadoop-mapreduce-examples-3.1.1.jar wordcount /dave/LICENSE.txt output
2019-01-23 23:42:16,728 INFO client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoopmaster:9000/user/cndba/output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:164)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:280)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
Delete the directory and the job can be run again:
[cndba@hadoopmaster mapreduce]$ hdfs dfs -rm -r output
Deleted output
Copyright notice: this is an original article by the author and may not be reproduced without permission.