Running the wordcount program bundled with Hadoop

2020-06-16 17:08:42

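0. Preface

    The previous article, "A First Look at Hadoop: Quickly Setting Up a Pseudo-Distributed Hadoop Environment", set up a Hadoop environment. This article uses the wordcount program that ships with Hadoop to work through a word-count example.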

http://www.linuxidc.com/Linux/2017-09/146694.htm

1. Counting words with the example program

(1) The wordcount program

The wordcount program is located under Hadoop's share directory:

[root@linuxidc mapreduce]# pwd
/usr/local/hadoop/share/hadoop/mapreduce
[root@linuxidc mapreduce]# ls
hadoop-mapreduce-client-app-2.6.5.jar        hadoop-mapreduce-client-jobclient-2.6.5-tests.jar
hadoop-mapreduce-client-common-2.6.5.jar      hadoop-mapreduce-client-shuffle-2.6.5.jar
hadoop-mapreduce-client-core-2.6.5.jar        hadoop-mapreduce-examples-2.6.5.jar
hadoop-mapreduce-client-hs-2.6.5.jar          lib
hadoop-mapreduce-client-hs-plugins-2.6.5.jar  lib-examples
hadoop-mapreduce-client-jobclient-2.6.5.jar  sources

wordcount is one of the programs packaged in hadoop-mapreduce-examples-2.6.5.jar.
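
    The examples jar bundles more than wordcount. As a minimal sketch (the helper name list_examples and the EXAMPLES_JAR variable are illustrative and not part of Hadoop), running the jar without a program name should make the examples driver print its usage text, which lists the valid program names:

```shell
# Path taken from the directory listing above; adjust it to your installation.
EXAMPLES_JAR=${EXAMPLES_JAR:-/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar}

# list_examples is a hypothetical helper name.
# With no program name given, the examples driver prints its usage text,
# which includes the valid program names (wordcount among them).
# The driver exits non-zero in that case, so the status is ignored here.
list_examples() {
    hadoop jar "$EXAMPLES_JAR" 2>&1 || true
}
```

    For example, `list_examples | grep wordcount` should show the entries whose name contains wordcount.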
 
(2) Create the HDFS data directories
    Create a directory to hold the input files for the MapReduce job:

[root@linuxidc ~]# hadoop fs -mkdir -p /data/wordcount

    Create a directory to hold the output of the MapReduce job:

[root@linuxidc ~]# hadoop fs -mkdir /output

    List the two directories just created:

[root@linuxidc ~]# hadoop fs -ls /
drwxr-xr-x  - root supergroup          0 2017-09-01 20:34 /data
drwxr-xr-x  - root supergroup          0 2017-09-01 20:35 /output
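
    Running `hadoop fs -mkdir /output` a second time fails because the directory already exists. A small sketch of a repeatable setup step, assuming a helper named ensure_hdfs_dir (a made-up name, not a Hadoop command), checks for the directory first:

```shell
# ensure_hdfs_dir is a hypothetical helper name.
# Usage: ensure_hdfs_dir HDFS_PATH
# `hadoop fs -test -d PATH` exits 0 when PATH exists and is a directory,
# so the directory is only created when it is missing.
ensure_hdfs_dir() {
    if hadoop fs -test -d "$1"; then
        echo "exists: $1"
    else
        hadoop fs -mkdir -p "$1" && echo "created: $1"
    fi
}
```

    With this in place, `ensure_hdfs_dir /data/wordcount` and `ensure_hdfs_dir /output` can be run any number of times.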

(3) Create a word file and upload it to HDFS
    The word file looks like this:

[root@linuxidc ~]# cat myword.txt
linuxidc yyh
yyh xplinuxidc
katy ling
yeyonghao linuxidc
xplinuxidc katy

    Upload the file to HDFS:

[root@linuxidc ~]# hadoop fs -put myword.txt /data/wordcount

    Check the uploaded file and its contents in HDFS:

[root@linuxidc ~]# hadoop fs -ls /data/wordcount
-rw-r--r--  1 root supergroup        57 2017-09-01 20:40 /data/wordcount/myword.txt
[root@linuxidc ~]# hadoop fs -cat /data/wordcount/myword.txt
linuxidc yyh
yyh xplinuxidc
katy ling
yeyonghao linuxidc
xplinuxidc katy
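
    Comparing the two listings by eye works for five lines but not for larger inputs. One possible check, sketched here with a made-up helper name verify_upload, streams the HDFS copy back and compares it byte for byte with the local file:

```shell
# verify_upload is a hypothetical helper name.
# Usage: verify_upload LOCAL_FILE HDFS_PATH
# Streams the HDFS copy to stdout and compares it with the local file.
# cmp -s prints nothing and only sets the exit status.
verify_upload() {
    if hadoop fs -cat "$2" | cmp -s - "$1"; then
        echo "match: $2"
    else
        echo "MISMATCH: $2" >&2
        return 1
    fi
}
```

    Here the call would be `verify_upload myword.txt /data/wordcount/myword.txt`.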

(4) Run the wordcount program
    Run the following command:

[root@linuxidc ~]# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /data/wordcount /output/wordcount
...
17/09/01 20:48:14 INFO mapreduce.Job: Job job_local1719603087_0001 completed successfully
17/09/01 20:48:14 INFO mapreduce.Job: Counters: 38
        File System Counters
                FILE: Number of bytes read=585940
                FILE: Number of bytes written=1099502
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=114
                HDFS: Number of bytes written=48
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Map-Reduce Framework
                Map input records=5
                Map output records=10
                Map output bytes=97
                Map output materialized bytes=78
                Input split bytes=112
                Combine input records=10
                Combine output records=6
                Reduce input groups=6
                Reduce shuffle bytes=78
                Reduce input records=6
                Reduce output records=6
                Spilled Records=12
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=92
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=241049600
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=57
        File Output Format Counters 
                Bytes Written=48
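
    The job writes its results to /output/wordcount, and MapReduce refuses to start if that output directory already exists, so running the same command again fails. A hedged sketch of a rerun helper follows; run_wordcount and EXAMPLES_JAR are illustrative names, and the jar path is the one used above:

```shell
# Path taken from the command above; adjust it to your installation.
EXAMPLES_JAR=${EXAMPLES_JAR:-/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar}

# run_wordcount is a hypothetical helper name.
# Usage: run_wordcount INPUT_DIR OUTPUT_DIR
# Removes OUTPUT_DIR if it is left over from an earlier run, then
# submits the bundled wordcount example.
run_wordcount() {
    if hadoop fs -test -e "$2"; then
        # Caution: this discards the results of the previous run.
        hadoop fs -rm -r "$2" || return 1
    fi
    hadoop jar "$EXAMPLES_JAR" wordcount "$1" "$2"
}
```

    The run shown above would then be `run_wordcount /data/wordcount /output/wordcount`.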

(5) View the results
    The word counts are as follows:

[root@linuxidc ~]# hadoop fs -cat /output/wordcount/part-r-00000
katy        2
linuxidc    2
ling        1
xplinuxidc  2
yeyonghao   1
yyh         2
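
    For a file this small, the result can be cross-checked locally with standard Unix tools. This is only a sketch: local_wordcount is a made-up name, and it assumes words are separated by spaces, tabs, or newlines, which is close to (but not guaranteed identical to) how the bundled example tokenizes its input:

```shell
# local_wordcount is a hypothetical helper name.
# Usage: local_wordcount FILE
# Puts one word per line, drops empty lines, then counts duplicates.
# LC_ALL=C makes sort order words by byte value.
# Output: one "word<TAB>count" line per distinct word.
local_wordcount() {
    tr -s ' \t' '\n\n' < "$1" | grep -v '^$' | LC_ALL=C sort | uniq -c |
        awk '{ printf "%s\t%d\n", $2, $1 }'
}
```

    Running `local_wordcount myword.txt` on the local copy gives a listing that can be compared with the contents of part-r-00000.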

