MapReduce Output Counters: A Worked Example

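When a MapReduce job finishes, the console prints a set of counters collected during execution. Reading these counters is a quick way to check that the job ran as expected.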

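Here is a concrete example: a text file of 100 million lines, each line an integer in the range 0 to 100 million, about 837 MB in total. The task is to use MapReduce to extract the 100 smallest numbers in the file.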

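The strategy is as follows. In the map phase, each mapper maintains a heap bounded at 100 entries that holds the 100 smallest numbers in its split (for that, the heap must keep its largest entry on top, so it can be evicted when a smaller number arrives). The output format is <value, occurrence count>, and when the split is exhausted the mapper emits the 100 numbers in the heap in order. In the reduce phase, the MapReduce framework has already sorted the map output by key (the value itself), so the reducer only has to emit the first 100 values, writing each value as many times as the count carried in the map-phase value.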
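
The strategy above can be simulated in a few lines of plain Python. This is only a sketch of the idea, not the Hadoop job itself: `map_split` and `reduce_all` are hypothetical names, the sort inside `reduce_all` stands in for the framework's sort by key, and grouping duplicates into a single <value, count> pair is my assumption about how the counts are produced (so a split with duplicates may emit fewer than `k` pairs here).

```python
import heapq
from collections import Counter

K = 100  # how many of the smallest values to keep, as in the article


def map_split(values, k=K):
    """Simulate one mapper: the k smallest values of a split as sorted (value, count) pairs."""
    heap = []  # max-heap built from negated values; never holds more than k entries
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v is smaller than the largest retained value, so swap it in
            heapq.heapreplace(heap, -v)
    counts = Counter(-x for x in heap)
    return sorted(counts.items())


def reduce_all(map_outputs, k=K):
    """Simulate a single reducer: merge counts per key, then emit the first k values."""
    merged = Counter()
    for pairs in map_outputs:
        for value, count in pairs:
            merged[value] += count
    result = []
    for value in sorted(merged):  # stands in for the framework's sort by key
        for _ in range(merged[value]):
            result.append(value)
            if len(result) == k:
                return result
    return result
```

The approach works because every one of the global 100 smallest numbers is necessarily among the 100 smallest of its own split, so no mapper can discard a value the reducer still needs.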

The HDFS block size in this environment is 128 MB, so the 837 MB file should be divided into 7 splits (about 120 MB each on average), each handled by its own map task.
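
The split estimate can be reproduced with simple arithmetic. The sketch below assumes one split per HDFS block plus a final partial split; `split_sizes` is a hypothetical helper, and the real `FileInputFormat` tolerates a slightly oversized last split, which does not matter for these numbers.

```python
MIB = 1024 * 1024


def split_sizes(file_bytes, block_bytes=128 * MIB):
    """Approximate split sizes: full blocks followed by one partial split, if any."""
    full, rest = divmod(file_bytes, block_bytes)
    return [block_bytes] * full + ([rest] if rest else [])


sizes = split_sizes(837 * MIB)
print(len(sizes))                            # 7 splits
print(sizes[-1] // MIB)                      # 69 (MiB in the last split)
print(round(sum(sizes) / len(sizes) / MIB))  # 120 (MiB per split on average)
```

Under this approximation the file yields six full 128 MB splits and one of about 69 MB, which averages out to roughly 120 MB per split.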

After submitting and running the job, the tail of the console output looks like this (I have added a few comments):

File System Counters
 FILE: Number of bytes read=50429
 FILE: Number of bytes written=2373560
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=4574262068
 HDFS: Number of bytes written=2584
 HDFS: Number of read operations=97
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=10
Map-Reduce Framework
 
 //Total number of input records, i.e. the 100 million numbers in the file
 Map input records=100000000
 
 //The job was divided into 7 splits and each map emits 100 records, so the map phase emits 700 records in total
 Map output records=700

 //Bytes emitted by the map phase (12 bytes per record on average: a 4-byte IntWritable key plus an 8-byte LongWritable value)
 Map output bytes=8400
 Map output materialized bytes=9842
 Input split bytes=784
 Combine input records=0
 Combine output records=0
 
 //Number of unique keys in the reducer input
 Reduce input groups=346
 Reduce shuffle bytes=9842

 //Total number of records in the reducer input
 Reduce input records=700

 //Total number of records written by the reducer
 Reduce output records=700
 Spilled Records=1400
 Shuffled Maps =7
 Failed Shuffles=0
 Merged Map outputs=7
 GC time elapsed (ms)=1564
 Total committed heap usage (bytes)=3536846848
Shuffle Errors
 BAD_ID=0
 CONNECTION=0
 IO_ERROR=0
 WRONG_LENGTH=0
 WRONG_MAP=0
 WRONG_REDUCE=0
File Input Format Counters
 Bytes Read=877801882
File Output Format Counters
 Bytes Written=2584
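
The relationships noted in the comments above can also be checked mechanically. The following sketch parses `name=value` counter lines and tests a few of them; `parse_counters` and `sanity_checks` are hypothetical helpers, and treating `Shuffled Maps` as the number of map tasks assumes the job ran with a single reducer.

```python
SAMPLE = """
Map-Reduce Framework
 Map input records=100000000
 Map output records=700
 Map output bytes=8400
 Map output materialized bytes=9842
 Reduce input groups=346
 Reduce shuffle bytes=9842
 Reduce input records=700
 Reduce output records=700
 Shuffled Maps =7
 Merged Map outputs=7
"""


def parse_counters(text):
    """Turn 'name=value' lines into a dict, skipping group headers and // comments."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//") or "=" not in line:
            continue
        name, _, value = line.rpartition("=")
        counters[name.strip()] = int(value)
    return counters


def sanity_checks(c, top_k=100, record_bytes=4 + 8):
    """Compare counters against what the top-100 strategy predicts."""
    maps = c["Shuffled Maps"]  # number of map tasks, assuming one reducer
    return {
        "each map emitted top_k records": c["Map output records"] == maps * top_k,
        "records have a fixed size": c["Map output bytes"] == c["Map output records"] * record_bytes,
        "reducer received every map record": c["Reduce input records"] == c["Map output records"],
        "shuffle moved the materialized output": c["Reduce shuffle bytes"] == c["Map output materialized bytes"],
    }


for name, ok in sanity_checks(parse_counters(SAMPLE)).items():
    print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

For the counters shown here all four checks pass; a failing check would be a hint to look at the mapper logic or the job configuration before trusting the output.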

Running the same program three times took 3 min 48 s, 3 min 12 s, and 5 min 25 s.

With the file grown to 2 billion lines (the job took 53 minutes), the counters printed to the console change accordingly:

File System Counters
 FILE: Number of bytes read=48724250
 FILE: Number of bytes written=63469529
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=1499621751504
 HDFS: Number of bytes written=71836
 HDFS: Number of read operations=22798
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=151
Map-Reduce Framework
 Map input records=2000000000
 Map output records=14800
 Map output bytes=177600
 Map output materialized bytes=208088
 Input split bytes=16280
 Combine input records=0
 Combine output records=0
 Reduce input groups=6809
 Reduce shuffle bytes=208088
 Reduce input records=14800
 Reduce output records=14800
 Spilled Records=29600
 Shuffled Maps =148
 Failed Shuffles=0
 Merged Map outputs=148
 GC time elapsed (ms)=19059
 Total committed heap usage (bytes)=73469526016
Shuffle Errors
 BAD_ID=0
 CONNECTION=0
 IO_ERROR=0
 WRONG_LENGTH=0
 WRONG_MAP=0
 WRONG_REDUCE=0
File Input Format Counters
 Bytes Read=19778375016
File Output Format Counters
 Bytes Written=71836
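
To see how the two runs relate, the sketch below puts a few of their counters side by side. The figures are copied from the two console outputs; `summarize` is a hypothetical helper, and the expected number of maps again assumes one split per 128 MB block.

```python
import math

BLOCK_BYTES = 128 * 1024 * 1024  # block size stated in the article

RUNS = {
    "100 million lines": {"lines": 100_000_000, "bytes_read": 877_801_882,
                          "map_output_records": 700, "shuffled_maps": 7},
    "2 billion lines": {"lines": 2_000_000_000, "bytes_read": 19_778_375_016,
                        "map_output_records": 14_800, "shuffled_maps": 148},
}


def summarize(run, top_k=100):
    """Derive per-line size and map counts from one run's counters."""
    return {
        "bytes_per_line": run["bytes_read"] / run["lines"],
        "expected_maps": math.ceil(run["bytes_read"] / BLOCK_BYTES),
        "maps_from_output": run["map_output_records"] // top_k,
    }


for name, run in RUNS.items():
    s = summarize(run)
    print(f"{name}: {s['bytes_per_line']:.2f} bytes/line, "
          f"{s['expected_maps']} expected maps, "
          f"{s['maps_from_output']} maps implied by the output records")
```

Both runs are consistent with one map per block (7 and 148 maps). The larger file averages about 9.9 bytes per line against about 8.8 for the smaller one, which is why the input grew roughly 22.5 times in bytes while the line count grew 20 times.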

 
