This post uses Hadoop's built-in benchmark suite to test cluster performance. The test platform is Hadoop 3.0.0 on CDH 6.3.2.
The benchmark jar is located at /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar.
We use several widely used benchmark programs: TestDFSIO, mrbench, nnbench, TeraSort, and sort.
Running the jar without arguments prints the list of available programs:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
  An example program must be given as the first argument.
  Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  JHLogAnalyzer: Job History Log analyzer.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  SliveTest: HDFS Stress Test and Live Data Verification.
  TestDFSIO: Distributed i/o benchmark.
  fail: a job that always fails
  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
  largesorter: Large-Sort tester
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  minicluster: Single process HDFS and MR cluster.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode.
  sleep: A job that sleeps at each map and reduce task.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

1. TestDFSIO

TestDFSIO tests HDFS I/O performance. It uses a single MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output carries the per-file statistics, and the reduce accumulates them into a summary.
View the usage:

  hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO
  TestDFSIO.1.7
  Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes]

1.1 Testing HDFS write performance

Test:
Write ten 128 MB files to the HDFS cluster:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -size 128MB -resFile /tmp/TestDFSIO_results.log

View the results:

  cat /tmp/TestDFSIO_results.log

Conclusion:
Writing ten 128 MB files to the HDFS cluster: throughput 62.81 MB/s, average IO rate 67.92 MB/s, IO rate standard deviation 26.18, total time 17.42 s.
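For reference, the two speed figures are computed differently (a sketch of the usual definitions for TestDFSIO, where i ranges over the N files, i.e. map tasks):

  Throughput (MB/s)      = Σ bytes_i / Σ time_i
  Average IO rate (MB/s) = (1/N) × Σ (bytes_i / time_i)

Throughput weights every byte of the whole job equally, while the average IO rate is the mean of the per-file rates; this is why the two numbers differ, and the reported standard deviation refers to those per-file rates.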

1.2 Testing HDFS read performance

Test:
Read back the ten 128 MB files written by the previous test:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -size 128MB -resFile /tmp/TestDFSIO_results.log

View the results:

  cat /tmp/TestDFSIO_results.log


Conclusion:
Reading the ten 128 MB files from the HDFS cluster: throughput 148.68 MB/s, average IO rate 258.85 MB/s, IO rate standard deviation 367.78, total time 14.33 s.

1.3 Cleaning up the test data

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
  19/06/27 13:57:21 INFO fs.TestDFSIO: TestDFSIO.1.7
  19/06/27 13:57:21 INFO fs.TestDFSIO: nrFiles = 1
  19/06/27 13:57:21 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
  19/06/27 13:57:21 INFO fs.TestDFSIO: bufferSize = 1000000
  19/06/27 13:57:21 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
  19/06/27 13:57:22 INFO fs.TestDFSIO: Cleaning up test files
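Note that -clean removes the entire test directory from HDFS (/benchmarks/TestDFSIO, the baseDir shown in the log above), so run it only once you are done with both the write and read tests.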


2. nnbench

nnbench tests the NameNode under load. It generates a large volume of HDFS-related requests, putting significant pressure on the NameNode, and can simulate creating, reading, renaming, and deleting files on HDFS.
View the usage:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar nnbench -help
  NameNode Benchmark 0.4
  Usage: nnbench <options>
  Options:
  -operation <Available operations are create_write open_read rename delete. This option is mandatory>
  * NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations.
  -maps <number of maps. default is 1. This is not mandatory>
  -reduces <number of reduces. default is 1. This is not mandatory>
  -startTime <time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time>. default is launch time + 2 mins. This is not mandatory
  -blockSize <Block size in bytes. default is 1. This is not mandatory>
  -bytesToWrite <Bytes to write. default is 0. This is not mandatory>
  -bytesPerChecksum <Bytes per checksum for the files. default is 1. This is not mandatory>
  -numberOfFiles <number of files to create. default is 1. This is not mandatory>
  -replicationFactorPerFile <Replication factor for the files. default is 1. This is not mandatory>
  -baseDir <base DFS path. default is /becnhmarks/NNBench. This is not mandatory>
  -readFileAfterOpen <true or false. if true, it reads the file and reports the average time to read. This is valid with the open_read operation. default is false. This is not mandatory>
  -help: Display the help statement

Test:
Create 1,000 files using 10 mappers and 5 reducers:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar nnbench -operation create_write -maps 10 -reduces 5 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname`
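The `hostname` substitution in -baseDir keeps runs launched from different client hosts in separate HDFS directories.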

View the results stored on HDFS:

  sudo -u hdfs hdfs dfs -cat /benchmarks/NNBench-BIGDATA01/output/*

Conclusion:
The NameNode sustained 10 mappers and 5 reducers creating 1,000 files; the detailed figures (TPS, average execution time, and so on) appear in the output above.
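Per the NOTE in the help text, the open_read, rename, and delete operations assume the files already exist, so they must run after create_write. A follow-up read test against the same baseDir might look like this (a sketch; keep -numberOfFiles and -baseDir consistent with the create_write run):

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar nnbench -operation open_read -maps 10 -reduces 5 -numberOfFiles 1000 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname`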

3. mrbench

mrbench runs a small job repeatedly. It checks whether small jobs on the cluster run repeatably and whether they run efficiently.
View the usage:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar mrbench -help
  MRBenchmark.0.0.2
  Usage: mrbench [-baseDir <base DFS path for output/input, default is /benchmarks/MRBench>] [-jar <local path to job jar file containing Mapper and Reducer implementations, default is current jar file>] [-numRuns <number of times to run the job, default is 1>] [-maps <number of maps for each run, default is 2>] [-reduces <number of reduces for each run, default is 1>] [-inputLines <number of input lines to generate, default is 1>] [-inputType <type of input to generate, one of ascending (default), descending, random>] [-verbose]

Test:
Run a job 50 times:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar mrbench -numRuns 50 -maps 10 -reduces 5 -inputLines 10 -inputType descending
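When all runs finish, mrbench prints a one-line summary of the settings and the average job time in milliseconds (the exact layout may vary slightly by version); that AvgTime figure is the number of interest.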

Conclusion:
Running the job 50 times (10 maps, 5 reduces, 10 input lines per run) took an average of 14.784 s per run; small jobs run correctly and can be repeated efficiently.

TeraSort and the sort benchmark live in a separate jar in the same directory, hadoop-mapreduce-examples.jar. Running it without arguments likewise lists the available programs:

  sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  An example program must be given as the first argument.
  Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

4. TeraSort

TeraSort is a well-known benchmark for testing Hadoop's sorting performance. With the built-in TeraSort program you can measure how different numbers of map and reduce tasks affect Hadoop performance. The input data is generated by the accompanying teragen program; these tests use 1 GB and 10 GB datasets.
A TeraSort test takes three steps:

  1. TeraGen generates random data
  2. TeraSort sorts the data
  3. TeraValidate verifies that the TeraSort output is sorted; if problems are detected, the out-of-order keys are written to the output directory

4.1 TeraGen: generate random data, writing the result to /tmp/examples/terasort-input

  sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  teragen 10000000 /tmp/examples/terasort-input
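teragen's first argument is the number of 100-byte rows to generate, so 10000000 rows comes to roughly 1 GB; the 10 GB run mentioned above would use 100000000.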

(Screenshot: the generated data on HDFS.)
4.2 TeraSort: sort the data, writing the result to /tmp/examples/terasort-output

  sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  terasort /tmp/examples/terasort-input /tmp/examples/terasort-output

(Screenshot: the sorted data on HDFS.)
4.3 TeraValidate: verify the output; if problems are detected, the out-of-order keys are written to /tmp/examples/terasort-validate

  sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  teravalidate /tmp/examples/terasort-output /tmp/examples/terasort-validate
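To check the outcome, print the validation output (a sketch, assuming the usual TeraValidate behavior: error records appear for out-of-order keys, and only a checksum when the data is fully sorted):

  sudo -u hdfs hdfs dfs -cat /tmp/examples/terasort-validate/*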

(Screenshot: the validation result on HDFS.)

5. sort, another commonly used MapReduce benchmark

5.1 randomwriter: generate random data; each node runs 10 map tasks, each writing about 1 GB of random binary data

  sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  randomwriter /tmp/examples/random-data

5.2 sort: sort the random data

  sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  sort /tmp/examples/random-data /tmp/examples/sorted-data

5.3 testmapredsort: verify the data is actually sorted

Note that testmapredsort lives in the jobclient-tests jar (see the program list at the top), not the examples jar:

  sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar \
  testmapredsort \
  -sortInput /tmp/examples/random-data \
  -sortOutput /tmp/examples/sorted-data
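testmapredsort runs as a MapReduce job that fails if validation does not pass, so checking the shell exit status right after the command above is a quick sanity check (a minimal sketch, assuming that failure behavior):

  echo $?   # 0 = sort output validated; non-zero = the job, and hence validation, failed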