1. Hadoop HDFS

Hadoop HDFS is Hadoop's distributed file system.

1.1 Common Commands

  hadoop fs -mkdir         # create an HDFS directory
  hadoop fs -ls            # list an HDFS directory
  hadoop fs -copyFromLocal # copy a local file to HDFS
  hadoop fs -put           # copy a local file to HDFS
  hadoop fs -cat           # display the contents of an HDFS file
  hadoop fs -copyToLocal   # copy an HDFS file to the local filesystem
  hadoop fs -get           # copy an HDFS file to the local filesystem
  hadoop fs -cp            # copy within HDFS
  hadoop fs -rm            # delete
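
The same operations can also be performed from Java through the HDFS FileSystem API. Below is a minimal sketch, not a definitive implementation: the class name HdfsOps and the paths are illustrative, and it assumes fs.defaultFS is configured (for example in core-site.xml) and that the Hadoop client libraries are on the classpath.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Illustrative example: programmatic equivalents of the shell commands above
  public class HdfsOps {
    public static void main(String[] args) throws Exception {
      // Picks up fs.defaultFS from the Hadoop configuration on the classpath
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // hadoop fs -mkdir
      fs.mkdirs(new Path("/user/test"));

      // hadoop fs -put / -copyFromLocal
      fs.copyFromLocalFile(new Path("test.txt"), new Path("/user/test/test.txt"));

      // hadoop fs -get / -copyToLocal
      fs.copyToLocalFile(new Path("/user/test/test.txt"), new Path("test-copy.txt"));

      // hadoop fs -rm (second argument: whether to delete recursively)
      fs.delete(new Path("/user/test/test.txt"), false);

      fs.close();
    }
  }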

1.2 Directories

  # create the user directory
  hadoop fs -mkdir /user
  # create the test subdirectory under user
  hadoop fs -mkdir /user/test
  # delete a directory
  hadoop fs -rm -r /user/test
  # list the entries directly under the HDFS root /
  hadoop fs -ls /
  # list everything under the HDFS root / recursively
  hadoop fs -ls -R /
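
A rough Java counterpart of the two listing commands is sketched below, under the same assumptions as in 1.1 (the class name is illustrative): FileSystem.listStatus lists a single level, while FileSystem.listFiles with recursive=true walks the whole tree, although it returns files only, not directories.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.LocatedFileStatus;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.RemoteIterator;

  // Illustrative example: listing HDFS paths from Java
  public class HdfsListing {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // Roughly "hadoop fs -ls /": one level only
      for (FileStatus status : fs.listStatus(new Path("/"))) {
        System.out.println(status.getPath());
      }

      // Roughly "hadoop fs -ls -R /": recursive, files only
      RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/"), true);
      while (files.hasNext()) {
        System.out.println(files.next().getPath());
      }

      fs.close();
    }
  }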

1.3 Files

  # copy a local file to an HDFS directory; add -f to force overwrite
  hadoop fs -copyFromLocal test.txt /user/test
  hadoop fs -put test.txt /user/test/test.txt
  # display file contents
  hadoop fs -cat /user/test/test.txt
  hadoop fs -cat /user/test/test.txt | more
  # copy an HDFS file to the local filesystem
  hadoop fs -copyToLocal /user/test/test.txt
  # copy a file to the local filesystem
  hadoop fs -get /user/test/test.txt
  # copy within HDFS
  hadoop fs -cp /user/test/test.txt /user/test/temp.txt
  # delete a file
  hadoop fs -rm /user/test/test.txt
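
The -cat command above can be reproduced in Java by opening the HDFS file and copying it to standard output. A minimal sketch under the same assumptions as in 1.1 (class name and path are illustrative):

  import java.io.InputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  // Illustrative example: roughly "hadoop fs -cat /user/test/test.txt"
  public class HdfsCat {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // Open the HDFS file and stream its contents to stdout
      try (InputStream in = fs.open(new Path("/user/test/test.txt"))) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }

      fs.close();
    }
  }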

2. Hadoop MapReduce

MapReduce lets a large number of servers process data in parallel: the Map phase splits the input and processes each piece independently, and the Reduce phase merges the intermediate results. For example, when counting words in "hello world hello", Map emits (hello, 1), (world, 1), (hello, 1), and Reduce merges them into (hello, 2), (world, 1).

2.1 Create WordCount.java

  1. Create the working directory

  mkdir -p ~/wordcount/input
  cd ~/wordcount

  2. Open the official MapReduce Tutorial documentation and copy the WordCount.java source

  sudo vim WordCount.java

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Mapper: tokenize each input line and emit (word, 1) for every token
    public static class TokenizerMapper
         extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text value, Context context
                      ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    // Reducer (also used as the combiner): sum the counts emitted for each word
    public static class IntSumReducer
         extends Reducer<Text, IntWritable, Text, IntWritable> {

      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values,
                         Context context
                         ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      // args[0]: HDFS input path, args[1]: HDFS output path (must not already exist)
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

  3. Compile WordCount.java (refer to Compiling Hadoop Code)

  # compile
  javac -classpath `${HADOOP_HOME}/bin/hadoop classpath` WordCount.java
  # package
  jar cf wordcount.jar WordCount*.class
  # list the generated class files and jar
  ll

2.2 Testing

  1. Use the LICENSE.txt that ships with Hadoop as test input

  cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input

  2. Upload LICENSE.txt to HDFS

  hadoop fs -mkdir -p /wordcount/input
  cd ~/wordcount/input
  hadoop fs -copyFromLocal LICENSE.txt /wordcount/input

  3. Run WordCount

  cd ~/wordcount
  hadoop jar wordcount.jar WordCount /wordcount/input/LICENSE.txt /wordcount/output/

If the job fails with an error, check whether <HADOOP_HOME>/etc/hadoop/mapred-site.xml contains the following configuration:

  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  4. View the results

  • List the output directory with hadoop fs -ls /wordcount/output: the _SUCCESS marker indicates the job completed successfully, and part-r-00000 holds the generated result.

  -rw-r--r-- 3 root supergroup 0 2022-02-07 20:22 /wordcount/output/_SUCCESS
  -rw-r--r-- 3 root supergroup 9894 2022-02-07 20:22 /wordcount/output/part-r-00000

  • View the result directly on HDFS: hadoop fs -cat /wordcount/output/part-r-00000 | more

  5. Notes

  • Before running WordCount again, delete the output directory (hadoop fs -rm -r /wordcount/output); otherwise the job fails because the output path already exists.