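Data Serialization Types Provided by Hadoop

Hadoop wraps Java's data types in its own types. These classes all implement Hadoop's serialization interface, Writable, and live in the org.apache.hadoop.io package.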

Java type	Hadoop Writable type
String	Text
boolean	BooleanWritable
byte	ByteWritable
int	IntWritable
float	FloatWritable
long	LongWritable
double	DoubleWritable
Map	MapWritable
array	ArrayWritable
null	NullWritable
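
To make the role of Writable more concrete, here is a minimal sketch that mimics its contract in plain Java, so it runs without Hadoop on the classpath. The real org.apache.hadoop.io.Writable interface declares write(DataOutput) and readFields(DataInput). SimpleWritable, IntBox and WritableSketch are hypothetical stand-ins invented for this illustration, not Hadoop classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {
    // Hypothetical stand-in that mirrors the two methods of Hadoop's Writable.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical int wrapper, playing the role IntWritable plays in Hadoop.
    static class IntBox implements SimpleWritable {
        private int value;

        IntBox() { }
        IntBox(int value) { this.value = value; }

        int get() { return value; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(value);   // serialize the field as 4 bytes
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            value = in.readInt();  // restore the field in the order it was written
        }
    }

    // Serialize a value to bytes, then read it back into a fresh object.
    static int roundTrip(int n) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new IntBox(n).write(new DataOutputStream(bytes));

            IntBox copy = new IntBox();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```

The point of the pattern is that an object knows how to write its own fields to a byte stream and how to rebuild itself from one, which is what lets keys and values travel between the map and reduce stages.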

MapReduce Programming Conventions

A user-written program consists of three parts:

Mapper
Reducer
Driver

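Put simply, the Mapper stage turns each line of data into key-value form, for example key: Xiaohong, value: 1.
By the time the data reaches the Reducer stage, the values of identical keys have been gathered together, in the form key: Xiaohong, values: [1,1,1,1,1]. The Reducer then processes each group and writes out the final result.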

Mapper

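Writing the Mapper:

A user-defined Mapper must extend the Mapper class, with the generic parameters <input key type, input value type, output key type, output value type>
The Mapper's input data is in key-value form (the key and value types can be customized)
The implementing class overrides the map() method of the Mapper class and puts the business logic there
The Mapper's output data is also in key-value form (the key and value types can be customized)
The map() method (running in the MapTask process) is called once for each input key-value pair; by default, input files are processed line by line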
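
The per-record calling pattern can be sketched in plain Java without Hadoop. The sketch assumes line-oriented text input, where each record's key is the byte offset of the line in the file and the value is the line itself. MapCallSketch, Emitter and run() are hypothetical names for illustration; in a real job the framework does the splitting and supplies a Context instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapCallSketch {
    // Hypothetical stand-in for the role Context.write() plays in a real Mapper.
    interface Emitter {
        void write(String key, int value);
    }

    // Business logic for one record, analogous to map(key, value, context).
    // Like WordCount, it ignores the offset and emits (word, 1) for each token.
    static void map(long offset, String line, Emitter out) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.write(itr.nextToken(), 1);
        }
    }

    // Plays the framework's part: split the input into lines and call map()
    // once per line. Each emitted pair is tagged with the record's offset
    // so the per-record calls are visible in the output.
    static List<String> run(String fileContent) {
        List<String> emitted = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            final long recordKey = offset;
            map(recordKey, line, (k, v) -> emitted.add(recordKey + ":" + k + "=" + v));
            // The next record starts after this line's bytes plus the newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Prints 0:hello=1, 0:world=1, 12:hello=1, 12:hadoop=1 (one per line)
        run("hello world\nhello hadoop\n").forEach(System.out::println);
    }
}
```

Two input lines lead to two map() calls, and each call may emit any number of output pairs.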

Take the Mapper from the official WordCount example program (the WordCount class in /opt/module/hadoop-3.2.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar):

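// Extends Mapper; the generic parameters are <input key type, input value type, output key type, output value type>
// Text is Hadoop's wrapper for String
// IntWritable is Hadoop's wrapper for int
// In the input, the key is the offset of the line within the file and the value is the line's content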
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Override the map() method
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Reducer

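Writing the Reducer:

A user-defined Reducer must extend the Reducer class
The Reducer's input types correspond to the Mapper's output types, and the input is also in key-value form
The implementing class overrides the reduce() method of the Reducer class and puts the business logic there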
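The ReduceTask process calls reduce() once for each group of key-value pairs that share the same key. Suppose the Mappers emit the following key-value pairs:
{ a:12, b:10 }
{ a:1, c:1 }
{ a:2, b:3 }

By the time reduce() is called, pairs with the same key have been placed in one group whose value is a collection:
{ a: [12, 1, 2], b: [10, 3], c: [1] }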
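
The grouping step can be reproduced with ordinary Java collections. This is only a sketch of the idea, not how Hadoop implements its shuffle: GroupByKeySketch and its methods are hypothetical names, the TreeMap merely imitates keys arriving in sorted order, and a real job does not promise any particular order of values within a group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupByKeySketch {
    // Collect the values of identical keys into one list per key.
    static Map<String, List<Integer>> group(List<Map<String, Integer>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output.entrySet()) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    // Analogous to reduce(key, values, context): handles one key group.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Call reduce() exactly once per key group and collect the results.
    static Map<String, Integer> run(List<Map<String, Integer>> mapperOutputs) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(mapperOutputs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> outputs = List.of(
                Map.of("a", 12, "b", 10),
                Map.of("a", 1, "c", 1),
                Map.of("a", 2, "b", 3));
        System.out.println(group(outputs)); // {a=[12, 1, 2], b=[10, 3], c=[1]}
        System.out.println(run(outputs));   // {a=15, b=13, c=1}
    }
}
```

Three mapper outputs containing three distinct keys lead to three reduce() calls, one per key.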

Take the Reducer from the official WordCount example program:
// Extends Reducer; the generic parameters are <input key type, input value type, output key type, output value type>
// The Reducer's input key-value types are the Mapper's output key-value types
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    // Override the reduce() method
    // key is the key shared by the whole group
    // values is the collection of values belonging to that key
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Driver

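The Driver acts as the client of the YARN cluster. It submits the whole program to YARN, and what it submits is a Job object that encapsulates the run parameters of the MapReduce program.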

Take the Driver (the main() method) from the official WordCount example program:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
        System.err.println("Usage: wordcount <in> [<in>...] <out>");
        System.exit(2);
    }
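    // Create the job and tell Hadoop which jar contains its classes
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    // Register the Mapper, the Combiner (map-side pre-aggregation, reusing the Reducer here) and the Reducer
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    // Declare the key and value types of the job's output
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Every argument except the last is an input path
    for (int i = 0; i < otherArgs.length - 1; ++i) {
        FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    // The last argument is the output path
    FileOutputFormat.setOutputPath(job,
                                   new Path(otherArgs[otherArgs.length - 1]));
    // Submit the job, wait for it to finish, and exit with 0 on success or 1 on failure
    System.exit(job.waitForCompletion(true) ? 0 : 1);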
}
