开始前准备

将airlinedata复制到虚拟机

hdfs dfs -mkdir -p /user/hdfs/airlinedata
hdfs dfs -mkdir -p /user/hdfs/sampledata
hdfs dfs -mkdir -p /user/hdfs/masterdata

cd /home/hadoop/proHadoop/
hdfs dfs -copyFromLocal airlinedata/ /user/hdfs/airlinedata/
hdfs dfs -copyFromLocal sampledata/ /user/hdfs/sampledata/
hdfs dfs -copyFromLocal masterdata/* /user/hdfs/masterdata/

hdfs dfs -ls /user/hdfs/airlinedata

hdfs dfs -ls /user/hdfs/sampledata

hdfs dfs -ls /user/hdfs/masterdata

TestSelect

1.启动eclipse

root用户转hadoop用户
[root@test ~]# su - hadoop
使用hadoop用户运行以下命令：
[hadoop@test ~]$ eclipse
如果出现以下错误：

[hadoop@test ~]$ eclipse
Eclipse: Cannot open display: 
org.eclipse.m2e.logback.configuration: The org.eclipse.m2e.logback.configuration bundle was activated before the state location was initialized.  Will retry after the state location is initialized.
Eclipse: Cannot open display: 
Eclipse:
An error has occurred. See the log file
/opt/eclipse/configuration/1588190319670.log.

上述错误没有解决
直接到/opt/eclipse中点击应用开启

eclipse过了一段时间自己可以运行了，具体原因不明。￣□￣｜｜

2.建立TestSelect项目

3.创建org.apress.prohadoop.c5包和org.apress.prohadoop.utils包

4.将所需的输入文件拷贝过来，并且创建输出文件夹

输入文件位置是：/TestSelect/devairlinedataset/input/txt

将已有的select文件夹删除
输出文件夹是：/TestSelect/output/c5

5.新建SelectClauseMRJob类和AirlineDataUtils类

内容为：

package org.apress.prohadoop.c5;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apress.prohadoop.utils.AirlineDataUtils;


public class SelectClauseMRJob extends Configured implements Tool {
    public static class SelectClauseMapper extends
            Mapper<LongWritable, Text, NullWritable, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (!AirlineDataUtils.isHeader(value)) {
                StringBuilder output = AirlineDataUtils.mergeStringArray(
                        AirlineDataUtils.getSelectResultsPerRow(value), ",");
                context.write(NullWritable.get(), new Text(output.toString()));
            }
        }
    }
    public int run(String[] allArgs) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(SelectClauseMRJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(SelectClauseMapper.class);
        job.setNumReduceTasks(0);
        String[] args = new GenericOptionsParser(getConf(), allArgs)
                .getRemainingArgs();
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean status = job.waitForCompletion(true);
        if (status) {
            return 0;
        } else {
            return 1;
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        ToolRunner.run(new SelectClauseMRJob(), args);
    }
}

3.本机运行

输入参数
/home/hadoop/eclipse-workspace/TestSelect/devairlinedataset/input/txt/ /home/hadoop/eclipse-workspace/TestSelect/output/c5/select/

运行成功结束
查看结果
刷新output

4.集群环境运行

将java项目打成jar包

选一个jar包存放的位置，并且起一个名字
存放的位置/home/hadoop
jar名字：TestSelect.jar

运行jar

hdfs dfs -ls /user/hdfs

cd /home/hadoop
hadoop jar TestSelect.jar org.apress.prohadoop.c5.SelectClauseMRJob /user/hdfs/sampledata /user/hdfs/output/c5/select

出现
Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s)
解决：没有启动yarn
start-yarn.sh
hadoop jar TestSelect.jar org.apress.prohadoop.c5.SelectClauseMRJob /user/hdfs/sampledata /user/hdfs/output/c5/select

运行结果
hdfs dfs -ls /user/hdfs/output/c5/select

hdfs dfs -cat /user/hdfs/output/c5/select/part-m-00000

hdfs dfs -cat /user/hdfs/output/c5/select/part-m-00001

TestWhere

在TestSelect基础上将WhereClauseMRJob.java拷贝过去

打成jar包，过程同TestSelect
jar包命名TestWhere.jar
存在/home/hadoop下

运行
hadoop jar TestWhere.jar org.apress.prohadoop.c5.WhereClauseMRJob -D map.where.delay=10 /user/hdfs/sampledata /user/hdfs/output/c5/where

TestSUM

hadoop jar TestSUM.jar org.apress.prohadoop.c5.AggregationMRJob /user/hdfs/sampledata /user/hdfs/output/c5/aggregation

TestSUMWithCombiner

hadoop jar TestSUMWithCombiner.jar org.apress.prohadoop.c5.AggregationWithCombinerMRJob /user/hdfs/sampledata /user/hdfs/output/c5/combineraggregation

TestSplitByMonth

hadoop jar TestSplitByMonth.jar org.apress.prohadoop.c5.SplitByMonthMRJob /user/hdfs/sampledata /user/hdfs/output/c5/partitioner

大数据

MapReduce编程（航空大数据）