I. HDFS data formats in detail
1. File formats
2. Compression formats
3. Using file formats
The example below sets the output compression to gzip (the codec used is GzipCodec).
Change the shell command by adding -D parameters.
Template:
    yarn jar jar_path main_class_path -Dk1=v1 -Dk2=v2 ... <in> <out>
Concrete usage:
    yarn jar TlHadoopCore-jar-with-dependencies.jar \
    com.tianliangedu.examples.WordCountV2 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    /tmp/tianliangedu/input /tmp/tianliangedu/output19
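Add the following code to the Driver class so that the -D options are applied to conf and only the remaining arguments (<in> <out>) are passed on to the job:
    // Argument parser: applies generic options such as -Dk=v to conf
    GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
    // Whatever is left after the generic options, here the input and output paths
    String[] remainingArgs = optionParser.getRemainingArgs();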
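To make the split between -D options and remaining arguments concrete, here is a minimal plain-Java sketch. It is not Hadoop's implementation: the class `GenericOptionsSketch` and its fields are hypothetical names invented for illustration, and it only imitates the -Dkey=value handling, none of the other generic options.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the -D handling of GenericOptionsParser (hypothetical, not Hadoop code). */
class GenericOptionsSketch {
    /** Properties collected from -Dkey=value arguments. */
    final Map<String, String> properties = new LinkedHashMap<>();
    /** Arguments left over for the application, e.g. <in> <out>. */
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                remainingArgs.add(arg);      // not a -D option: keep for the job
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // A -D argument without '=' is silently ignored in this sketch.
        }
    }

    public static void main(String[] args) {
        String[] cli = {
            "-Dmapred.output.compress=true",
            "-Dmapred.reduce.tasks=2",
            "/tmp/tianliangedu/input",
            "/tmp/tianliangedu/output19"
        };
        GenericOptionsSketch parsed = new GenericOptionsSketch(cli);
        System.out.println("properties    = " + parsed.properties);
        System.out.println("remainingArgs = " + parsed.remainingArgs);
    }
}
```

In the real Driver, the equivalent of `remainingArgs` is what should be used for the input and output paths; reading them from the raw `args` array would pick up the -D options instead.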
II. Custom partitioning
1. Setting the number of reduce tasks
    yarn jar TlHadoopCore-jar-with-dependencies.jar \
    com.tianliangedu.examples.WordCountV2 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -Dmapred.reduce.tasks=2 \
    /tmp/tianliangedu/input /tmp/tianliangedu/output38
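With two reduce tasks and no custom partitioner, keys are spread over the reducers by hash. The sketch below applies the formula used by Hadoop's default HashPartitioner, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, to plain Java strings. The class name `HashPartitionSketch` is hypothetical, and `String.hashCode()` is only a stand-in: a real job hashes the Writable key (for example Text), so the concrete partition numbers may differ.

```java
/** Illustration of hash-based partitioning on String keys (hypothetical helper, not Hadoop code). */
class HashPartitionSketch {
    /** Masks off the sign bit so the result is never negative, then takes the modulo. */
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // corresponds to -Dmapred.reduce.tasks=2
        String[] words = {"hadoop", "yarn", "hdfs", "spark"};
        for (String word : words) {
            System.out.printf("%s -> reducer %d%n", word, getPartition(word, numReduceTasks));
        }
    }
}
```

Each reduce task writes its own output file, so the output directory of the command above should contain two result files rather than one.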
2. Implementing a custom partitioner
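    /** Custom Partitioner definition. */
    public static class MyHashPartitioner<K, V> extends Partitioner<K, V> {
        /**
         * Partition by the first character of the key: keys that sort before 'q'
         * go to partition 0, all others to partition 1.
         */
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // The modulo keeps the result valid when only one reduce task is configured.
            return (key.toString().charAt(0) < 'q' ? 0 : 1) % numReduceTasks;
            // return key.toString().charAt(0); // character code, can fall outside [0, numReduceTasks)
        }
    }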
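The partition rule above can be tried out without a cluster. The following sketch copies only the expression from `getPartition` into a plain static method; the class name `FirstLetterPartitionSketch` and the sample words are assumptions made for illustration.

```java
/** Stand-alone check of the first-letter partition rule (hypothetical helper, not Hadoop code). */
class FirstLetterPartitionSketch {
    /** Same expression as MyHashPartitioner.getPartition, applied to a String key. */
    static int getPartition(String key, int numReduceTasks) {
        return (key.charAt(0) < 'q' ? 0 : 1) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"apple", "quick", "zebra"};
        for (String word : words) {
            System.out.printf("%s -> %d%n", word, getPartition(word, 2));
        }
        // prints:
        // apple -> 0
        // quick -> 1
        // zebra -> 1

        // With a single reduce task every key ends up in partition 0.
        System.out.println(getPartition("zebra", 1)); // prints 0
    }
}
```

Two caveats follow from the rule itself: the comparison is on character codes, so uppercase letters and digits all sort before 'q' and land in partition 0, and an empty key would make `charAt(0)` throw. In a real job the partitioner also has to be registered in the Driver, typically with `job.setPartitionerClass(MyHashPartitioner.class)`, and the job needs at least two reduce tasks for partition 1 to be used.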
