Hadoop Basics - "Big Data Notes"

Common Hadoop shell commands
Web verification exercises for Hadoop

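————————Common Hadoop commands————————
1. Merge small local files and upload the result to HDFS
hadoop fs -appendToFile 1.txt 2.txt /input/3.txt
2. Download small files from HDFS and merge them into one large local file
hadoop fs -getmerge /input/*.txt local_largefile.txt
3. Merge small files that are already on HDFS
hadoop fs -cat /input/*.txt | hadoop fs -appendToFile - /input/hdfs_largefile.txt
4. List the files that have corrupt blocks
hadoop fsck / -list-corruptfileblocks
5. Move corrupted files to /lost+found (fsck option -move)
6. Delete corrupted files (fsck option -delete)
7. Check and list the status of every file (fsck option -files)
8. Print files that are currently open for writing (fsck option -openforwrite)
9. Print the block report of each file (fsck option -blocks; use it together with -files)
10. Print the location of each block (fsck option -locations)
11. Print the rack of each block location (fsck option -racks)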

————————Common YARN commands————————
yarn application: view applications
(1) List all applications: yarn application -list
(2) Filter by application state with -appStates (valid states: ALL, NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED): yarn application -list -appStates FINISHED
(3) Kill an application: yarn application -kill application_1612577921195_0001
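The ID passed to -kill above follows the pattern application_<cluster timestamp>_<sequence number>, where the timestamp is the ResourceManager start time in milliseconds since the epoch. A minimal Python sketch of splitting such an ID is shown here; the helper name parse_application_id is hypothetical and not part of any YARN tool.

```python
from datetime import datetime, timezone


def parse_application_id(app_id):
    """Split a YARN application ID into (ResourceManager start time, sequence number)."""
    prefix, cluster_ts, sequence = app_id.split("_")
    if prefix != "application":
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    # The middle field is a millisecond timestamp.
    started = datetime.fromtimestamp(int(cluster_ts) / 1000, tz=timezone.utc)
    return started, int(sequence)


started, sequence = parse_application_id("application_1612577921195_0001")
print(started.isoformat(), sequence)
```

This can help when sorting or grouping the output of yarn application -list by the cluster run that produced each application.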
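yarn logs: view logs
(1) Show the logs of an application: yarn logs -applicationId <ApplicationId>
(2) Show the logs of a container: yarn logs -applicationId <ApplicationId> -containerId <ContainerId>
yarn applicationattempt: view application attempts
(1) List all attempts of an application: yarn applicationattempt -list <ApplicationId>
(2) Print the status of an application attempt: yarn applicationattempt -status <ApplicationAttemptId>
yarn container: view containers
(1) List all containers of an attempt: yarn container -list <ApplicationAttemptId>
(2) Print the status of a container: yarn container -status <ContainerId>
yarn node: view node status
(1) List all nodes: yarn node -list -all
yarn rmadmin: refresh configuration
(1) Reload the queue configuration: yarn rmadmin -refreshQueues
yarn queue: view queues
(1) Print queue information: yarn queue -status <QueueName>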

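Common Hadoop shell commands

-ls: list the contents of the given path
-ls -R: list the given path recursively
-du: show the size of the files and directories under a path
-mkdir: create an empty directory (-p creates missing parent directories)
-rm: delete files (use -rmdir for an empty directory)
-rmr: delete recursively (deprecated; use -rm -r)
-touchz: create an empty (zero-length) file
-cat: print the contents of a file
-text: print a source file in text form
-get: download a file from Hadoop into an existing local directory
-mv: move a file within Hadoop
-kill: kill a running Hadoop job (an option of hadoop job, not of hadoop fs)
-du -h: show the size of each file under a directory in human-readable units
-du -s: show the total size of the files under a directory
-du -s -h: show the total storage used, in human-readable units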

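Web verification exercises for Hadoop

Verifying that HDFS has started
1. Check the running Java processes with jps:

jps

2. Open the NameNode web UI in a browser:
http://192.168.10.111:50070/
http://192.168.10.111:50070/dfshealth.html#tab-overview

Checking the ResourceManager status
1. Open the ResourceManager web UI in a browser to view cluster status, logs, and so on:
http://192.168.10.111:8088/
http://192.168.10.111:8088/cluster

Viewing NodeManager information on a worker node:
http://192.168.10.113:8042/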

————————Common HDFS commands————————
Start Hadoop
start-all.sh
Stop Hadoop
stop-all.sh
Upload a local file to HDFS
hadoop fs -put [local path] [HDFS directory]
Download an HDFS file to a local directory
hadoop fs -get [HDFS path] [local directory]
List the contents of a directory
hadoop fs -ls [directory]

Print the contents of an existing file
hadoop fs -cat [file_path]

Delete a file or directory on HDFS (-rm -r replaces the deprecated -rmr)
hadoop fs -rm -r [file or directory]
Create a new directory on HDFS
hadoop fs -mkdir /test
Rename a file on HDFS
hadoop fs -mv /test/ok1.txt /test/ok2.txt
Move a directory on HDFS
hadoop fs -mv /workspace/data /test/data
Leave Hadoop safe mode
hadoop dfsadmin -safemode leave
Distributed copy
hadoop distcp hdfs://linux1:9000/nebula_datacenter/import/data hdfs://linux1:9000/workspaces/copytest

Hadoop 2.x daemons
Process                  Start                             Stop
NameNode                 hadoop-daemon.sh start namenode   hadoop-daemon.sh stop namenode
DataNode                 hadoop-daemon.sh start datanode   hadoop-daemon.sh stop datanode
DFSZKFailoverController  hadoop-daemon.sh start zkfc       hadoop-daemon.sh stop zkfc

————————Common MapReduce commands————————
(The JobTracker and TaskTracker daemons below belong to Hadoop 1.x; Hadoop 2.x and later run MapReduce on YARN.)
Start MapReduce
start-mapred.sh
Stop MapReduce
stop-mapred.sh

Start the TaskTracker daemon
hadoop-daemon.sh start tasktracker
Stop the TaskTracker daemon
hadoop-daemon.sh stop tasktracker
List the running MapReduce jobs
hadoop job -list

Kill a running Hadoop job
hadoop job -kill [job-id]
Start the JobTracker daemon
hadoop-daemon.sh start jobtracker
Stop the JobTracker daemon
hadoop-daemon.sh stop jobtracker

Process      Start                                Stop
JobTracker   hadoop-daemon.sh start jobtracker    hadoop-daemon.sh stop jobtracker
TaskTracker  hadoop-daemon.sh start tasktracker   hadoop-daemon.sh stop tasktracker

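————————Common ZooKeeper commands————————
Start the ZooKeeper service
zkServer.sh start
Stop the ZooKeeper service
zkServer.sh stop
Check the ZooKeeper status
zkServer.sh status

Connect with the ZooKeeper client
zkCli.sh
List what ZooKeeper currently contains
ls /

Create a new znode holding the data myData
create /zk myData
Read the data of a znode
get /zk
Delete a znode
delete /zk
Test whether the server is running correctly
echo ruok | nc 127.0.0.1 2181
Print the server status and statistics
echo stat | nc 127.0.0.1 2181

Print the details of the server configuration
echo conf | nc 127.0.0.1 2181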

————————Common HBase commands————————
Start HBase
start-hbase.sh
Stop HBase
stop-hbase.sh
Start the HMaster daemon
hbase-daemon.sh start master
Stop the HMaster daemon
hbase-daemon.sh stop master
Start the HRegionServer daemon
hbase-daemon.sh start regionserver
Stop the HRegionServer daemon
hbase-daemon.sh stop regionserver

Scan a whole table
scan 'NB_TAB_IM_201401_001.001.001.001'
Scan a given number of rows starting from a given row key
scan 'NB_TAB_IM_201401_001.001.001.001',{STARTROW=>"D\x180271\x18ID8\x1820140124181727",LIMIT=>10}

Delete a table: first disable it
disable 'NB_TAB_IM_201401_001.001.001.001'

then drop it
drop 'NB_TAB_IM_201401_001.001.001.001'

Change the TTL of a table
Check which column families the table has under FAMILIES (usually D and F)
describe 'tablename'

Disable the table
disable 'tablename'

Change the value of the TTL attribute
alter 'tablename',{NAME=>'D',TTL=>'<seconds>'}
D is the column family whose TTL is being changed; the TTL value is given in seconds
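
Because the TTL is expressed in seconds, it is easy to get the value wrong by a factor of 60 or 24. The following is a small Python sketch for converting a retention period to seconds and building the alter statement shown above; both helper names are hypothetical and only format text, they do not talk to HBase.

```python
def ttl_seconds(days=0, hours=0):
    """Convert a retention period to the number of seconds HBase expects."""
    return (days * 24 + hours) * 3600


def alter_ttl_statement(table, family, seconds):
    """Build the HBase shell statement that sets the TTL of one column family."""
    return f"alter '{table}',{{NAME=>'{family}',TTL=>'{seconds}'}}"


ttl = ttl_seconds(days=7)
statement = alter_ttl_statement("tablename", "D", ttl)
print(statement)
```

The printed statement can be pasted into the HBase shell between the disable and enable steps.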

Finally, enable the table so that it can be used again
enable 'tablename'

Process        Start                                Stop
HMaster        hbase-daemon.sh start master         hbase-daemon.sh stop master
HRegionServer  hbase-daemon.sh start regionserver   hbase-daemon.sh stop regionserver

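————————Common Hive commands————————
Start Hive
(1) As a background service: nohup hive --service hiveserver 50031 &
(2) From the command line:
type hive and press Enter
Check whether Hive is running
(1) Background service: use jps to check that a RunJar process exists

(2) Command line: the hive prompt opens normally

Enter the Hive command line
Type hive and press Enter; the prompt looks like: hive >

List all tables
hive > show tables;
Create a table
hive > create table table_name (name string);
Drop a table
hive > drop table table_name;
Exit the client
hive > exit;

Query the data in a table
hive > select name from table_name;
Run an aggregate query
hive > select count(*) from table_name;
Show the definition of a table
hive > show create table test;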

Create the tables
create table student(id string,name string,birthday string,sex string) row format delimited fields terminated by '\t';
create table course(id string,name string,tid string) row format delimited fields terminated by '\t';

create table teacher(id string,name string) row format delimited fields terminated by '\t';

create table score(sid string,cid string,score int) row format delimited fields terminated by '\t';

Sample data (the fields must be tab-separated in the data files):
student table:
01 赵雷 1990-01-01 男
02 钱电 1990-12-21 男
03 孙风 1990-05-20 男
04 李云 1990-08-06 男
05 周梅 1991-12-01 女
06 吴兰 1992-03-01 女
07 郑竹 1989-07-01 女
08 王菊 1990-01-20 女
course table:
01 语文 02
02 数学 01
03 英语 03
teacher table:
01 张三
02 李四
03 王五
score table:
01 01 80
01 02 90
01 03 99
02 01 70
02 02 60
02 03 80
03 01 80
03 02 80
03 03 80
04 01 50
04 02 30
04 03 20
05 01 76
05 02 87
06 01 31
06 03 34
07 02 89
07 03 98

Load the data:
load data local inpath '/home/yunwei/fuxiaotong_test/student.csv' into table student;
load data local inpath '/home/yunwei/fuxiaotong_test/course.csv' into table course;
load data local inpath '/home/yunwei/fuxiaotong_test/teacher.csv' into table teacher;
load data local inpath '/home/yunwei/fuxiaotong_test/score.csv' into table score;

SQL statements:
1. List the students who scored higher in course '01' than in course '02', with both scores:
select stu.*, s3.score score_01, s4.score score_02 from student stu
join score s3 on stu.id=s3.sid and s3.cid='01'
join score s4 on stu.id=s4.sid and s4.cid='02'
where stu.id in (select s1.sid from score s1 join score s2 on s1.sid=s2.sid
where s1.cid='01'
and s2.cid='02' and s1.score>s2.score);

2. List every student's id, name, number of courses taken, and total score across all courses (count(s1.cid) is used so that a student with no scores is counted as 0, not 1):

select stu.id id, stu.name name, count(s1.cid) count, sum(s1.score) sumscore from

student stu left join score s1 on stu.id=s1.sid group by stu.name, stu.id;

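3. Count the male and female students

select stu.sex, count(*) from student stu group by stu.sex;
Note: writing count() with no argument fails with UDFArgumentException: Argument expected. Use count(*).

The same counts as a single row:
select sum(if(sex='男',1,0)) male, sum(if(sex='女',1,0)) female from student;

4. List the students whose name contains the character '风'
select * from student stu where stu.name like '%风%';
To match names that start with a given character instead, use a pattern such as '张%'.

5. Count the students enrolled in each course:
select c1.name name, s1.cid cid, count(*) count from score s1 join course c1 on c1.id=s1.cid group by s1.cid, c1.name;
Note: as in query 3, count() with no argument fails with UDFArgumentException: Argument expected.

6. List the students born in 1990 (substr positions start at 1):
select stu.* from student stu where substr(stu.birthday,1,4)='1990';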

7. List the average score of each course, sorted by average score descending and, for equal averages, by course id ascending:

select s1.cid cid, round(avg(s1.score),2) as avg_score from score s1 group by s1.cid order by avg_score desc, cid;

8. List the id and name of students who scored above 80 in course '01':

select stu.name name, stu.id id from score s1 join student stu on stu.id=s1.sid where s1.cid='01' and s1.score>80;

9. List the students whose average score is above 60, with the average:
select sid, avg(score) avgscore from score group by sid having avg(score)>60;

10. List the id and name of students who have taken both course '01' and course '02':
select stu.id, stu.name from student stu join score s1 on s1.sid=stu.id join score s2 on s2.sid=stu.id where s1.cid='01' and s2.cid='02';

11. List the id and name of students whose scores are all below 60:
select stu.id, stu.name from student stu left join score sc on stu.id=sc.sid and sc.score>=60 group by stu.id, stu.name having sum(case when sc.sid is null then 0 else 1 end)=0;

12. List the id and name of students who have taken every course taught by teacher '张三' (the teacher-to-course condition belongs in the inner join, and a null check must use is null, not =null):
select stu.id, stu.name from course c join teacher t on t.id=c.tid cross join student stu left join score sc on stu.id=sc.sid and c.id=sc.cid where t.name='张三' group by stu.id, stu.name having sum(case when sc.sid is null then 1 else 0 end)=0;

13. List the highest and lowest score of each course
select cid, max(score) maxscore, min(case when score is null then 0 else score end) minscore from score group by cid;

14. List the name and score of students who took '数学' (math) and scored below 60
select stu.name, score.score from score join student stu on stu.id=score.sid join course on course.id=score.cid where course.name='数学' and score.score<60;

15. List the student name, course name, and score for every score above 70
select stu.name, course.name, score.score from student stu join score on stu.id=score.sid join course on course.id=score.cid where score.score>70;

16. List the courses that have failing scores (below 60), sorted by course id descending
select score.cid, course.name from course join score on score.cid=course.id where score.score<60 group by score.cid, course.name order by score.cid desc;

17. List the id and name of students who scored above 80 in course '03'
select stu.id, stu.name from student stu join score sc on stu.id=sc.sid where sc.score>80 and sc.cid='03';

18. Count the students who have taken at least one course
select count(stu_count.sid) from (select sid from score group by sid) stu_count;

19. List the average score of each course taught by each teacher, from highest to lowest
select cs.tid, cs.id, avg(sc.score) avgscore from score sc join course cs on sc.cid=cs.id join teacher on teacher.id=cs.tid group by cs.tid, cs.id order by avgscore desc;

20. List every student's scores in '语文' (Chinese), '数学' (math), and '英语' (English), sorted by average score from highest to lowest
Output columns: student id, Chinese, math, English, number of courses, average score
select sc.sid, max(case course.name when '语文' then sc.score else 0 end) yuwen, max(case course.name when '数学' then sc.score else 0 end) shuxue, max(case course.name when '英语' then sc.score else 0 end) yingyu, count(sc.cid) kechengshu, avg(sc.score) pingjunfen from score sc join course on sc.cid=course.id group by sc.sid order by pingjunfen desc;
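
Without a Hive installation at hand, the aggregate queries can still be sanity-checked against the sample data. The sketch below loads the score table into an in-memory SQLite database and runs query 9 (students whose average score is above 60). SQLite is only a stand-in here: its SQL dialect differs from HiveQL, so this checks the logic of the query, not Hive's behavior.

```python
import sqlite3

# Rows of the score table from the sample data: (sid, cid, score).
SCORES = [
    ("01", "01", 80), ("01", "02", 90), ("01", "03", 99),
    ("02", "01", 70), ("02", "02", 60), ("02", "03", 80),
    ("03", "01", 80), ("03", "02", 80), ("03", "03", 80),
    ("04", "01", 50), ("04", "02", 30), ("04", "03", 20),
    ("05", "01", 76), ("05", "02", 87),
    ("06", "01", 31), ("06", "03", 34),
    ("07", "02", 89), ("07", "03", 98),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table score (sid text, cid text, score integer)")
conn.executemany("insert into score values (?, ?, ?)", SCORES)

# Query 9: average score per student, keeping only averages above 60.
rows = conn.execute(
    "select sid, avg(score) as avgscore from score "
    "group by sid having avg(score) > 60 order by sid"
).fetchall()

for sid, avgscore in rows:
    print(sid, round(avgscore, 2))
```

The same pattern works for the other single-table queries; queries that rely on Hive-specific functions such as if() would need rewriting for SQLite.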

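I. Getting started with Hadoop
1. Common port numbers
hadoop3.x
HDFS NameNode internal communication port: 8020/9000/9820
HDFS NameNode web UI port for users: 9870
YARN web UI port for viewing running jobs: 8088
History server: 19888
hadoop2.x
HDFS NameNode internal communication port: 8020/9000
HDFS NameNode web UI port for users: 50070
YARN web UI port for viewing running jobs: 8088
History server: 19888
2. Common configuration files
3.x core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml workers
2.x core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml slaves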

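II. HDFS
1. HDFS block size (frequent interview topic)
The choice depends on disk read/write speed
In practice it is usually 128 MB (small and mid-sized companies) or 256 MB (large companies)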
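
The link between disk speed and block size is often explained with a rule of thumb: choose a block large enough that the disk seek time is about 1% of the time spent transferring the block. The Python sketch below works through that arithmetic. The 1% target, the 10 ms seek time, and the 100 MB/s and 200 MB/s transfer rates are illustrative assumptions, not figures from these notes, and the function name is hypothetical.

```python
def suggested_block_size_mb(seek_time_s, transfer_rate_mb_s, seek_fraction=0.01):
    """Smallest power-of-two block size (in MB) whose transfer time keeps
    the seek time at or below seek_fraction of it."""
    transfer_time_s = seek_time_s / seek_fraction   # e.g. 10 ms / 1% = 1 s
    raw_mb = transfer_rate_mb_s * transfer_time_s   # data moved in that time
    size = 1
    while size < raw_mb:                            # round up to a power of two
        size *= 2
    return size


slow_disk = suggested_block_size_mb(seek_time_s=0.010, transfer_rate_mb_s=100)
fast_disk = suggested_block_size_mb(seek_time_s=0.010, transfer_rate_mb_s=200)
print(slow_disk, fast_disk)
```

Under these assumed numbers the slower disk lands on 128 MB and the faster one on 256 MB, which matches the two sizes mentioned above.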
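2. HDFS shell operations (important for development)
3. HDFS read and write flow (frequent interview topic)
III. MapReduce
1. InputFormat
1) The default is TextInputFormat: the key is the byte offset of the line, the value is the content of the line
2) For small files, CombineTextInputFormat groups multiple files together and splits them as one unit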
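
To make the TextInputFormat key/value pairs concrete, here is a minimal Python sketch that produces (byte offset, line) pairs from a byte string. It only mimics the shape of the records; the function name is hypothetical, and real input splits, which can start in the middle of a line, are ignored.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset of the first byte of the line; the value
        # is the line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


sample = "hello hadoop\n你好\nlast line".encode("utf-8")
records = list(text_records(sample))
for key, value in records:
    print(key, value)
```

Note that the offsets count bytes, not characters, so the two-character line in the sample advances the offset by seven bytes (six for the text, one for the newline).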
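2. Mapper
setup(): initialization; map(): the user's business logic; cleanup(): release resources
3. Partitioning
The default partitioner is HashPartitioner, which assigns a record to partition (hash of the key) % (number of reduce tasks)
Custom partitioners can be written as well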
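
The default partitioning rule can be sketched in a few lines of Python. HashPartitioner masks the key's hash code to a non-negative value before taking the remainder. In the sketch below, Java's String.hashCode stands in for the key's hashCode(); that choice is an assumption for illustration, since Hadoop's Text type computes its hash over bytes in a different way, so the partition numbers will not match a real job that uses Text keys.

```python
def java_string_hash(s):
    """Java's String.hashCode for text in the Basic Multilingual Plane,
    returned as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key_hash, num_reduce_tasks):
    """(hash & Integer.MAX_VALUE) % numReduceTasks: clear the sign bit, then take the remainder."""
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks


for key in ["hive", "spark", "flink", "mr"]:
    print(key, hash_partition(java_string_hash(key), 3))
```

Because the partition depends only on the key, every record with the same key reaches the same reducer; a custom partitioner replaces this rule when keys need to be routed by business meaning instead.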
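4. Sorting
1) Partial sort: each output file is sorted internally.
2) Total sort: a single reducer sorts all of the data.
3) Secondary sort: define a custom ordering by implementing the WritableComparable interface and overriding compareTo
Example: sort by total traffic descending, then by upstream traffic ascending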
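
The ordering in that example can be illustrated with a Python sort key. This is a stand-in for the comparison a compareTo method would implement in a WritableComparable; the record layout and the user names are made up for the illustration.

```python
# Each record: (user, upstream traffic, downstream traffic).
flows = [
    ("user_a", 100, 900),
    ("user_b", 400, 600),
    ("user_c", 200, 300),
    ("user_d", 300, 700),
]


def sort_key(record):
    """Total traffic descending first, upstream traffic ascending second."""
    _, up, down = record
    return (-(up + down), up)


ordered = sorted(flows, key=sort_key)
for user, up, down in ordered:
    print(user, up + down, up)
```

Three of the sample users share the same total, so the second criterion decides their relative order.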
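5. Combiner
Prerequisite: it must not change the final result (summing is safe; averaging is not)
It pre-aggregates on the map side, which is one way to mitigate data skew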
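
A small Python example shows why a sum can be combined early while an average cannot. The two inner lists stand for the values of one key on two different map tasks; the numbers are arbitrary.

```python
# Values of a single key as seen by two map tasks.
map_outputs = [[1, 2, 3], [10]]
all_values = [v for part in map_outputs for v in part]

# Sum: combining per map task first gives the same answer.
true_sum = sum(all_values)
combined_sum = sum(sum(part) for part in map_outputs)

# Average: averaging the per-task averages gives a different answer.
true_avg = sum(all_values) / len(all_values)
avg_of_avgs = sum(sum(part) / len(part) for part in map_outputs) / len(map_outputs)

# A combiner-safe average carries (sum, count) pairs and divides only at the end.
partials = [(sum(part), len(part)) for part in map_outputs]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
safe_avg = total / count

print(true_sum, combined_sum, true_avg, avg_of_avgs, safe_avg)
```

The usual workaround for averages is therefore to have the combiner emit partial sums and counts and let the reducer do the division.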
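6. Reducer
The user's business logic;
setup(): initialization; reduce(): the user's business logic; cleanup(): release resources
7. OutputFormat
1) The default is TextOutputFormat, which writes one record per line to a file
2) Custom output formats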
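IV. YARN
1. How YARN works (interview question)

2. YARN schedulers
1) FIFO / Capacity / Fair
2) Apache Hadoop defaults to the Capacity Scheduler; CDH defaults to the Fair Scheduler
3) The Fair and Capacity schedulers both start with a single default queue; additional queues have to be created
4) Small and mid-sized companies: one queue per framework (hive, spark, flink, mr)

5) Mid-sized and large companies: one queue per business module (login, registration, shopping cart, marketing)
6) Benefits: decoupling and lower risk; during peak events such as 11.11 and 6.18, less important queues can be degraded
7) Characteristics of the schedulers:
In common: both support multiple queues, can borrow resources, and support multiple users
Differences: the Capacity Scheduler serves the jobs that arrived first
the Fair Scheduler lets the jobs in a queue share the queue's resources fairly
8) Choosing one in production:
Small and mid-sized companies with modest concurrency requirements: Capacity
Mid-sized and large companies with higher concurrency requirements: Fair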
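
The difference between the two policies described above can be sketched with a toy allocation inside one queue. This is a heavily simplified model under stated assumptions: a single resource dimension, fixed demands, no preemption, and max-min fair sharing as the fair policy. The function names are hypothetical and the code is not how either scheduler is implemented.

```python
def fifo_allocate(capacity, demands):
    """Serve jobs in arrival order until the queue's capacity runs out."""
    allocation = []
    for demand in demands:
        granted = min(demand, capacity)
        allocation.append(granted)
        capacity -= granted
    return allocation


def fair_allocate(capacity, demands):
    """Max-min fair sharing: split capacity evenly, then hand back whatever
    a job does not need to the jobs that still want more."""
    allocation = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)
        satisfied = {i for i in active if demands[i] - allocation[i] <= share}
        if not satisfied:
            # Nobody can be fully served: everyone gets an equal share.
            for i in active:
                allocation[i] += share
            remaining = 0.0
        else:
            for i in satisfied:
                remaining -= demands[i] - allocation[i]
                allocation[i] = float(demands[i])
            active -= satisfied
    return allocation


demands = [8, 6, 2]          # containers requested by three jobs, in arrival order
fifo_result = fifo_allocate(12, demands)
fair_result = fair_allocate(12, demands)
print(fifo_result, fair_result)
```

In the first-come-first-served case the last job waits with nothing, while under fair sharing every job makes progress, which is why the fair policy tends to suit workloads with many concurrent jobs.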
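3. What developers need to master:
1) How queues work

2) Common YARN commands
3) Core parameter configuration
4) Configuring the Capacity Scheduler and the Fair Scheduler
5) Using the Tool interface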