Ant Group (蚂蚁) Big Data Development, Class of 2022 Summer Internship: First- and Second-Round Interview Notes

Author: 超威蓝喵（校招版
Link: https://www.nowcoder.com/discuss/630356?source_id=discuss_experience_nctrack&channel=-1
Source: 牛客网 (Nowcoder)

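First round
Java
Are you familiar with data structures? What is the difference between a stack and a queue? Implement a stack with two queues.
Describe ArrayList and LinkedList: their differences and use cases.
Your résumé mentions GC; talk about it.
Have you used concurrency? (No.)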
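For the "stack from two queues" question above, here is a minimal sketch of one common approach, not necessarily the answer the interviewer had in mind. The class name TwoQueueStack and its method names are made up for illustration. It makes push cost O(n) so that pop and peek stay O(1); the opposite trade-off (cheap push, expensive pop) is equally valid.

```java
import java.util.ArrayDeque;
import java.util.NoSuchElementException;
import java.util.Queue;

/** LIFO stack built only from FIFO queue operations (offer, poll, peek, isEmpty). */
class TwoQueueStack<T> {
    // `main` always holds the elements in stack order: newest at the head.
    private Queue<T> main = new ArrayDeque<>();
    // `helper` is empty between calls; it is scratch space used by push.
    private Queue<T> helper = new ArrayDeque<>();

    /** O(n): enqueue the new element first, then move the older elements behind it. */
    public void push(T value) {
        helper.offer(value);
        while (!main.isEmpty()) {
            helper.offer(main.poll());
        }
        // Swap roles so that `main` is again the queue that holds the elements.
        Queue<T> tmp = main;
        main = helper;
        helper = tmp;
    }

    /** O(1): the newest element sits at the head of `main`. */
    public T pop() {
        if (main.isEmpty()) {
            throw new NoSuchElementException("stack is empty");
        }
        return main.poll();
    }

    /** O(1): look at the newest element without removing it. */
    public T peek() {
        if (main.isEmpty()) {
            throw new NoSuchElementException("stack is empty");
        }
        return main.peek();
    }

    public boolean isEmpty() {
        return main.isEmpty();
    }
}
```

Note that ArrayDeque rejects null elements, so this sketch does too.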

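Hive
Explain Hive data warehouse layering: the layering methodology (?) and the benefits of layering (?).
How Hive SQL executes: the SQL is compiled into MR jobs, the jobs are submitted to YARN, resources are allocated, map, shuffle, and reduce run, and the result is written to HDFS.
Hive optimization: I described a case where one reduce task was very heavy and kept holding resources, and talked about Hive load balancing, replacing MR with Spark, and setting up a capacity queue on YARN. The question was probably about data skew, but my data set was so small that I didn't dare bring it up, so I answered from the load angle instead.
Follow-up: given a city field and a count, the GROUP BY suffers from data skew. What do you do?
(My answer:) For the cities with large counts, pull them out with a WHERE clause and count them separately...
Joining a large table with a small table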
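For the GROUP BY skew question above, a commonly cited alternative to splitting hot cities out with WHERE is two-stage aggregation with a random "salt". The sketch below only simulates the idea in plain Java on an in-memory list; it is not Hive code, and the class SaltedCount, its methods, and the "#" separator are hypothetical names chosen for illustration. Stage 1 groups by (city, salt), so a hot city is spread across several keys (and therefore several reducers in a real job); stage 2 strips the salt and sums the partial counts. Hive's hive.groupby.skewindata setting is documented as triggering a comparable two-job plan.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SaltedCount {
    /** Stage 1: count by (city, salt); a hot city is split across up to `buckets` keys. */
    static Map<String, Long> partialCounts(List<String> cities, int buckets, Random rnd) {
        Map<String, Long> partial = new HashMap<>();
        for (String city : cities) {
            // Assumption for this sketch: city names never contain '#'.
            String saltedKey = city + "#" + rnd.nextInt(buckets);
            partial.merge(saltedKey, 1L, Long::sum);
        }
        return partial;
    }

    /** Stage 2: strip the salt and add up the partial counts per city. */
    static Map<String, Long> finalCounts(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String saltedKey = e.getKey();
            String city = saltedKey.substring(0, saltedKey.lastIndexOf('#'));
            total.merge(city, e.getValue(), Long::sum);
        }
        return total;
    }
}
```

The final counts are the same as a plain GROUP BY city; only the intermediate distribution of work changes.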
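Other
Where did your project data come from?
Level of Java experience: I have written demos but never worked on an engineering project.
Advice: think more, and dig into why things are done the way they are.
Additional written test
Algorithm: 剑指 Offer 53
SQL:
User behavior table tracking_log, with roughly these fields: (user_id 'user ID', opr_id 'operation ID', log_time 'operation time')
Requirements:
1. Compute the number of visitors per day and their average number of operations.
2. Count, per day, the users who satisfy this condition: operation A is followed by operation B, and A and B must be adjacent.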
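The post does not say which part of problem 53 was asked. Assuming it was the first part (count how many times a target appears in a sorted array), one possible sketch uses two binary searches. The class and method names here are illustrative only.

```java
class SortedCount {
    /** Index of the first element >= target, or nums.length if there is none. */
    static int lowerBound(int[] nums, long target) {
        int lo = 0;
        int hi = nums.length;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (nums[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** O(log n): occurrences = (first index >= target + 1) - (first index >= target). */
    static int countOccurrences(int[] nums, int target) {
        // Widen to long so that target + 1 cannot overflow for Integer.MAX_VALUE.
        return lowerBound(nums, (long) target + 1) - lowerBound(nums, target);
    }
}
```

The array must be sorted in ascending order for the result to be meaningful.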

Answer:
1.

select log_date, count(distinct user_id) as user_num, avg(num_ci) as avg_operation_count
from
(select date(log_time) as log_date, user_id, count(opr_id) as num_ci from tracking_log group by date(log_time), user_id) t
group by log_date

What I wrote:

select T.log_time, count(distinct T.user_id) as visitor_count, count(T.opr_id)/count(distinct T.user_id) as avg_operation_count
from tracking_log as T
group by T.log_time
2.


I couldn't write this one.

select log_date, count(distinct user_id) as user_num
from
(select user_id, date(log_time) as log_date, opr_id, lead(opr_id,1) over(partition by user_id order by log_time) as opr_id_2 from tracking_log) t
where opr_id='A' and opr_id_2='B'
group by log_date
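Second round
Self-introduction
Walk through your paper
Walk through your project
Why did the project use Flume + Kafka?
Hive data warehouse layering: what does each layer do?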

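The second-round interviewer told me I had passed. A day later I was rejected; I was told my GPA was too low and HR would not accept it.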

Big data notes

20210401-03
