1 Details of the Job / map / reduce process
Hence, when grouping, do not distribute/partition by placeid.
Reference: https://www.cnblogs.com/qiu-hua/p/15154897.html
To support copying files to tasks, Hadoop provides the DistributedCache class, used as follows:
(1) Call the static method DistributedCache.addCacheFile() to register the file to copy; its argument is the file's URI (for a file on HDFS, e.g. hdfs://namenode:9000/home/XXX/file, where 9000 is the configured NameNode port). Before the job starts, the JobTracker fetches this URI list and copies the corresponding files to the local disk of every TaskTracker. (2) Call DistributedCache.getLocalCacheFiles() to obtain the local paths of the files, then read them with the standard file I/O API.
In the map phase, mark every record as ...
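Hive relies on the same mechanism for map joins: files registered with ADD FILE, and the hash table built from the small side of a map join, are shipped to every task through the distributed cache. A minimal sketch, assuming hypothetical tables fact_big / dim_small and a hypothetical lookup file path:
-- ship a local file to every task via the distributed cache
ADD FILE /tmp/lookup.txt;
-- MAPJOIN hint: Hive hashes dim_small, distributes the hash table through
-- the distributed cache, and joins in the mappers, so no reduce phase is needed
select /*+ MAPJOIN(d) */ f.imsi, d.placeid
from fact_big f
join dim_small d on f.placeid = d.placeid;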
2 Are the stored ORC (and similar) files 64 MB after compression? More detail on small files
3 Details of how Spark Structured Streaming persists state
4
??? Is EXISTS the only form that works?
Semi Join (like pulling out a single value, e.g. my car, and filtering on it separately)
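Hive rewrites IN/EXISTS subqueries such as the one in the query below into a LEFT SEMI JOIN; a minimal equivalent sketch against the same tables:
-- equivalent to: placeid in (select placeid from dim_mdb_camera where cameratype=1);
-- only left-table columns may appear in the select list of a semi join
select t.imsi, t.placeid
from dwd_test05.dwd_imsirecord t
left semi join dim_test05.dim_mdb_camera c
on (t.placeid = c.placeid and c.cameratype = 1)
where t.day = '20210814';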
5
How do data types affect MapReduce? Is everything simply converted after being read?
6 Why uptime came out null (and placeid looked wrong): Hive fills dynamic partition columns positionally from the end of the select list, so placeid must be the last selected column; with uptime listed last instead, substr(placeid,22) gets cast into the uptime column (yielding NULLs) and uptime values land in the placeid partition. Fixed in the insert below.
-- allow dynamic partitions without requiring any static partition column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- vectorized execution on the map and reduce side
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
-- use ORC indexes for predicate pushdown
set hive.optimize.index.filter=true;
-- enforce bucketing on insert and enable (sort-merge) bucket map joins
set hive.enforce.bucketing=true;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
CREATE TABLE if not exists dwd_test05.dwd_predeal_imsirecord_t1(
imsi bigint,
uptime bigint)
PARTITIONED BY (
day string, placeid string)
-- bucket and sort by uptime so the table can take part in sort-merge bucket map joins
clustered by(uptime) sorted by(uptime) into 2 buckets
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
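With the settings above, a join between two tables that are both clustered and sorted by the join key into the same number of buckets can run as a sort-merge bucket map join. A sketch, assuming a hypothetical second table dwd_predeal_imsirecord_t2 created with the same clustered/sorted-by-uptime spec:
-- both sides bucketed and sorted on uptime into 2 buckets, so Hive can
-- merge matching buckets in the mappers without a shuffle
select a.imsi, a.uptime
from dwd_test05.dwd_predeal_imsirecord_t1 a
join dwd_test05.dwd_predeal_imsirecord_t2 b
on a.uptime = b.uptime
where a.day='20210814' and b.day='20210814';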
-- count_filter: per (hour, placeid), imsi seen more than 1000 times, i.e. hot records to exclude
with count_filter as
(
select
hour,
placeid,
imsi,
count(1) as cn
from
(
select
from_unixtime(uptime,'HH') as hour,
imsi,
placeid
from
dwd_test05.dwd_imsirecord
where day='20210814'
and placeid in(select placeid from dim_test05.dim_mdb_camera
where cameratype=1
group by placeid)
)t
group by hour,placeid,imsi
having cn > 1000
order by cn desc -- note: order by in a CTE forces a single-reducer stage and is not needed for the anti-join below
),
-- process: the full day's records, restricted to cameratype=1 places
process as
(
select
from_unixtime(uptime,'HH') as hour
,placeid
,imsi
,uptime
from
dwd_test05.dwd_imsirecord
where day='20210814'
and placeid in(select placeid from dim_test05.dim_mdb_camera
where cameratype=1
group by placeid)
)
-- day is a static partition value; placeid is dynamic and mapped positionally,
-- so it must be the LAST column of the select list
insert overwrite table dwd_test05.dwd_predeal_imsirecord_t1 partition(day='20210814', placeid)
select
CAST(y.imsi AS BIGINT) as imsi,
y.uptime as uptime,
substr(y.placeid,22) as placeid
from
(
select
x.imsi,
x.uptime,
x.placeid,
x.slice,
row_number() over (partition by x.imsi,x.slice,x.placeid order by x.uptime desc ) rank
from
(
select
t1.imsi
,t1.placeid
,t1.uptime
,floor(t1.uptime/360) as slice -- 360-second (6-minute) time slice
from process t1
-- anti-join: keep only records whose (hour, placeid, imsi) is NOT in count_filter
left join count_filter t2
on ( t1.hour=t2.hour and t1.placeid=t2.placeid and t1.imsi=t2.imsi)
where t2.imsi is null
) x
group by x.imsi,x.uptime,x.placeid,x.slice
-- distribute by uptime (not the skewed placeid) and sort to match the bucketed table
distribute by x.uptime
sort by x.uptime
) y
where y.rank=1; -- keep only the latest record per (imsi, slice, placeid)
Why can't the select also include this column (y.day as day / x.day)? Because day is supplied as a static value in the PARTITION clause, it must not appear again in the select list; besides, neither CTE projects a day column, so y.day / x.day would not resolve anyway.
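To spot-check that the insert really produced two sorted buckets per partition, Hive's TABLESAMPLE can read a single bucket; a sketch:
-- sample bucket 1 of 2, hashed on the bucketing column uptime
select imsi, uptime
from dwd_test05.dwd_predeal_imsirecord_t1
tablesample(bucket 1 out of 2 on uptime)
where day='20210814'
limit 10;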
9 Can a string that is not a digit like '9' be force-cast to int? And how do you speed up comparisons by converting to int?
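For the first half: a CAST to int in Hive never throws; a non-numeric string simply becomes NULL. A quick check:
-- non-numeric strings cast to NULL rather than raising an error
select cast('9' as int) as ok,    -- 9
       cast('abc' as int) as bad; -- NULL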
