1 Job、map、reduce过程细节

    So when grouping, do not partition by placeid.

    https://www.cnblogs.com/qiu-hua/p/15154897.html

    To support copying files out to every node, Hadoop provides the DistributedCache class, used as follows:
    (1) The user calls the static method DistributedCache.addCacheFile() to specify the file to copy; its argument is the file's URI (for a file on HDFS it looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the JobTracker fetches this URI list and copies the corresponding files to the local disk of each TaskTracker. (2) The user calls DistributedCache.getLocalCacheFiles() to get the local file paths, then reads the files with the standard file I/O API.
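    The point of this file distribution is the map-side join pattern: each task loads the small cached file into memory once, then enriches every record with a local lookup. A minimal sketch in pure Python (no Hadoop; the tab-separated record layout is an illustrative assumption):

    ```python
    # Simulate the map-side join that DistributedCache enables:
    # a small lookup table is loaded into memory once per task,
    # then each mapped record is enriched by a local dict lookup.

    def load_cache(lines):
        # Parse "id\tname" lines from the cached side file into a dict.
        table = {}
        for line in lines:
            key, value = line.strip().split("\t")
            table[key] = value
        return table

    def map_join(records, cache):
        # For each (id, payload) record, attach the cached value if present.
        for rec_id, payload in records:
            yield rec_id, payload, cache.get(rec_id)

    cache = load_cache(["1\talice", "2\tbob"])
    joined = list(map_join([("1", "x"), ("3", "y")], cache))
    ```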

    In the map phase, every record is tagged into <key, value> form, where the key is the id.
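    This tagging is what makes a reduce-side join work: each mapped value carries a tag naming its source table, so the reducer can pair rows across tables within a key group. A sketch in pure Python standing in for the framework (the "L"/"R" tags are illustrative):

    ```python
    from collections import defaultdict

    def map_tag(records, tag):
        # Emit (key, (tag, row)) so the reducer can tell which table a row came from.
        for key, row in records:
            yield key, (tag, row)

    def reduce_join(tagged):
        # Group by key, then pair every left row with every right row (inner join).
        groups = defaultdict(list)
        for key, tagged_value in tagged:
            groups[key].append(tagged_value)
        for key, tvs in sorted(groups.items()):
            left = [row for tag, row in tvs if tag == "L"]
            right = [row for tag, row in tvs if tag == "R"]
            for l in left:
                for r in right:
                    yield key, l, r

    tagged = list(map_tag([("1", "a")], "L")) + list(map_tag([("1", "b")], "R"))
    result = list(reduce_join(tagged))
    ```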

    2 Are stored ORC (and similar) files 64 MB after compression? More detail on small files.

    3 Details of how Spark Structured Streaming persists state

    4
    Besides EXISTS, what else works?
    Semi join (like pulling out the one value for my car and filtering on it alone)
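    A semi join keeps only left-side rows whose key appears on the right, emitting each left row at most once — the behavior Hive's LEFT SEMI JOIN (or an EXISTS/IN subquery) gives, as opposed to an inner join, where right-side duplicates multiply rows. A minimal sketch:

    ```python
    def semi_join(left, right_keys):
        # Keep each left row at most once if its key exists on the right;
        # duplicate keys on the right never multiply output rows.
        keys = set(right_keys)
        return [row for row in left if row[0] in keys]

    rows = semi_join([("a", 1), ("b", 2)], ["a", "a", "c"])
    ```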

    5
    How do data types affect MapReduce? Is everything simply converted after being read?

    6 Why uptime is empty and placeid is a single value

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.vectorized.execution.enabled=true;
    set hive.vectorized.execution.reduce.enabled=true;
    set hive.optimize.index.filter=true;
    set hive.enforce.bucketing=true;
    set hive.auto.convert.sortmerge.join=true;
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    set hive.auto.convert.sortmerge.join.noconditionaltask=true;
    CREATE TABLE IF NOT EXISTS dwd_test05.dwd_predeal_imsirecord_t1 (
      imsi bigint,
      uptime bigint)
    PARTITIONED BY (day string, placeid string)
    CLUSTERED BY (uptime) SORTED BY (uptime) INTO 2 BUCKETS
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

    with count_filter as
    (
      select
        hour,
        placeid,
        imsi,
        count(1) as cn
      from
      (
        select
          from_unixtime(uptime, 'HH') as hour,
          imsi,
          placeid
        from dwd_test05.dwd_imsirecord
        where day = '20210814'
          and placeid in (select placeid from dim_test05.dim_mdb_camera
                          where cameratype = 1
                          group by placeid)
      ) t
      group by hour, placeid, imsi
      having cn > 1000
      order by cn desc
    ),
    process as
    (
      select
        from_unixtime(uptime, 'HH') as hour
        ,placeid
        ,imsi
        ,uptime
      from dwd_test05.dwd_imsirecord
      where day = '20210814'
        and placeid in (select placeid from dim_test05.dim_mdb_camera
                        where cameratype = 1
                        group by placeid)
    )
    insert overwrite table dwd_test05.dwd_predeal_imsirecord_t1 partition(day='20210814', placeid)
    select
      CAST(y.imsi AS BIGINT) as imsi,
      substr(y.placeid, 22) as placeid,
      y.uptime as uptime
    from
    (
      select
        x.imsi,
        x.uptime,
        x.placeid,
        x.slice,
        row_number() over (partition by x.imsi, x.slice, x.placeid order by x.uptime desc) rank
      from
      (
        select
          t1.imsi
          ,t1.placeid
          ,t1.uptime
          ,floor(t1.uptime/360) as slice
        from process t1
        left join count_filter t2
          on (t1.hour = t2.hour and t1.placeid = t2.placeid and t1.imsi = t2.imsi)
        where t2.imsi is null
      ) x
      group by x.imsi, x.uptime, x.placeid, x.slice
      distribute by x.uptime
      sort by x.uptime
    ) y
    where y.rank = 1;
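    The query's core logic — anti-join away records belonging to noisy (hour, placeid, imsi) groups, then keep only the latest record per (imsi, 360-second slice, placeid) — can be sketched in pure Python (the sample records are illustrative):

    ```python
    import math

    def dedup(records, noisy_keys):
        # records: (imsi, placeid, hour, uptime) tuples.
        # 1. Anti-join: drop records whose (hour, placeid, imsi) group is noisy
        #    (the query's "left join ... where t2.imsi is null").
        kept = [r for r in records if (r[2], r[1], r[0]) not in noisy_keys]
        # 2. Keep the record with the largest uptime per (imsi, slice, placeid),
        #    where slice = floor(uptime / 360) buckets time into 6-minute windows
        #    (the query's row_number() ... where rank = 1).
        best = {}
        for imsi, placeid, hour, uptime in kept:
            key = (imsi, math.floor(uptime / 360), placeid)
            if key not in best or uptime > best[key][3]:
                best[key] = (imsi, placeid, hour, uptime)
        return sorted(best.values())

    recs = [(1, "p", "01", 100), (1, "p", "01", 200), (2, "p", "01", 50)]
    out = dedup(recs, {("01", "p", 2)})
    ```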

    Why can't this field be included as well?
    y.day as day,
    x.day,

    9 Can non-numeric strings be force-cast to int? How do you speed up comparisons by converting to int?