1 Job、map、reduce过程细节

    So when grouping, do not partition by placeid.

    https://www.cnblogs.com/qiu-hua/p/15154897.html

    To support copying files out to every node, Hadoop provides the DistributedCache class, used as follows:
    (1) The user calls the static method DistributedCache.addCacheFile() to specify the file to copy; its argument is the file's URI (for a file on HDFS it looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the JobTracker fetches this URI list and copies the corresponding files to the local disk of each TaskTracker. (2) The user calls DistributedCache.getLocalCacheFiles() to get the local file paths, then reads the files with the standard file I/O API.
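    The point of this file distribution is the map-side join pattern: each task loads the small cached file into memory once, then enriches every record with a local lookup. A minimal sketch in pure Python (no Hadoop; the tab-separated record layout is an illustrative assumption):

    ```python
    # Simulate the map-side join that DistributedCache enables:
    # a small lookup table is loaded into memory once per task,
    # then each mapped record is enriched by a local dict lookup.

    def load_cache(lines):
        # Parse "id\tname" lines from the cached side file into a dict.
        table = {}
        for line in lines:
            key, value = line.strip().split("\t")
            table[key] = value
        return table

    def map_join(records, cache):
        # For each (id, payload) record, attach the cached value if present.
        for rec_id, payload in records:
            yield rec_id, payload, cache.get(rec_id)

    cache = load_cache(["1\talice", "2\tbob"])
    joined = list(map_join([("1", "x"), ("3", "y")], cache))
    ```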

    In the map phase, every record is tagged into <key, value> form, where the key is the id.
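    This tagging is what makes a reduce-side join work: each mapped value carries a tag naming its source table, so the reducer can pair rows across tables within a key group. A sketch in pure Python standing in for the framework (the "L"/"R" tags are illustrative):

    ```python
    from collections import defaultdict

    def map_tag(records, tag):
        # Emit (key, (tag, row)) so the reducer can tell which table a row came from.
        for key, row in records:
            yield key, (tag, row)

    def reduce_join(tagged):
        # Group by key, then pair every left row with every right row (inner join).
        groups = defaultdict(list)
        for key, tagged_value in tagged:
            groups[key].append(tagged_value)
        for key, tvs in sorted(groups.items()):
            left = [row for tag, row in tvs if tag == "L"]
            right = [row for tag, row in tvs if tag == "R"]
            for l in left:
                for r in right:
                    yield key, l, r

    tagged = list(map_tag([("1", "a")], "L")) + list(map_tag([("1", "b")], "R"))
    result = list(reduce_join(tagged))
    ```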

    2 Are stored ORC (and similar) files 64 MB after compression? More detail on small files.

    3 Details of how Spark Structured Streaming persists state

    4
    Besides EXISTS, what else works?
    Semi join (like pulling out the one value for my car and filtering on it alone)
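    A semi join keeps only left-side rows whose key appears on the right, emitting each left row at most once — the behavior Hive's LEFT SEMI JOIN (or an EXISTS/IN subquery) gives, as opposed to an inner join, where right-side duplicates multiply rows. A minimal sketch:

    ```python
    def semi_join(left, right_keys):
        # Keep each left row at most once if its key exists on the right;
        # duplicate keys on the right never multiply output rows.
        keys = set(right_keys)
        return [row for row in left if row[0] in keys]

    rows = semi_join([("a", 1), ("b", 2)], ["a", "a", "c"])
    ```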

    5
    How do data types affect MapReduce? Is everything simply converted after being read?

    6 Why uptime is empty and placeid is a single value

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.vectorized.execution.enabled=true;
    set hive.vectorized.execution.reduce.enabled=true;
    set hive.optimize.index.filter=true;
    set hive.enforce.bucketing=true;
    set hive.auto.convert.sortmerge.join=true;
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    set hive.auto.convert.sortmerge.join.noconditionaltask=true;
    CREATE TABLE IF NOT EXISTS dwd_test05.dwd_predeal_imsirecord_t1 (
      imsi bigint,
      uptime bigint)
    PARTITIONED BY (day string, placeid string)
    CLUSTERED BY (uptime) SORTED BY (uptime) INTO 2 BUCKETS
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

    with count_filter as
    (
      select
        hour,
        placeid,
        imsi,
        count(1) as cn
      from
      (
        select
          from_unixtime(uptime, 'HH') as hour,
          imsi,
          placeid
        from dwd_test05.dwd_imsirecord
        where day = '20210814'
          and placeid in (select placeid from dim_test05.dim_mdb_camera
                          where cameratype = 1
                          group by placeid)
      ) t
      group by hour, placeid, imsi
      having cn > 1000
      order by cn desc
    ),
    process as
    (
      select
        from_unixtime(uptime, 'HH') as hour
        ,placeid
        ,imsi
        ,uptime
      from dwd_test05.dwd_imsirecord
      where day = '20210814'
        and placeid in (select placeid from dim_test05.dim_mdb_camera
                        where cameratype = 1
                        group by placeid)
    )
    insert overwrite table dwd_test05.dwd_predeal_imsirecord_t1 partition(day='20210814', placeid)
    select
      CAST(y.imsi AS BIGINT) as imsi,
      substr(y.placeid, 22) as placeid,
      y.uptime as uptime
    from
    (
      select
        x.imsi,
        x.uptime,
        x.placeid,
        x.slice,
        row_number() over (partition by x.imsi, x.slice, x.placeid order by x.uptime desc) rank
      from
      (
        select
          t1.imsi
          ,t1.placeid
          ,t1.uptime
          ,floor(t1.uptime/360) as slice
        from process t1
        left join count_filter t2
          on (t1.hour = t2.hour and t1.placeid = t2.placeid and t1.imsi = t2.imsi)
        where t2.imsi is null
      ) x
      group by x.imsi, x.uptime, x.placeid, x.slice
      distribute by x.uptime
      sort by x.uptime
    ) y
    where y.rank = 1;
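    The query's core logic — anti-join away records belonging to noisy (hour, placeid, imsi) groups, then keep only the latest record per (imsi, 360-second slice, placeid) — can be sketched in pure Python (the sample records are illustrative):

    ```python
    import math

    def dedup(records, noisy_keys):
        # records: (imsi, placeid, hour, uptime) tuples.
        # 1. Anti-join: drop records whose (hour, placeid, imsi) group is noisy
        #    (the query's "left join ... where t2.imsi is null").
        kept = [r for r in records if (r[2], r[1], r[0]) not in noisy_keys]
        # 2. Keep the record with the largest uptime per (imsi, slice, placeid),
        #    where slice = floor(uptime / 360) buckets time into 6-minute windows
        #    (the query's row_number() ... where rank = 1).
        best = {}
        for imsi, placeid, hour, uptime in kept:
            key = (imsi, math.floor(uptime / 360), placeid)
            if key not in best or uptime > best[key][3]:
                best[key] = (imsi, placeid, hour, uptime)
        return sorted(best.values())

    recs = [(1, "p", "01", 100), (1, "p", "01", 200), (2, "p", "01", 50)]
    out = dedup(recs, {("01", "p", 2)})
    ```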

    Why can't this field be included as well?
    y.day as day,
    x.day,

    9 Can non-numeric strings be force-cast to int? How do you speed up comparisons by converting to int?