第七周：Python - 7.21 Python Pandas关联 - 《数据分析学习笔记》

df.merge()，基于字段列名连接
df.join()，基于索引连接
pd.concat()，相当于UNION，当字段名一致时会并入一列

案例中的2个表，operator和house，连接的字段为operator.id与house.operator_id
operator.head()
    id        email            create_time            update_time
0    10004    qd3@163.com        2018-02-23 15:01:46    2018-02-23 15:02:28
1    10005    alo@vip.qq.com    2018-01-10 20:56:46    2018-01-11 10:42:22
2    10008    qd3@163.com        2018-02-23 15:01:58    2018-02-23 15:02:38
3    10015    343148@qq.com    2018-01-11 10:46:22    2018-01-11 16:03:43
4    10029    123345@qq.com    2018-02-01 16:51:24    2018-02-01 16:51:24
house.head()
    ID    operator_id
0    32    10004
1    33    10004
2    34    10004
3    35    10004
4    36    10008

df.merge()，基于字段列名连接

#merge,按列连接,连接方式默认是inner
operator.merge(house,how='inner',left_on='id',right_on='operator_id')
    id        email                create_time            update_time            ID        operator_id
0    10004    qdcs1234@163.com    2018-02-23 15:01:46    2018-02-23 15:02:28    32        10004
1    10004    qdcs1233@163.com    2018-02-23 15:01:46    2018-02-23 15:02:28    33        10004
2    10004    qdcs123a@163.com    2018-02-23 15:01:46    2018-02-23 15:02:28    34        10004
3    10004    qdcs123@163.com        2018-02-23 15:01:46    2018-02-23 15:02:28    35        10004
4    10004    qdcs123@163.com        2018-02-23 15:01:46    2018-02-23 15:02:28    39        10004
...    ...    ...    ...    ...    ...    ...
2953    10265    asl@123.com        2019-05-20 14:47:35    2019-05-21 11:12:19    3079    10265
2954    10265    asl@123.com        2019-05-20 14:47:35    2019-05-21 11:12:19    3080    10265
2955    10265    asl@123.com        2019-05-20 14:47:35    2019-05-21 11:12:19    3081    10265
2956    10269    NaN                2019-05-31 15:15:10    2019-05-31 15:15:10    3032    10269
2957    10270    NaN                2019-06-03 11:39:52    2019-06-03 11:39:52    3073    10270
2958 rows × 6 columns

df.join()，基于索引连接

#join，按索引连接,连接方式默认是left，一般不推荐
operator.join(house)

    id        email                create_time            update_time            ID    operator_id
0    10004    qdcsl2013@163.com    2018-02-23 15:01:46    2018-02-23 15:02:28    32    10004
1    10005    a123@vip.qq.com        2018-01-10 20:56:46    2018-01-11 10:42:22    33    10004
2    10008    qdcsl2013@163.com    2018-02-23 15:01:58    2018-02-23 15:02:38    34    10004
3    10015    343123@qq.com        2018-01-11 10:46:22    2018-01-11 16:03:43    35    10004
4    10029    123345@qq.com        2018-02-01 16:51:24    2018-02-01 16:51:24    36    10008
...    ...    ...    ...    ...    ...    ...
229    10266    NaN                    2019-05-20 17:41:23    2019-05-28 16:11:14    266    10008
230    10267    NaN                    2019-05-20 17:55:52    2019-05-28 16:11:10    267    10008
231    10268    NaN                    2019-05-27 14:40:23    2019-05-28 16:11:05    268    10004
232    10269    NaN                    2019-05-31 15:15:10    2019-05-31 15:15:10    270    10004
233    10270    NaN                    2019-06-03 11:39:52    2019-06-03 11:39:52    271    10004
234 rows × 6 columns

pd.concat()，相当于UNION，当字段名一致时会并入一列

#concat,相当于MYSQL的UNION,方法属于pd级别,默认是上下连接，axis=1则是左右连接
pd.concat([operator,house],axis=0,sort=True)

    ID    create_time            email                id        operator_id    update_time
0    NaN    2018-02-23 15:01:46    qdc@163.com        10004.0    NaN    2018-02-23 15:02:28
1    NaN    2018-01-10 20:56:46    al@vip.qq.com    10005.0    NaN    2018-01-11 10:42:22
2    NaN    2018-02-23 15:01:58    qd13@163.com    10008.0    NaN    2018-02-23 15:02:38
3    NaN    2018-01-11 10:46:22    3448@qq.com        10015.0    NaN    2018-01-11 16:03:43
4    NaN    2018-02-01 16:51:24    123345@qq.com    10029.0    NaN    2018-02-01 16:51:24
...    ...    ...    ...    ...    ...    ...
2965    3077.0    NaN    NaN    NaN    10265.0    NaN
2966    3078.0    NaN    NaN    NaN    10265.0    NaN
2967    3079.0    NaN    NaN    NaN    10265.0    NaN
2968    3080.0    NaN    NaN    NaN    10265.0    NaN
2969    3081.0    NaN    NaN    NaN    10265.0    NaN
3204 rows × 6 columns

concat用于上下合并会较为方便，比如1、2、3多个月份的数据，字段相同，则可直接合并为”第一季度”的汇总表

#concat用于上下合并会较为方便，比如1、2、3多个月份的数据，字段相同，则可直接合并为"第一季度"的汇总表
#1月销售数据
df1=pd.DataFrame({
    '品类':list('abcde'),
    '销售额':[100,200,300,400,500]
})
#2月销售数据
df2=pd.DataFrame({
    '品类':list('abcde'),
    '销售额':[100,200,300,400,500]
})
#3月销售数据
df3=pd.DataFrame({
    '品类':list('qpwor'),
    '销售额':[100,200,300,400,500]
})

pd.concat([df1,df2,df3],sort=True)

    品类    销售额
0    a    100
1    b    200
2    c    300
3    d    400
4    e    500
0    a    100
1    b    200
2    c    300
3    d    400
4    e    500
0    q    100
1    p    200
2    w    300
3    o    400
4    r    500