数据集简介

数据包括如下10类信息:
Clothing ID: 服装标签,数字信息
Age: 购买者年龄,数字信息
Title: 评论标题,文字信息
Review Text: 评论内容,文字信息
Rating: 产品评价(评分由低到高用0-5表示),数字信息
Recommended IND:是否推荐(0表示不推荐,1表示推荐),数字信息
Positive Feedback Count: 正面反馈数量(认为该评价有用),数字信息
Division Name: 产品分类名称,文字信息
Department Name: 服饰类别,文字信息
Class Name: 服饰类型,文字信息
Division Name包括General、General Petite和Initmates。前两者指服饰的大小,General指正常大小,General Petite指为身材娇小的女性提供的服饰,Initmates则指私人贴身服饰。Department Name包括Tops(上装)、Bottoms(下装)等区分。Class Name则是更加细致的划分。

分析主题

1、女性顾客中哪个年龄段人群是网购主力军
2、不同年龄顾客对应哪类服装
3、各类服装评价分数如何?
4、各类服装推荐情况
4、产品评价分数与是否推荐相关吗?

读取数据

  1. #加载包
  2. import numpy as np
  3. import pandas as pd
  4. import matplotlib.pyplot as plt
  1. #利用pandas导入数据
  2. clothing_df = pd.read_csv(r"C:\Users\谭小洵\Desktop\women shopping datebase\Womens Clothing E-Commerce Reviews.csv", index_col=0)
  1. #查看数据开头
  2. clothing_df.head()
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name
0 767 33 NaN Absolutely wonderful - silky and sexy and comf… 4 1 0 Initmates Intimate Intimates
1 1080 34 NaN Love this dress! it’s sooo pretty. i happene… 5 1 4 General Dresses Dresses
2 1077 60 Some major design flaws I had such high hopes for this dress and reall… 3 0 0 General Dresses Dresses
3 1049 50 My favorite buy! I love, love, love this jumpsuit. it’s fun, fl… 5 1 0 General Petite Bottoms Pants
4 847 47 Flattering shirt This shirt is very flattering to all due to th… 5 1 6 General Tops Blouses

数据预处理

  1. #查看缺失值,打印表格每一列信息
  2. clothing_df.info()
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 23486 entries, 0 to 23485
  3. Data columns (total 10 columns):
  4. # Column Non-Null Count Dtype
  5. --- ------ -------------- -----
  6. 0 Clothing ID 23486 non-null int64
  7. 1 Age 23486 non-null int64
  8. 2 Title 19676 non-null object
  9. 3 Review Text 22641 non-null object
  10. 4 Rating 23486 non-null int64
  11. 5 Recommended IND 23486 non-null int64
  12. 6 Positive Feedback Count 23486 non-null int64
  13. 7 Division Name 23472 non-null object
  14. 8 Department Name 23472 non-null object
  15. 9 Class Name 23472 non-null object
  16. dtypes: int64(5), object(5)
  17. memory usage: 2.0+ MB
  1. #方法2查看缺失值
  2. clothing_df.isnull().sum()
  1. Clothing ID 0
  2. Age 0
  3. Title 3810
  4. Review Text 845
  5. Rating 0
  6. Recommended IND 0
  7. Positive Feedback Count 0
  8. Division Name 14
  9. Department Name 14
  10. Class Name 14
  11. dtype: int64
  1. #去掉丢失值对应的行,Division Nmae, Department Name, Class Name
  2. clothing_df.dropna(axis=0, subset=['Division Name', 'Department Name', 'Class Name'],inplace=True)
  1. pd.isnull(clothing_df).sum()
  1. Clothing ID 0
  2. Age 0
  3. Title 3809
  4. Review Text 844
  5. Rating 0
  6. Recommended IND 0
  7. Positive Feedback Count 0
  8. Division Name 0
  9. Department Name 0
  10. Class Name 0
  11. dtype: int64
  1. #将所有的缺失值替换为字符串NULL
  2. clothing_df.fillna('NULL',inplace=True)
  3. clothing_df.isnull().sum()
  1. Clothing ID 0
  2. Age 0
  3. Title 0
  4. Review Text 0
  5. Rating 0
  6. Recommended IND 0
  7. Positive Feedback Count 0
  8. Division Name 0
  9. Department Name 0
  10. Class Name 0
  11. dtype: int64
  1. clothing_df.describe()
Clothing ID Age Rating Recommended IND Positive Feedback Count
count 23472.000000 23472.000000 23472.000000 23472.000000 23472.000000
mean 918.486665 43.200707 4.195552 0.822256 2.537151
std 202.727678 12.280913 1.110188 0.382305 5.703597
min 0.000000 18.000000 1.000000 0.000000 0.000000
25% 861.000000 34.000000 4.000000 1.000000 0.000000
50% 936.000000 41.000000 5.000000 1.000000 1.000000
75% 1078.000000 52.000000 5.000000 1.000000 3.000000
max 1205.000000 99.000000 5.000000 1.000000 122.000000

在describe的结果中,年龄的均值和中位数41.00的差距不大,所以年龄分布基本符合正态分布;Rating的中位数是5.00,所以大部分消费者都给与5分(最高分)评价;Recommended IND的分位数显示该值大部分为1,所以给予推荐评价的消费者居多;Positive Feedback Count中最大值高达122,最小值为0,中位数为1,因此消费者通常不会去评价他人的反馈信息。

  1. #查看年龄分布
  2. clothing_df['Age'].plot(kind='hist',bins=10)
  1. <AxesSubplot:ylabel='Frequency'>

女性购物服装反馈分析 - 图1

  1. np.unique(clothing_df.Age[clothing_df.Age>=80],return_counts=True)
  1. (array([80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 99],
  2. dtype=int64),
  3. array([10, 5, 13, 43, 6, 6, 2, 4, 5, 2, 5, 1, 2, 3, 2],
  4. dtype=int64))

数据集中包含年龄数据。根据上面中年龄数据,虽然数据中出现超过80岁的年龄数据,但是年龄分布图呈现正态分布,因此假定所有的年龄为真实年龄。

分析

1、哪个年龄段顾客是主力军

  1. plt.rcParams['font.sans-serif'] = ['SimHei']#设置可以显示中文
  2. plt.rcParams['axes.unicode_minus'] = False #正常显示负号
  1. Age1 = pd.cut(clothing_df['Age'],bins=[0,18,28,38,48,58,100])
  2. age_counts=pd.value_counts(Age1)
  3. age_counts
  1. (28, 38] 6941
  2. (38, 48] 6902
  3. (48, 58] 4114
  4. (58, 100] 3099
  5. (18, 28] 2412
  6. (0, 18] 4
  7. Name: Age, dtype: int64
  1. plt.figure(figsize=(6,6),dpi=100)
  2. labels=['28-38岁','38-48岁','48-58岁','58-100岁','18-28岁','0-18岁']
  3. plt.pie(age_counts,autopct='%1.1f%%',labels=labels)
  4. plt.show
  1. <function matplotlib.pyplot.show(close=None, block=None)>

女性购物服装反馈分析 - 图2
28-48岁人群占比达59%,为网购主力军
2、不同年龄顾客对应哪类服装

  1. #根据Division Name划分
  2. clothing_df.query('0 < Age & Age <= 18').groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 4
  1. clothing_df.query('18 < Age & Age <= 28').groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 1396
General Petite 811
Initmates 205
  1. clothing_df.query('28 < Age & Age <= 38').groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 4031
General Petite 2369
Initmates 541
  1. clothing_df.query('38 < Age & Age <= 48').groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 4130
General Petite 2392
Initmates 380
  1. clothing_df.query('48 < Age & Age <= 58').groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 2416
General Petite 1488
Initmates 210
  1. clothing_df.query('58 < Age & Age <= 100').groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 1873
General Petite 1060
Initmates 166
  1. clothing_df.groupby('Division Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Division Name
General 13850
General Petite 8120
Initmates 1502

超过18岁购物者中,均是尺寸为正常大小,稍小销量更高,可以考虑进货时多考虑这2种,三者比例大致为9:5:1。

  1. #根据Department Name划分
  2. clothing_df.query('0 < Age & Age <= 18').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Dresses 2
Bottoms 1
Tops 1
  1. clothing_df.query('18 < Age & Age <= 28').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 995
Dresses 714
Bottoms 355
Intimate 231
Jackets 105
Trend 12
  1. clothing_df.query('18 < Age & Age <= 28').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 995
Dresses 714
Bottoms 355
Intimate 231
Jackets 105
Trend 12
  1. clothing_df.query('28 < Age & Age <= 38').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 2870
Dresses 1978
Bottoms 1139
Intimate 609
Jackets 314
Trend 31
  1. clothing_df.query('38 < Age & Age <= 48').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 3073
Dresses 1897
Bottoms 1184
Intimate 457
Jackets 257
Trend 34
  1. clothing_df.query('38 < Age & Age <= 48').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 3073
Dresses 1897
Bottoms 1184
Intimate 457
Jackets 257
Trend 34
  1. clothing_df.query('48 < Age & Age <= 58').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 1945
Dresses 1048
Bottoms 658
Intimate 251
Jackets 184
Trend 28
  1. clothing_df.query('58 < Age & Age <= 100').groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 1584
Dresses 680
Bottoms 462
Intimate 187
Jackets 172
Trend 14
  1. clothing_df.groupby('Department Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Department Name
Tops 10468
Dresses 6319
Bottoms 3799
Intimate 1735
Jackets 1032
Trend 119

各个年龄段均已上衣,连衣裙,下装,贴身服饰为主。

  1. #根据Class Name划分
  2. clothing_df.query('0 < Age & Age <= 18').groupby('Class Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Class Name
Dresses 2
Knits 1
Skirts 1
  1. clothing_df.query('18 < Age & Age <= 28').groupby('Class Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Class Name
Dresses 714
Knits 478
Blouses 305
Sweaters 121
Pants 119
Skirts 103
Jeans 92
Fine gauge 91
Lounge 75
Jackets 73
Swim 57
Shorts 39
Intimates 35
Outerwear 32
Sleep 27
Legwear 20
Layering 17
Trend 12
Casual bottoms 2
  1. clothing_df.query('28 < Age & Age <= 38').groupby('Class Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Class Name
Dresses 1978
Knits 1417
Blouses 808
Pants 389
Sweaters 375
Jeans 337
Skirts 296
Fine gauge 270
Lounge 222
Jackets 219
Swim 160
Shorts 117
Outerwear 95
Sleep 62
Legwear 62
Layering 52
Intimates 50
Trend 31
Chemises 1
  1. clothing_df.query('38 < Age & Age <= 48').groupby('Class Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Class Name
Dresses 1897
Knits 1393
Blouses 903
Pants 434
Sweaters 426
Jeans 390
Fine gauge 351
Skirts 272
Lounge 189
Jackets 179
Shorts 88
Swim 85
Outerwear 78
Sleep 70
Legwear 41
Layering 37
Intimates 35
Trend 34
  1. clothing_df.query('48 < Age & Age <= 58').groupby('Class Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Class Name
Dresses 1048
Knits 843
Blouses 605
Sweaters 271
Pants 249
Fine gauge 226
Jeans 198
Skirts 161
Lounge 125
Jackets 116
Outerwear 68
Shorts 50
Sleep 35
Swim 30
Trend 28
Layering 25
Legwear 19
Intimates 17
  1. clothing_df.query('58 < Age & Age <= 100').groupby('Class Name').agg( 销量 = ('Clothing ID','count')).sort_values(by='销量',ascending = False)
销量
Class Name
Knits 711
Dresses 680
Blouses 476
Sweaters 235
Pants 197
Fine gauge 162
Jeans 130
Jackets 117
Skirts 112
Lounge 80
Outerwear 55
Sleep 34
Legwear 23
Shorts 23
Swim 18
Intimates 17
Layering 15
Trend 14

各个年龄段位于前三位销量均为连衣裙、针织衫、衬衫。
3、各类服装评价分数如何?

  1. #根据Division Name划分
  2. clothing_df.groupby('Division Name').agg( 评价分数 = ('Rating','mean')).sort_values(by='评价分数',ascending = False)
评价分数
Division Name
Initmates 4.286285
General Petite 4.211084
General 4.176606
  1. #根据Department Name划分
  2. clothing_df.groupby('Department Name').agg( 评价分数 = ('Rating','mean')).sort_values(by='评价分数',ascending = False)
评价分数
Department Name
Bottoms 4.288760
Intimate 4.280115
Jackets 4.264535
Tops 4.172239
Dresses 4.150815
Trend 3.815126
  1. #根据Class Name划分
  2. clothing_df.groupby('Class Name').agg( 评价分数 = ('Rating','mean')).sort_values(by='评价分数',ascending = False)
评价分数
Class Name
Casual bottoms 4.500000
Layering 4.376712
Jeans 4.360942
Lounge 4.301013
Jackets 4.295455
Sleep 4.285088
Intimates 4.279221
Legwear 4.278788
Pants 4.265850
Fine gauge 4.260909
Shorts 4.255521
Skirts 4.245503
Outerwear 4.198171
Swim 4.197143
Sweaters 4.179272
Knits 4.161677
Blouses 4.154020
Dresses 4.150815
Chemises 4.000000
Trend 3.815126

评分分为1-5分,整体上均有4分,可以说产品整体而言均不错。
4、各类服装推荐情况

  1. #根据Division Name划分
  2. clothing_df.groupby('Division Name').agg( 推荐分数 = ('Recommended IND','mean')).sort_values(by='推荐分数',ascending = False)
推荐分数
Division Name
Initmates 0.852197
General Petite 0.825985
General 0.816823
  1. #根据Department Name划分
  2. clothing_df.groupby('Department Name').agg( 推荐分数 = ('Recommended IND','mean')).sort_values(by='推荐分数',ascending = False)
推荐分数
Department Name
Bottoms 0.851277
Intimate 0.850144
Jackets 0.836240
Tops 0.815151
Dresses 0.808197
Trend 0.739496
  1. #根据Class Name划分
  2. clothing_df.groupby('Class Name').agg( 推荐分数 = ('Recommended IND','mean')).sort_values(by='推荐分数',ascending = False)
推荐分数
Class Name
Chemises 1.000000
Casual bottoms 1.000000
Layering 0.883562
Jeans 0.881430
Legwear 0.860606
Lounge 0.859624
Intimates 0.857143
Sleep 0.855263
Skirts 0.845503
Jackets 0.845170
Shorts 0.839117
Fine gauge 0.837273
Pants 0.832853
Knits 0.817675
Outerwear 0.817073
Blouses 0.810139
Dresses 0.808197
Swim 0.805714
Sweaters 0.800420
Trend 0.739496
  1. clothing_df.groupby('Class Name').agg( 销量合计 = ('Clothing ID','sum')).sort_values(by='销量合计',ascending = False)
销量合计
Class Name
Dresses 6871680
Knits 4200953
Blouses 2576796
Pants 1441937
Sweaters 1334383
Jeans 1179731
Fine gauge 991525
Skirts 950642
Jackets 686557
Outerwear 364718
Lounge 341630
Swim 145350
Shorts 137723
Trend 105554
Sleep 71607
Layering 62661
Intimates 59370
Legwear 35793
Casual bottoms 99
Chemises 10

各类服装评分整体约为0.8,也就是说每卖出一件商品大约有80%的概率会认为它值得推荐,产品选品还是不错的。
5、相关性矩阵

  1. df1 = clothing_df.copy()
  2. df1.head()
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name
0 767 33 NULL Absolutely wonderful - silky and sexy and comf… 4 1 0 Initmates Intimate Intimates
1 1080 34 NULL Love this dress! it’s sooo pretty. i happene… 5 1 4 General Dresses Dresses
2 1077 60 Some major design flaws I had such high hopes for this dress and reall… 3 0 0 General Dresses Dresses
3 1049 50 My favorite buy! I love, love, love this jumpsuit. it’s fun, fl… 5 1 0 General Petite Bottoms Pants
4 847 47 Flattering shirt This shirt is very flattering to all due to th… 5 1 6 General Tops Blouses
  1. df2 = df1.corr(method='pearson')
  2. df2
Clothing ID Age Rating Recommended IND Positive Feedback Count
Clothing ID 1.000000 0.017322 -0.017626 -0.015414 0.045875
Age 0.017322 1.000000 0.026967 0.030712 0.043049
Rating -0.017626 0.026967 1.000000 0.792311 -0.064820
Recommended IND -0.015414 0.030712 0.792311 1.000000 -0.068954
Positive Feedback Count 0.045875 0.043049 -0.064820 -0.068954 1.000000
  1. import matplotlib as mpl
  2. import seaborn as sns
  1. plt.figure(figsize=(10,6),dpi=100)
  2. sns.heatmap(df2,xticklabels=df2.columns,yticklabels=df2.columns,cmap='RdBu',center=-1,annot=True)
  3. plt.title('购物反馈相关性矩阵',fontsize=22)
  4. plt.xticks(fontsize=12, rotation=45, horizontalalignment='right')
  5. plt.yticks(fontsize=12)
  1. (array([0.5, 1.5, 2.5, 3.5, 4.5]),
  2. [Text(0, 0.5, 'Clothing ID'),
  3. Text(0, 1.5, 'Age'),
  4. Text(0, 2.5, 'Rating'),
  5. Text(0, 3.5, 'Recommended IND'),
  6. Text(0, 4.5, 'Positive Feedback Count')])

女性购物服装反馈分析 - 图3
图中可以看出Rating(产品评价)和Recommended IND(是否推荐)相关性较高,可多加关注这两个指标,有利于自家产品通过消费者自发传播。

总结

  1. 1、女性网购主要集中在28-48岁。
  2. 2、上衣,连衣裙,下装,贴身服饰可以多进货,三者比例可根据过往做个大致推断,尺寸则以正常大小及稍小为主,连衣裙、针织衫、衬衫购买者较多。
  3. 3、产品选品整体不错。
  4. 4、产品评价和是否推荐相关性较高。