## 2.1 Data Cleaning
### 2.1.1 Handling Missing Values
- Inspecting missing values
```python
# Approach 1: use df.info()
df.info()
'''
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
'''
# Approach 2: count the missing values in each column
df.isnull().sum()
'''
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
'''
```
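Alongside the raw counts, the share of missing values per column is often easier to compare across columns. The following is a minimal sketch on a small made-up frame; the name `toy` and all of its values are assumptions for illustration, not the Titanic data shown above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), with a few gaps to inspect
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column

# Put both views side by side, worst column first
summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})
print(summary.sort_values("ratio", ascending=False))
```

A column with a very high ratio may be a candidate for dropping rather than filling, though that decision depends on the task.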
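- Filling missing values
```python
# Approach 1: fill with a specified value
df.fillna(0)
# Approach 2: forward fill or backward fill
# (fillna(method=...) is deprecated in recent pandas versions)
df.ffill()
df.bfill()
# Alternative: instead of filling, keep only the rows with no missing values.
# df[df.notnull()] keeps every row, so reduce the mask to one boolean per row;
# the result matches df.dropna().
df = df[df.notnull().all(axis=1)]
```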
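In practice, different columns often call for different fill values. The sketch below fills a numeric column with its median and a categorical column with its most frequent value, then drops rows still missing a chosen column. The frame `toy`, its values, and the choice of median and mode are assumptions for illustration, not something this section prescribes.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", None, "S", "C"],
    "Cabin": [None, "C85", None, None],
})

filled = toy.copy()
# Numeric column: fill with the median of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: fill with the most frequent observed value
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
# Drop the rows that are still missing a value in the chosen column
dropped = filled.dropna(subset=["Cabin"])
```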
### 2.1.2 Handling Duplicate Values
- Viewing duplicate rows
```python
df[df.duplicated()]
```
- Dropping duplicate rows
```python
df = df.drop_duplicates()
```
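By default both calls compare entire rows and keep the first occurrence. The sketch below shows the `subset` and `keep` parameters of `drop_duplicates` on a made-up frame; the name `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical): row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["A1", "B2", "A1", "A9"],
})

# Flag rows that repeat an earlier row in every column
full_dupes = toy.duplicated()
# Drop exact repeats, keeping the first occurrence
deduped = toy.drop_duplicates()
# Treat rows as duplicates when only "Name" matches, keeping the last occurrence
by_name = toy.drop_duplicates(subset=["Name"], keep="last")
```

Note that the original index labels are preserved; `reset_index(drop=True)` can renumber them if needed.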
## 2.2 Feature Processing
:::info
Features fall broadly into two categories:

- Numeric features: discrete numeric features and continuous numeric features
- Text features
:::
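- Discretizing (binning) numeric features
```python
# Split by value: pd.cut()
# With bins=n the range is cut into n equal-width intervals,
# so labels must contain exactly n entries.
pd.cut(df.col, bins=n, labels=[l1, l2, l3, l4, l5])
# Split by count: pd.qcut()
# q is either the number of quantiles or a list of quantile edges between 0 and 1.
pd.qcut(df.col, q=[percent1, percent2, percent3])
```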
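To make the difference between the two concrete, here is a minimal sketch on a made-up series. The values, bin edges, and label names are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical)
ages = pd.Series([1, 5, 12, 18, 30, 45, 60, 80])

# pd.cut with explicit edges: intervals are right-closed by default,
# e.g. (0, 12], (12, 18], (18, 60], (60, 100]
by_value = pd.cut(ages, bins=[0, 12, 18, 60, 100],
                  labels=["child", "teen", "adult", "senior"])

# pd.qcut with q=4: edges are taken from the quartiles of the data,
# so each bin receives a similar number of rows
by_count = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(by_value.value_counts(sort=False))
print(by_count.value_counts(sort=False))
```

Here `pd.cut` produces bins of uneven size (three children, one teen), while `pd.qcut` places two values in each bin.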
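- Processing text features
  - Encoding
```python
# Plain (label) encoding
## Approach 1: map
df['col1'].map({'v1': 1, 'v2': 2})
## Approach 2: replace
df['col1'].replace(['v1', 'v2'], [1, 2])
# One-hot encoding (feat is the string used as the column-name prefix)
pd.get_dummies(df['col1'], prefix=feat)
```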
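The two encodings can be sketched on a made-up frame as follows. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Label encoding with an explicit mapping;
# any value missing from the mapping would become NaN
toy["Sex_code"] = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: one indicator column per category,
# named <prefix>_<category>
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Attach the indicator columns to the original frame
encoded = pd.concat([toy, dummies], axis=1)
```

One difference worth keeping in mind: `map` turns unmapped values into NaN, whereas `replace` leaves them unchanged.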
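  - Text extraction
```python
# pattern is a regular expression with at least one capture group;
# with a single group, expand=False returns a Series
df['col1'].str.extract(pattern, expand=False)
```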
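As an illustration of `str.extract`, the sketch below pulls a title out of name strings written in the style of the `Name` column. The sample strings and the regular expression are assumptions for illustration.

```python
import pandas as pd

# Sample strings (hypothetical), formatted as "Surname, Title. Given names"
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
])

# Capture the run of letters between ", " and the following "."
titles = names.str.extract(r",\s*([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

Rows where the pattern does not match yield NaN, so it can be worth checking `titles.isnull().sum()` afterwards.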
