## 2.1 Data Cleaning

### 2.1.1 Handling Missing Values

- Inspecting missing values
```python
# Assumes `df` is an existing pandas DataFrame.

# Option 1: df.info() lists the non-null count of every column
df.info()
'''
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
'''

# Option 2: count the missing values in each column
df.isnull().sum()
'''
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
'''
```
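To make the two inspection approaches concrete, here is a minimal sketch on a small hand-made DataFrame. The name `toy` and its columns and values are illustrative assumptions, not the dataset shown above. It also computes the missing ratio per column, which is often easier to compare across columns than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()

# Fraction of missing values per column (mean of the boolean mask).
missing_ratio = toy.isnull().mean()

print(missing_count)
print(missing_ratio)
```

A column with a very high missing ratio may be a candidate for dropping rather than filling, although that decision depends on the task.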

- Filling missing values
```python
# Option 1: fill with a specified value
df.fillna(0)
# Option 2: forward or backward fill
# (fillna(method=...) is deprecated since pandas 2.1; use ffill() / bfill())
df.ffill()
df.bfill()
```
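The sketch below compares the filling strategies on a single toy column so the differences are visible. The data and the variable names are assumptions for illustration. The per-column mean fill is an extra variation of "fill with a specified value", not something prescribed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with gaps at the start, middle, and end.
toy = pd.DataFrame({"Age": [np.nan, 22.0, np.nan, 30.0, np.nan]})

# Constant fill: every NaN becomes 0.
filled_const = toy.fillna(0)

# Per-column fill value passed as a dict (here: the column mean, 26.0).
filled_mean = toy.fillna({"Age": toy["Age"].mean()})

# Forward fill: copy the previous valid value; a leading NaN stays NaN.
filled_ffill = toy.ffill()

# Backward fill: copy the next valid value; a trailing NaN stays NaN.
filled_bfill = toy.bfill()

print(pd.concat(
    [filled_const, filled_mean, filled_ffill, filled_bfill],
    axis=1,
    keys=["const", "mean", "ffill", "bfill"],
))
```

Note that none of these calls modifies `toy` in place; each returns a new DataFrame.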
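- Dropping missing values
```python
# Option 1: filter with df.dropna
df.dropna()
# Option 2: keep only the rows in which every column is non-null
# (df[df.notnull()] alone masks cells to NaN and does not remove any rows)
df = df[df.notnull().all(axis=1)]
```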
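As a small check of the row-filtering behaviour, the following sketch runs on hypothetical toy data (names and values are assumptions). It shows that a row-wise boolean mask gives the same result as `dropna()`, and that the `subset` parameter of `dropna` restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", None, None],
})

# Drop every row that contains at least one missing value.
dropped = toy.dropna()

# Equivalent boolean-mask form: keep rows where all columns are non-null.
masked = toy[toy.notnull().all(axis=1)]

# Only require Age to be present; missing Cabin values are tolerated.
subset = toy.dropna(subset=["Age"])

print(dropped.index.tolist(), subset.index.tolist())
```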

<a name="jyUEv"></a>
### 2.1.2 Handling Duplicate Values
- Viewing duplicate rows
```python
df[df.duplicated()]
```
- Dropping duplicate rows
```python
df = df.drop_duplicates()
```
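The following sketch illustrates how duplicates are detected and dropped, using hypothetical toy data (all names and values are assumptions). By default a row counts as a duplicate only if every column matches an earlier row; the `subset` and `keep` parameters of `drop_duplicates` change which columns are compared and which occurrence survives.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly; row 3 shares only the name.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["A1", "B2", "A1", "C3"],
})

# True for rows that repeat an earlier row in all columns.
dup_mask = toy.duplicated()
dup_rows = toy[dup_mask]

# Remove full-row duplicates, keeping the first occurrence.
deduped = toy.drop_duplicates()

# Compare on Name only, keeping the first row for each name.
deduped_by_name = toy.drop_duplicates(subset=["Name"], keep="first")

print(dup_rows)
print(deduped_by_name)
```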

## 2.2 Feature Processing

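:::info
Features fall roughly into two broad categories:

- Numerical features: discrete numerical features and continuous numerical features
- Text features
:::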
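One possible way to separate the two categories in practice is to split columns by dtype. This is only a sketch on hypothetical toy data: the column names are assumptions, and a numeric dtype does not by itself tell a discrete feature from a continuous one.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],                # discrete numerical feature
    "Fare": [71.28, 7.25, 13.0],        # continuous numerical feature
    "Sex": ["female", "male", "male"],  # text feature
})

# Columns with a numeric dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as text here.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, text_cols)
```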

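- Discretizing (binning) numerical features
```python
import pandas as pd

# Split by value with pd.cut(): n is the number of bins (or a list of bin edges),
# and labels must contain one entry per bin (five here, so n = 5)
pd.cut(df['col'], bins=n, labels=[l1, l2, l3, l4, l5])

# Split by count with pd.qcut(): q lists the quantile edges (values between 0 and 1),
# so each bin holds roughly the same number of rows
pd.qcut(df['col'], q=[percent1, percent2, percent3])
```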
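To show the difference between value-based and count-based binning, here is a minimal sketch with concrete numbers. The series, the bin edges, and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages for illustration only.
ages = pd.Series([2, 15, 22, 35, 48, 61, 70, 80], name="Age")

# Value-based bins with explicit edges: (0, 18], (18, 40], (40, 100].
by_value = pd.cut(ages, bins=[0, 18, 40, 100],
                  labels=["child", "adult", "senior"])

# Count-based bins: quartile edges, so each bin gets the same number of rows here.
by_count = pd.qcut(ages, q=[0, 0.25, 0.5, 0.75, 1],
                   labels=["q1", "q2", "q3", "q4"])

print(by_value.value_counts(sort=False))
print(by_count.value_counts(sort=False))
```

With `pd.cut` the bin sizes follow the data distribution (two, two, and four rows here), while `pd.qcut` moves the edges so that the counts stay balanced.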

- Processing text features
  - Encoding
    ```python
    # Plain (label) encoding
    ## Option 1: map
    df['col1'].map({'v1': 1, 'v2': 2})
    ## Option 2: replace
    df['col1'].replace(['v1', 'v2'], [1, 2])
    # One-hot encoding
    pd.get_dummies(df['col1'], prefix='col1')
    ```
  - Text extraction
    ```python
    # pattern is a regular expression containing at least one capture group
    df['col1'].str.extract(pattern, expand=False)
    ```
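The encoding and extraction calls above can be tried end to end on a small example. The sketch below uses hypothetical toy data, and the regular expression is an assumption written for these sample strings only; it is not a general-purpose pattern.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
})

# Label encoding with map; values missing from the dict would become NaN.
sex_code = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encoding; prefix yields the columns Embarked_C and Embarked_S.
embarked_onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Regex extraction: capture the word between the comma and the first period.
title = toy["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)

print(sex_code.tolist())
print(embarked_onehot.columns.tolist())
print(title.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame, which is convenient for assigning the result to a new column.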